planner support functions: handle GROUP BY estimates ?

Started by Justin Pryzby about 6 years ago; 15 messages
#1 Justin Pryzby
pryzby@telsasoft.com

Tom implemented "Planner support functions":
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a391ff3c3d418e404a2c6e4ff0865a107752827b
https://www.postgresql.org/docs/12/xfunc-optimization.html

I wondered whether there was any consideration to extend that to allow
providing improved estimates for GROUP BY. Currently that requires manually
creating an expression index, and only works if the function is IMMUTABLE
(which is not true for e.g. date_trunc of timestamptz).
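
For illustration (a sketch not in the original message; this is the error
PostgreSQL raises for a non-immutable index expression):

ts=# CREATE INDEX ON child.alu_amms_201911 (date_trunc('day', start_time::timestamptz));
ERROR:  functions in index expression must be marked IMMUTABLE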

ts=# explain analyze SELECT date_trunc('day', start_time) FROM child.alu_amms_201911 GROUP BY 1;
HashAggregate (cost=87.34..98.45 rows=889 width=8) (actual time=1.476..1.482 rows=19 loops=1)

ts=# explain analyze SELECT date_trunc('year', start_time) FROM child.alu_amms_201911 GROUP BY 1;
HashAggregate (cost=87.34..98.45 rows=889 width=8) (actual time=1.499..1.500 rows=1 loops=1)

ts=# CREATE INDEX ON child.alu_amms_201911 (date_trunc('year',start_time));
ts=# ANALYZE child.alu_amms_201911;
ts=# explain analyze SELECT date_trunc('year', start_time) FROM child.alu_amms_201911 GROUP BY 1;
HashAggregate (cost=87.34..87.35 rows=1 width=8) (actual time=1.414..1.414 rows=1 loops=1)

#2 Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#1)
1 attachment(s)
Re: planner support functions: handle GROUP BY estimates ?

On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:

Tom implemented "Planner support functions":
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a391ff3c3d418e404a2c6e4ff0865a107752827b
https://www.postgresql.org/docs/12/xfunc-optimization.html

I wondered whether there was any consideration to extend that to allow
providing improved estimates for GROUP BY. Currently that requires manually
creating an expression index, and only works if the function is IMMUTABLE
(which is not true for e.g. date_trunc of timestamptz).

I didn't hear back, so I tried implementing this for date_trunc(). Currently
the planner assumes that a function outputs as many groups as its input
variable has distinct values (see estimate_num_groups() in selfuncs.c, which
the patch modifies). Most invocations of our reports use date_trunc() (or
similar), so my earlier attempt to alert on rowcount misestimates was
short-lived.

I currently assume that the input data has 1-second granularity:
|postgres=# CREATE TABLE t(i) AS SELECT date_trunc('second',a)a FROM generate_series(now(), now()+'7 day'::interval, '1 seconds')a; ANALYZE t;
|postgres=# explain analyze SELECT date_trunc('hour',i) i FROM t GROUP BY 1;
| Group (cost=9021.85..9042.13 rows=169 width=8) (actual time=1365.934..1366.453 rows=169 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('minute',i) i FROM t GROUP BY 1;
| Finalize HashAggregate (cost=10172.79..10298.81 rows=10081 width=8) (actual time=1406.057..1413.413 rows=10081 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('day',i) i FROM t GROUP BY 1;
| Group (cost=9013.71..9014.67 rows=8 width=8) (actual time=1582.998..1583.030 rows=8 loops=1)

If the input timestamps have (say) hourly granularity, rowcount will be
*underestimated* by 3600x, which is worse than the behavior in master of
overestimating by (for "day") 24x.

I'm trying to think of ways to address that:

0) Add a fudge factor of 4x or maybe 30x;

1) Avoid applying a corrective factor for seconds or minutes that would make
the rowcount less than (say) 2 or 100. That would still apply the /24, but
might then skip the final /60 or /60/60. Ultimately, that's more "fudge" than
anything else;

2) Leave pg_catalog.date_trunc() alone, but provide "template" support
functions like timestamp_support_10pow1, 10pow2, 10pow3, etc., each baking in
the given corrective factor, which should allow more accurate rowcounts for
input data whose granularity is the given number of seconds.

Ideally, that would be a user-specified factor, but I don't think that's
possible to specify in SQL; the constant has to be built into the C function.
At telsasoft, our data mostly has 15-minute granularity (900 sec), so we'd
maybe make a "date_trunc" wrapper function in the user schema which calls
pg_catalog.date_trunc and is declared with support function
timestamp_support_10pow3;

There could be a "base" support function that accepts a multiplier argument,
so that any user-provided C extension would be a one-liner specifying an
arbitrary value (see the sketch after this list);

3) Maybe there are better functions than date_trunc() to address;

4) Leave it as a patch in the archives for people to borrow from;
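
For concreteness, a minimal sketch of such a one-liner extension (entirely
hypothetical: it assumes the SupportRequestGroupBy node from the attached
patch plus a "base" function date_trunc_support_scaled() that doesn't exist
yet, and the 900-second figure is just our local granularity):

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

/*
 * Hypothetical "base" support function: computes the date_trunc correction
 * as in the attached patch, then rescales it for input data whose
 * granularity is 'seconds' seconds rather than the assumed 1 second.
 */
extern Datum date_trunc_support_scaled(FunctionCallInfo fcinfo, double seconds);

PG_FUNCTION_INFO_V1(timestamp_support_900sec);

Datum
timestamp_support_900sec(PG_FUNCTION_ARGS)
{
	/* Our data arrives in 15-minute (900-second) buckets */
	return date_trunc_support_scaled(fcinfo, 900);
}

The user-schema date_trunc wrapper would then be declared with
SUPPORT timestamp_support_900sec (CREATE FUNCTION's SUPPORT clause exists
since v12).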

Justin

Attachments:

v1-0001-Planner-support-functions-for-GROUP-BY-f.patch (text/x-diff; charset=us-ascii)
From 94f7791a7f82dea8757ef139befbf26feb970685 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Sun, 15 Dec 2019 20:27:24 -0600
Subject: [PATCH v1] Planner support functions for GROUP BY f()..

..implemented for date_trunc()

See also a391ff3c3d418e404a2c6e4ff0865a107752827b
---
 src/backend/optimizer/util/plancat.c | 44 ++++++++++++++++
 src/backend/utils/adt/selfuncs.c     | 28 +++++++++++
 src/backend/utils/adt/timestamp.c    | 97 ++++++++++++++++++++++++++++++++++++
 src/include/catalog/catversion.h     |  2 +-
 src/include/catalog/pg_proc.dat      | 15 ++++--
 src/include/nodes/nodes.h            |  3 +-
 src/include/nodes/supportnodes.h     | 15 ++++++
 src/include/optimizer/plancat.h      |  2 +
 8 files changed, 201 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 5e889d1..9438b3c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -2008,6 +2008,50 @@ get_function_rows(PlannerInfo *root, Oid funcid, Node *node)
 	return result;
 }
 
+/* Return a multiplier [0..1] for the effect on rowcount of GROUP BY f(x) */
+double
+get_function_groupby(PlannerInfo *root, Oid funcid, Node *node, Node *var)
+{
+	HeapTuple	proctup;
+	Form_pg_proc procform;
+
+	proctup = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcid));
+	if (!HeapTupleIsValid(proctup))
+		elog(ERROR, "cache lookup failed for function %u", funcid);
+	procform = (Form_pg_proc) GETSTRUCT(proctup);
+
+	if (OidIsValid(procform->prosupport))
+	{
+		SupportRequestGroupBy req;
+		SupportRequestGroupBy *sresult;
+
+		req.type = T_SupportRequestGroupBy;
+		req.root = root;
+		req.funcid = funcid;
+		req.node = node;
+		req.var = var;
+		req.factor = 1;			/* just for sanity */
+
+		sresult = (SupportRequestGroupBy *)
+			DatumGetPointer(OidFunctionCall1(procform->prosupport,
+											 PointerGetDatum(&req)));
+
+		if (sresult == &req)
+		{
+			/* Success */
+			ReleaseSysCache(proctup);
+			return req.factor;
+		}
+	}
+
+	/* XXX No support function, or it failed */
+
+	ReleaseSysCache(proctup);
+
+	return 1;
+}
+
+
 /*
  * has_unique_index
  *
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index ff02b5a..eb0b86f 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3154,10 +3154,38 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 		 */
 		foreach(l2, varshere)
 		{
+			double ret;
 			Node	   *var = (Node *) lfirst(l2);
 
 			examine_variable(root, var, 0, &vardata);
 			varinfos = add_unique_group_var(root, varinfos, var, &vardata);
+
+			/* If we group by a function of a simple var, try to call its support function to help estimate GROUP BY */
+			if (HeapTupleIsValid(vardata.statsTuple) &&
+					IsA(groupexpr, FuncExpr) && IsA(var, Var))
+					// && (varRelid == 0 || varRelid == ((Var *) basenode)->varno))
+			{
+				Form_pg_statistic stats = (Form_pg_statistic)GETSTRUCT(vardata.statsTuple);
+				FuncExpr   *expr = (FuncExpr *) groupexpr;
+
+				Var *v = (Var*) var;
+				RangeTblEntry *rte = root->simple_rte_array[v->varno];
+				char *reln = get_rel_name(rte->relid);
+				char *coln = get_attname(rte->relid, v->varattno, true);
+				ret = get_function_groupby(root, expr->funcid, groupexpr, var);
+
+				fprintf(stderr, "HERE %s %d %s.%s ndistinct=%f ret=%f\n", __FUNCTION__, __LINE__,
+						reln?reln:"null",
+						coln?coln:"null",
+						stats->stadistinct, ret);
+
+				Assert(ret>=0);
+				Assert(ret<=1);
+				numdistinct *= ret;
+
+				/* Can we do anything special with stats->stadistinct that's not already done in general? */
+			}
+
 			ReleaseVariableStats(vardata);
 		}
 	}
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 945b8f8..359ad32 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -5554,3 +5554,100 @@ generate_series_timestamptz(PG_FUNCTION_ARGS)
 		SRF_RETURN_DONE(funcctx);
 	}
 }
+
+
+/*
+ * Planner support function for date_trunc
+ * Try to estimate the factor by which to correct the estimate of ngroups for GROUP BY.
+ */
+Datum
+date_trunc_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq;
+	SupportRequestGroupBy *req;
+	List	   *args;
+	Node	   *arg1, *arg2;
+	Node		 	*var;
+	char		*start;
+	int			typmod;
+
+	rawreq = (Node *) PG_GETARG_POINTER(0);
+	if (!IsA(rawreq, SupportRequestGroupBy))
+		PG_RETURN_POINTER(NULL);
+
+	req = (SupportRequestGroupBy *) rawreq;
+	if (!is_funcclause(req->node))	/* be paranoid */
+		PG_RETURN_POINTER(NULL);
+
+	args = ((FuncExpr *) req->node)->args;
+	arg1 = linitial(args);
+	arg2 = lsecond(args);
+	/* arg3 may be the timezone */
+
+	var = req->var;
+
+	/* XXX Assumes the input has 1-second granularity */
+
+	// XXX: only work on const?
+	start = TextDatumGetCString(((Const*)arg1)->constvalue);
+
+	// XXX: not working due to promotion ?
+	// exprType(var) is still not right, since date_trunc(a, b::date) uses b's type and not date..
+	// ...but date_trunc('', x::date) is weird
+	// exprType(arg2)==TIMESTAMPOID || exprType(arg2)==TIMESTAMPTZOID)
+	// if (req->funcid==1217 || req->funcid==1284 || req->funcid==2020)
+	/* Reset if it's not a timestamp */
+	if (exprType(var)==DATEOID) {
+		req->factor = 60*60*24;
+	} else if (exprType(var)==TIMESTAMPOID || exprType(var)==TIMESTAMPTZOID) {
+		int typmod = exprTypmod(arg2); // XXX: vartypmod ?
+		/* If the input has decimal digits, the grouping effect is stronger */
+		if (typmod != -1) {
+			req->factor /= 2<<typmod;
+			if (strcmp(start, "microseconds")==0) {
+				/* do nothing? */
+			} else if (strcmp(start, "milliseconds")==0) {
+				/* do nothing? */
+			}
+		}
+
+		if (strcmp(start, "second")==0) {
+			/* do nothing */
+		} else if (strcmp(start, "minute")==0) {
+			req->factor /= 60;
+		} else if (strcmp(start, "hour")==0) {
+			req->factor /= 60*60;
+		}
+	}
+
+	// else { elog(ERROR, "unknown type %u", exprType(var)); }
+
+	if (strcmp(start, "day")==0) {
+		req->factor /= 60*60*24;
+	} else if (strcmp(start, "week")==0) {
+		req->factor /= 60*60*24*7;
+	} else if (strcmp(start, "month")==0) {
+		/* 30 days */
+		req->factor /= 60*60*24*30;
+	} else if (strcmp(start, "quarter")==0) {
+		req->factor /= 60*60*24*30*3;
+	} else if (strcmp(start, "year")==0) {
+		req->factor /= 60*60*24*365.24;
+	} else if (strcmp(start, "decade")==0) {
+		req->factor /= 60*60*24*365.25*10;
+	} else if (strcmp(start, "century")==0) {
+		req->factor /= 60*60*24*365.25*100;
+	} else if (strcmp(start, "millennium")==0) {
+		req->factor /= 60*60*24*365.25*1000;
+	} else if (req->factor > 1) {
+		/* Maybe a DATE with finer granularity trunc */
+		req->factor = 1;
+	}
+	// else { elog(ERROR, "", ); }
+
+	/* Fudge Factor, since the input may be already "grouped", say at multiples of 15min, */
+	/* or otherwise have coarse granularity to begin with */
+	// factor/=4;
+
+	PG_RETURN_POINTER(req);
+}
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index eca67a1..e2c05be 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	201911242
+#define CATALOG_VERSION_NO	201911243
 
 #endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index aae50d6..7a8c3d1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -2350,11 +2350,14 @@
 { oid => '1217',
   descr => 'truncate timestamp with time zone to specified units',
   proname => 'date_trunc', provolatile => 's', prorettype => 'timestamptz',
-  proargtypes => 'text timestamptz', prosrc => 'timestamptz_trunc' },
+  proargtypes => 'text timestamptz', prosrc => 'timestamptz_trunc',
+  prosupport => 'date_trunc_support', },
 { oid => '1284',
   descr => 'truncate timestamp with time zone to specified units in specified time zone',
   proname => 'date_trunc', provolatile => 's', prorettype => 'timestamptz',
-  proargtypes => 'text timestamptz text', prosrc => 'timestamptz_trunc_zone' },
+  proargtypes => 'text timestamptz text', prosrc => 'timestamptz_trunc_zone',
+  prosupport => 'date_trunc_support', },
+# XXX:
 { oid => '1218', descr => 'truncate interval to specified units',
   proname => 'date_trunc', prorettype => 'interval',
   proargtypes => 'text interval', prosrc => 'interval_trunc' },
@@ -5632,7 +5635,13 @@
   proargtypes => 'timestamptz', prosrc => 'timestamptz_time' },
 { oid => '2020', descr => 'truncate timestamp to specified units',
   proname => 'date_trunc', prorettype => 'timestamp',
-  proargtypes => 'text timestamp', prosrc => 'timestamp_trunc' },
+  proargtypes => 'text timestamp', prosrc => 'timestamp_trunc',
+  prosupport => 'date_trunc_support', },
+
+{ oid => '5449', descr => 'planner support for date_trunc',
+  proname => 'date_trunc_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'date_trunc_support', },
+
 { oid => '2021', descr => 'extract field from timestamp',
   proname => 'date_part', prorettype => 'float8',
   proargtypes => 'text timestamp', prosrc => 'timestamp_part' },
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index bce2d59..499fed1 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -513,7 +513,8 @@ typedef enum NodeTag
 	T_SupportRequestSelectivity,	/* in nodes/supportnodes.h */
 	T_SupportRequestCost,		/* in nodes/supportnodes.h */
 	T_SupportRequestRows,		/* in nodes/supportnodes.h */
-	T_SupportRequestIndexCondition	/* in nodes/supportnodes.h */
+	T_SupportRequestIndexCondition,	/* in nodes/supportnodes.h */
+	T_SupportRequestGroupBy,	/* in nodes/supportnodes.h */
 } NodeTag;
 
 /*
diff --git a/src/include/nodes/supportnodes.h b/src/include/nodes/supportnodes.h
index 460d75b..83f14a3 100644
--- a/src/include/nodes/supportnodes.h
+++ b/src/include/nodes/supportnodes.h
@@ -168,6 +168,21 @@ typedef struct SupportRequestRows
 	double		rows;			/* number of rows expected to be returned */
 } SupportRequestRows;
 
+/* How many fewer rows are output after GROUPing BY a function */
+typedef struct SupportRequestGroupBy
+{
+	NodeTag		type;
+
+	/* Input fields: */
+	struct PlannerInfo *root;	/* Planner's infrastructure (could be NULL) */
+	Oid			funcid;			/* function we are inquiring about */
+	Node			*var;			/* original (2nd) argument */
+	Node	   *node;			/* parse node invoking function */
+
+	/* Output fields: */
+	double		factor;			/* fraction of rows: 0..1 */
+} SupportRequestGroupBy;
+
 /*
  * The IndexCondition request allows the support function to generate
  * a directly-indexable condition based on a target function call that is
diff --git a/src/include/optimizer/plancat.h b/src/include/optimizer/plancat.h
index bbb27f8..cbe5386 100644
--- a/src/include/optimizer/plancat.h
+++ b/src/include/optimizer/plancat.h
@@ -70,6 +70,8 @@ extern void add_function_cost(PlannerInfo *root, Oid funcid, Node *node,
 
 extern double get_function_rows(PlannerInfo *root, Oid funcid, Node *node);
 
+extern double get_function_groupby(PlannerInfo *root, Oid funcid, Node *node, Node *var);
+
 extern bool has_row_triggers(PlannerInfo *root, Index rti, CmdType event);
 
 extern bool has_stored_generated_columns(PlannerInfo *root, Index rti);
-- 
2.7.4

#3 Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#2)
2 attachment(s)
Re: planner support functions: handle GROUP BY estimates ?

On Sun, Dec 22, 2019 at 06:16:48PM -0600, Justin Pryzby wrote:

On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:

Tom implemented "Planner support functions":
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a391ff3c3d418e404a2c6e4ff0865a107752827b
https://www.postgresql.org/docs/12/xfunc-optimization.html

I wondered whether there was any consideration to extend that to allow
providing improved estimates for GROUP BY. Currently that requires manually
creating an expression index, and only works if the function is IMMUTABLE
(which is not true for e.g. date_trunc of timestamptz).

I didn't hear back, so I tried implementing this for date_trunc(). Currently the

I currently assume that the input data has 1-second granularity:

...

If the input timestamps have (say) hourly granularity, rowcount will be
*underestimated* by 3600x, which is worse than the behavior in master of
overestimating by (for "day") 24x.

I'm trying to think of ways to address that:

In the attached, I handled that by using the histogram and the variable's
initial ndistinct estimate, giving good estimates even for intermediate
granularities of input timestamps.
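
For concreteness with the test data below: 11 days of 15-minute samples gives
ndistinct = 1057 over a range of about 950400 seconds, i.e. roughly one
distinct value per 900 s; truncating to "hour" then warrants only a ~4x
correction (3600/900), not the full 60*60 that a 1-second assumption would
apply.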

|postgres=# DROP TABLE IF EXISTS t; CREATE TABLE t(i) AS SELECT a FROM generate_series(now(), now()+'11 day'::interval, '15 minutes')a,generate_series(1,9)b; ANALYZE t;
|
|postgres=# explain analyze SELECT date_trunc('hour',i) i FROM t GROUP BY 1;
| HashAggregate (cost=185.69..188.99 rows=264 width=8) (actual time=42.110..42.317 rows=265 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('minute',i) i FROM t GROUP BY 1;
| HashAggregate (cost=185.69..198.91 rows=1057 width=8) (actual time=41.685..42.264 rows=1057 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('day',i) i FROM t GROUP BY 1;
| HashAggregate (cost=185.69..185.83 rows=11 width=8) (actual time=46.672..46.681 rows=12 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('second',i) i FROM t GROUP BY 1;
| HashAggregate (cost=185.69..198.91 rows=1057 width=8) (actual time=41.816..42.435 rows=1057 loops=1)

Attachments:

v2-0001-Planner-support-functions-for-GROUP-BY-f.patch (text/x-diff; charset=us-ascii)
From 772876dbd64ea0b1d2bb28f9ab67f577c4050468 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Sun, 15 Dec 2019 20:27:24 -0600
Subject: [PATCH v2 1/2] Planner support functions for GROUP BY f()..

..implemented for date_trunc()

See also a391ff3c3d418e404a2c6e4ff0865a107752827b
---
 src/backend/optimizer/util/plancat.c | 47 +++++++++++++++++
 src/backend/utils/adt/selfuncs.c     | 28 +++++++++++
 src/backend/utils/adt/timestamp.c    | 97 ++++++++++++++++++++++++++++++++++++
 src/include/catalog/catversion.h     |  2 +-
 src/include/catalog/pg_proc.dat      | 15 ++++--
 src/include/nodes/nodes.h            |  3 +-
 src/include/nodes/supportnodes.h     | 17 +++++++
 src/include/optimizer/plancat.h      |  2 +
 8 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index c15654e..2469ca6 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -2009,6 +2009,53 @@ get_function_rows(PlannerInfo *root, Oid funcid, Node *node)
 }
 
 /*
+ * Return a multiplier [0..1] to help estimate effect on rowcount of GROUP BY
+ * f(x), relative to input x.
+ */
+double
+get_function_groupby(PlannerInfo *root, Oid funcid, Node *node, Node *var)
+{
+	HeapTuple	proctup;
+	Form_pg_proc procform;
+
+	proctup = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcid));
+	if (!HeapTupleIsValid(proctup))
+		elog(ERROR, "cache lookup failed for function %u", funcid);
+	procform = (Form_pg_proc) GETSTRUCT(proctup);
+
+	if (OidIsValid(procform->prosupport))
+	{
+		SupportRequestGroupBy *sresult;
+		SupportRequestGroupBy req;
+
+		req.type = T_SupportRequestGroupBy;
+		req.root = root;
+		req.funcid = funcid;
+		req.node = node;
+		req.var = var;
+		req.factor = 1;			/* just for sanity */
+
+		sresult = (SupportRequestGroupBy *)
+			DatumGetPointer(OidFunctionCall1(procform->prosupport,
+											 PointerGetDatum(&req)));
+
+		if (sresult == &req)
+		{
+			/* Success */
+			ReleaseSysCache(proctup);
+			return req.factor;
+		}
+	}
+
+	/* XXX No support function, or it failed */
+
+	ReleaseSysCache(proctup);
+
+	return 1;
+}
+
+
+/*
  * has_unique_index
  *
  * Detect whether there is a unique index on the specified attribute
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index ff02b5a..eb0b86f 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3154,10 +3154,38 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 		 */
 		foreach(l2, varshere)
 		{
+			double ret;
 			Node	   *var = (Node *) lfirst(l2);
 
 			examine_variable(root, var, 0, &vardata);
 			varinfos = add_unique_group_var(root, varinfos, var, &vardata);
+
+			/* If we group by a function of a simple var, try to call its support function to help estimate GROUP BY */
+			if (HeapTupleIsValid(vardata.statsTuple) &&
+					IsA(groupexpr, FuncExpr) && IsA(var, Var))
+					// && (varRelid == 0 || varRelid == ((Var *) basenode)->varno))
+			{
+				Form_pg_statistic stats = (Form_pg_statistic)GETSTRUCT(vardata.statsTuple);
+				FuncExpr   *expr = (FuncExpr *) groupexpr;
+
+				Var *v = (Var*) var;
+				RangeTblEntry *rte = root->simple_rte_array[v->varno];
+				char *reln = get_rel_name(rte->relid);
+				char *coln = get_attname(rte->relid, v->varattno, true);
+				ret = get_function_groupby(root, expr->funcid, groupexpr, var);
+
+				fprintf(stderr, "HERE %s %d %s.%s ndistinct=%f ret=%f\n", __FUNCTION__, __LINE__,
+						reln?reln:"null",
+						coln?coln:"null",
+						stats->stadistinct, ret);
+
+				Assert(ret>=0);
+				Assert(ret<=1);
+				numdistinct *= ret;
+
+				/* Can we do anything special with stats->stadistinct that's not already done in general? */
+			}
+
 			ReleaseVariableStats(vardata);
 		}
 	}
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 945b8f8..359ad32 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -5554,3 +5554,100 @@ generate_series_timestamptz(PG_FUNCTION_ARGS)
 		SRF_RETURN_DONE(funcctx);
 	}
 }
+
+
+/*
+ * Planner support function for date_trunc
+ * Try to estimate the factor by which to correct the estimate of ngroups for GROUP BY.
+ */
+Datum
+date_trunc_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq;
+	SupportRequestGroupBy *req;
+	List	   *args;
+	Node	   *arg1, *arg2;
+	Node		 	*var;
+	char		*start;
+	int			typmod;
+
+	rawreq = (Node *) PG_GETARG_POINTER(0);
+	if (!IsA(rawreq, SupportRequestGroupBy))
+		PG_RETURN_POINTER(NULL);
+
+	req = (SupportRequestGroupBy *) rawreq;
+	if (!is_funcclause(req->node))	/* be paranoid */
+		PG_RETURN_POINTER(NULL);
+
+	args = ((FuncExpr *) req->node)->args;
+	arg1 = linitial(args);
+	arg2 = lsecond(args);
+	/* arg3 may be the timezone */
+
+	var = req->var;
+
+	/* XXX Assumes the input has 1-second granularity */
+
+	// XXX: only work on const?
+	start = TextDatumGetCString(((Const*)arg1)->constvalue);
+
+	// XXX: not working due to promotion ?
+	// exprType(var) is still not right, since date_trunc(a, b::date) uses b's type and not date..
+	// ...but date_trunc('', x::date) is weird
+	// exprType(arg2)==TIMESTAMPOID || exprType(arg2)==TIMESTAMPTZOID)
+	// if (req->funcid==1217 || req->funcid==1284 || req->funcid==2020)
+	/* Reset if it's not a timestamp */
+	if (exprType(var)==DATEOID) {
+		req->factor = 60*60*24;
+	} else if (exprType(var)==TIMESTAMPOID || exprType(var)==TIMESTAMPTZOID) {
+		int typmod = exprTypmod(arg2); // XXX: vartypmod ?
+		/* If the input has decimal digits, the grouping effect is stronger */
+		if (typmod != -1) {
+			req->factor /= 2<<typmod;
+			if (strcmp(start, "microseconds")==0) {
+				/* do nothing? */
+			} else if (strcmp(start, "milliseconds")==0) {
+				/* do nothing? */
+			}
+		}
+
+		if (strcmp(start, "second")==0) {
+			/* do nothing */
+		} else if (strcmp(start, "minute")==0) {
+			req->factor /= 60;
+		} else if (strcmp(start, "hour")==0) {
+			req->factor /= 60*60;
+		}
+	}
+
+	// else { elog(ERROR, "unknown type %u", exprType(var)); }
+
+	if (strcmp(start, "day")==0) {
+		req->factor /= 60*60*24;
+	} else if (strcmp(start, "week")==0) {
+		req->factor /= 60*60*24*7;
+	} else if (strcmp(start, "month")==0) {
+		/* 30 days */
+		req->factor /= 60*60*24*30;
+	} else if (strcmp(start, "quarter")==0) {
+		req->factor /= 60*60*24*30*3;
+	} else if (strcmp(start, "year")==0) {
+		req->factor /= 60*60*24*365.24;
+	} else if (strcmp(start, "decade")==0) {
+		req->factor /= 60*60*24*365.25*10;
+	} else if (strcmp(start, "century")==0) {
+		req->factor /= 60*60*24*365.25*100;
+	} else if (strcmp(start, "millennium")==0) {
+		req->factor /= 60*60*24*365.25*1000;
+	} else if (req->factor > 1) {
+		/* Maybe a DATE with finer granularity trunc */
+		req->factor = 1;
+	}
+	// else { elog(ERROR, "", ); }
+
+	/* Fudge Factor, since the input may be already "grouped", say at multiples of 15min, */
+	/* or otherwise have coarse granularity to begin with */
+	// factor/=4;
+
+	PG_RETURN_POINTER(req);
+}
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index eca67a1..e2c05be 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	201911242
+#define CATALOG_VERSION_NO	201911243
 
 #endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index aae50d6..7a8c3d1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -2350,11 +2350,14 @@
 { oid => '1217',
   descr => 'truncate timestamp with time zone to specified units',
   proname => 'date_trunc', provolatile => 's', prorettype => 'timestamptz',
-  proargtypes => 'text timestamptz', prosrc => 'timestamptz_trunc' },
+  proargtypes => 'text timestamptz', prosrc => 'timestamptz_trunc',
+  prosupport => 'date_trunc_support', },
 { oid => '1284',
   descr => 'truncate timestamp with time zone to specified units in specified time zone',
   proname => 'date_trunc', provolatile => 's', prorettype => 'timestamptz',
-  proargtypes => 'text timestamptz text', prosrc => 'timestamptz_trunc_zone' },
+  proargtypes => 'text timestamptz text', prosrc => 'timestamptz_trunc_zone',
+  prosupport => 'date_trunc_support', },
+# XXX:
 { oid => '1218', descr => 'truncate interval to specified units',
   proname => 'date_trunc', prorettype => 'interval',
   proargtypes => 'text interval', prosrc => 'interval_trunc' },
@@ -5632,7 +5635,13 @@
   proargtypes => 'timestamptz', prosrc => 'timestamptz_time' },
 { oid => '2020', descr => 'truncate timestamp to specified units',
   proname => 'date_trunc', prorettype => 'timestamp',
-  proargtypes => 'text timestamp', prosrc => 'timestamp_trunc' },
+  proargtypes => 'text timestamp', prosrc => 'timestamp_trunc',
+  prosupport => 'date_trunc_support', },
+
+{ oid => '5449', descr => 'planner support for date_trunc',
+  proname => 'date_trunc_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'date_trunc_support', },
+
 { oid => '2021', descr => 'extract field from timestamp',
   proname => 'date_part', prorettype => 'float8',
   proargtypes => 'text timestamp', prosrc => 'timestamp_part' },
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8692a32..97e7796 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -513,7 +513,8 @@ typedef enum NodeTag
 	T_SupportRequestSelectivity,	/* in nodes/supportnodes.h */
 	T_SupportRequestCost,		/* in nodes/supportnodes.h */
 	T_SupportRequestRows,		/* in nodes/supportnodes.h */
-	T_SupportRequestIndexCondition	/* in nodes/supportnodes.h */
+	T_SupportRequestIndexCondition,	/* in nodes/supportnodes.h */
+	T_SupportRequestGroupBy,	/* in nodes/supportnodes.h */
 } NodeTag;
 
 /*
diff --git a/src/include/nodes/supportnodes.h b/src/include/nodes/supportnodes.h
index 460d75b..cb4ea44 100644
--- a/src/include/nodes/supportnodes.h
+++ b/src/include/nodes/supportnodes.h
@@ -168,6 +168,23 @@ typedef struct SupportRequestRows
 	double		rows;			/* number of rows expected to be returned */
 } SupportRequestRows;
 
+/* How many fewer rows are output after GROUPing BY a function */
+typedef struct SupportRequestGroupBy
+{
+	NodeTag		type;
+
+	/* Input fields: */
+	struct PlannerInfo *root;	/* Planner's infrastructure (could be NULL) */
+	Oid			funcid;			/* function we are inquiring about */
+	Node		*var;			/* original (2nd) argument */
+	Node		*node;			/* parse node invoking function */
+
+	/* Output fields: */
+	double		factor;			/* [0..1] fraction of rows in GROUP BY f(x)
+								   relative to GROUP BY x */
+
+} SupportRequestGroupBy;
+
 /*
  * The IndexCondition request allows the support function to generate
  * a directly-indexable condition based on a target function call that is
diff --git a/src/include/optimizer/plancat.h b/src/include/optimizer/plancat.h
index bbb27f8..cbe5386 100644
--- a/src/include/optimizer/plancat.h
+++ b/src/include/optimizer/plancat.h
@@ -70,6 +70,8 @@ extern void add_function_cost(PlannerInfo *root, Oid funcid, Node *node,
 
 extern double get_function_rows(PlannerInfo *root, Oid funcid, Node *node);
 
+extern double get_function_groupby(PlannerInfo *root, Oid funcid, Node *node, Node *var);
+
 extern bool has_row_triggers(PlannerInfo *root, Index rti, CmdType event);
 
 extern bool has_stored_generated_columns(PlannerInfo *root, Index rti);
-- 
2.7.4

v2-0002-Pass-ndistinct-and-minmax-to-allow-good-estimates.patch (text/x-diff; charset=us-ascii)
From e287fa474fea487ace0ee7d476f84b6f787cc2a7 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Tue, 24 Dec 2019 22:02:22 -0600
Subject: [PATCH v2 2/2] Pass ndistinct and minmax to allow good estimates even
 with timestamps of granularity other than 1sec

---
 src/backend/optimizer/util/plancat.c |   8 ++-
 src/backend/utils/adt/selfuncs.c     |  36 ++++++++---
 src/backend/utils/adt/timestamp.c    | 113 ++++++++++++++++++++++-------------
 src/include/nodes/supportnodes.h     |   2 +
 src/include/optimizer/plancat.h      |   2 +-
 5 files changed, 110 insertions(+), 51 deletions(-)

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 2469ca6..aab794c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -2011,9 +2011,13 @@ get_function_rows(PlannerInfo *root, Oid funcid, Node *node)
 /*
  * Return a multiplier [0..1] to help estimate effect on rowcount of GROUP BY
  * f(x), relative to input x.
+ *
+ * minmax is an array of (min,max) values for the variable, which might be
+ * useful to determine its granularity (like timestamps per second, minute or
+ * hour).
  */
 double
-get_function_groupby(PlannerInfo *root, Oid funcid, Node *node, Node *var)
+get_function_groupby(PlannerInfo *root, Oid funcid, Node *node, Node *var, double ndistinct, Datum *minmax)
 {
 	HeapTuple	proctup;
 	Form_pg_proc procform;
@@ -2033,6 +2037,8 @@ get_function_groupby(PlannerInfo *root, Oid funcid, Node *node, Node *var)
 		req.funcid = funcid;
 		req.node = node;
 		req.var = var;
+		req.ndistinct = ndistinct;
+		req.minmax = minmax;
 		req.factor = 1;			/* just for sanity */
 
 		sresult = (SupportRequestGroupBy *)
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index eb0b86f..a7f396d 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -116,6 +116,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/supportnodes.h"
 #include "optimizer/clauses.h"
 #include "optimizer/cost.h"
 #include "optimizer/optimizer.h"
@@ -3168,16 +3169,37 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 				Form_pg_statistic stats = (Form_pg_statistic)GETSTRUCT(vardata.statsTuple);
 				FuncExpr   *expr = (FuncExpr *) groupexpr;
 
-				Var *v = (Var*) var;
+				Datum	minmax[2];
+				TypeCacheEntry *tce;
+				Var 	*v = (Var*) var;
 				RangeTblEntry *rte = root->simple_rte_array[v->varno];
-				char *reln = get_rel_name(rte->relid);
-				char *coln = get_attname(rte->relid, v->varattno, true);
-				ret = get_function_groupby(root, expr->funcid, groupexpr, var);
-
-				fprintf(stderr, "HERE %s %d %s.%s ndistinct=%f ret=%f\n", __FUNCTION__, __LINE__,
+				char 	*reln = get_rel_name(rte->relid);
+				char 	*coln = get_attname(rte->relid, v->varattno, true);
+
+				/* Seems like maybe this should be defined here and pass a single argument to groupby helper? */
+				SupportRequestGroupBy req = {
+					.type = T_SupportRequestGroupBy,
+					.root = root,
+					.funcid = expr->funcid,
+					.node = groupexpr,
+					.var = var,
+					.ndistinct = stats->stadistinct >= 0 ? stats->stadistinct :
+						-stats->stadistinct * vardata.rel->tuples,
+					.minmax = minmax,
+					.factor = 1,	/* just for sanity */
+				};
+
+				fprintf(stderr, "HERE %s %d %s.%s tuples=%f, stadistinct=%f ndistinct=%f ret=%f\n", __FUNCTION__, __LINE__,
 						reln?reln:"null",
 						coln?coln:"null",
-						stats->stadistinct, ret);
+						vardata.rel->tuples,
+						stats->stadistinct, req.ndistinct, ret);
+
+				tce = lookup_type_cache(v->vartype, TYPECACHE_LT_OPR);
+				if (get_variable_range(root, &vardata, tce->lt_opr, minmax, minmax+1))
+					ret = get_function_groupby(root, expr->funcid, groupexpr, var, req.ndistinct, minmax);
+				else
+					ret = get_function_groupby(root, expr->funcid, groupexpr, var, req.ndistinct, NULL);
 
 				Assert(ret>=0);
 				Assert(ret<=1);
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 359ad32..0314ef9 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -5563,13 +5563,44 @@ generate_series_timestamptz(PG_FUNCTION_ARGS)
 Datum
 date_trunc_support(PG_FUNCTION_ARGS)
 {
-	Node	   *rawreq;
+	Node		*rawreq;
 	SupportRequestGroupBy *req;
-	List	   *args;
-	Node	   *arg1, *arg2;
-	Node		 	*var;
+	List		*args;
+	Node		*arg1, *arg2;
+	Node	 	*var;
 	char		*start;
-	int			typmod;
+	int			typmod, i;
+	long unsigned int	diff;
+
+	const struct {
+		const char		*name;	/* Granularity name */
+		const int		factor;	/* Multiplier */
+		const double	minsecsec; /* Minimum distinct values per second (over the histogram range) for this factor to fully apply */
+	} grans[] = {
+		/* XXX: these factors not applied unless...typmod>1 ? */
+		{ "microseconds", 1 },
+		{ "milliseconds", 1 },
+
+		/*
+		 * These factors not applied unless minmax indicates the incoming
+		 * timestamp is at fine enough granularity that truncating to coarser
+		 * granularity would affect result.
+		 */
+		{ "second", 1, 1.0 },
+		{ "minute", 60, 1.0/60 },
+		{ "hour", 60, 1.0/60/60 },
+		{ "day", 24, 1.0/60/60/24 }, // XXX: should not handle for DATEOID
+
+		/* These factors applied up to the specified granularity */
+		{ "week", 7 },
+		{ "month", 30 },
+		{ "quarter", 3 },
+		{ "year", 4 },
+		{ "decade", 10 },
+		{ "century", 10 },
+		{ "millennium", 10 },
+
+	};
 
 	rawreq = (Node *) PG_GETARG_POINTER(0);
 	if (!IsA(rawreq, SupportRequestGroupBy))
@@ -5584,13 +5615,45 @@ date_trunc_support(PG_FUNCTION_ARGS)
 	arg2 = lsecond(args);
 	/* arg3 may be the timezone */
 
-	var = req->var;
+	// XXX: handle if these are null ?
+	diff = req->minmax ? timestamptz_to_time_t(DatumGetTimestamp(req->minmax[1])) -
+			timestamptz_to_time_t(DatumGetTimestamp(req->minmax[0]))
+			: 0;
 
-	/* XXX Assumes the input has 1-second granularity */
+	fprintf(stderr, "got distinct %f diff=%ld\n", req->ndistinct, diff);
 
 	// XXX: only work on const?
 	start = TextDatumGetCString(((Const*)arg1)->constvalue);
 
+	for (i=0; ; ++i) {
+		if (i >= sizeof(grans)/sizeof(*grans))
+			/* Unhandled truncation granularity */
+			PG_RETURN_POINTER(NULL);
+
+		fprintf(stderr, "applying factor %s %d: cur=%f %f -gt %f\n",
+				grans[i].name, grans[i].factor, req->factor,
+				req->ndistinct/diff, grans[i].minsecsec );
+
+		if (req->ndistinct / diff >= grans[i].minsecsec) {
+			if (req->ndistinct / diff > grans[i-1].minsecsec)
+				/* Apply the factor in full strength */
+				req->factor /= grans[i].factor;
+			else {
+				/* interpolate: if at (say) 15 minute granularity, then apply a 4x hourly correction, not 60x */
+				// req->factor /= grans[i].factor / (req->ndistinct / diff / grans[i].minsecsec);
+				req->factor /= req->ndistinct / diff / grans[i].minsecsec; // XXX: is this right
+				fprintf(stderr, "applying partial factor %f\n", req->ndistinct / diff / grans[i].minsecsec);
+			}
+		}
+
+		if (strcmp(start, grans[i].name) == 0)
+			break;
+	}
+
+	PG_RETURN_POINTER(req);
+
+#if 0
+	// var = req->var;
 	// XXX: not working due to promotion ?
 	// exprType(var) is still not right, since date_trunc(a, b::date) uses b's type and not date..
 	// ...but date_trunc('', x::date) is weird
@@ -5604,50 +5667,16 @@ date_trunc_support(PG_FUNCTION_ARGS)
 		/* If the input has decimal digits, the grouping effect is stronger */
 		if (typmod != -1) {
 			req->factor /= 2<<typmod;
-			if (strcmp(start, "microseconds")==0) {
-				/* do nothing? */
-			} else if (strcmp(start, "milliseconds")==0) {
-				/* do nothing? */
-			}
 		}
 
-		if (strcmp(start, "second")==0) {
-			/* do nothing */
-		} else if (strcmp(start, "minute")==0) {
-			req->factor /= 60;
-		} else if (strcmp(start, "hour")==0) {
-			req->factor /= 60*60;
-		}
 	}
 
 	// else { elog(ERROR, "unknown type %u", exprType(var)); }
 
-	if (strcmp(start, "day")==0) {
-		req->factor /= 60*60*24;
-	} else if (strcmp(start, "week")==0) {
-		req->factor /= 60*60*24*7;
-	} else if (strcmp(start, "month")==0) {
-		/* 30 days */
-		req->factor /= 60*60*24*30;
-	} else if (strcmp(start, "quarter")==0) {
-		req->factor /= 60*60*24*30*3;
-	} else if (strcmp(start, "year")==0) {
-		req->factor /= 60*60*24*365.24;
-	} else if (strcmp(start, "decade")==0) {
-		req->factor /= 60*60*24*365.25*10;
-	} else if (strcmp(start, "century")==0) {
-		req->factor /= 60*60*24*365.25*100;
-	} else if (strcmp(start, "millennium")==0) {
-		req->factor /= 60*60*24*365.25*1000;
-	} else if (req->factor > 1) {
-		/* Maybe a DATE with finer granularity trunc */
 		req->factor = 1;
 	}
-	// else { elog(ERROR, "", ); }
 
-	/* Fudge Factor, since the input may be already "grouped", say at multiples of 15min, */
-	/* or otherwise have coarse granularity to begin with */
-	// factor/=4;
+#endif
 
-	PG_RETURN_POINTER(req);
 }
diff --git a/src/include/nodes/supportnodes.h b/src/include/nodes/supportnodes.h
index cb4ea44..f5f33bf 100644
--- a/src/include/nodes/supportnodes.h
+++ b/src/include/nodes/supportnodes.h
@@ -178,6 +178,8 @@ typedef struct SupportRequestGroupBy
 	Oid			funcid;			/* function we are inquiring about */
 	Node		*var;			/* original (2nd) argument */
 	Node		*node;			/* parse node invoking function */
+	double		ndistinct;		/* Initial estimate of ndistinct of the variable */
+	Datum		*minmax;		/* Array of (min,max) values for variable */
 
 	/* Output fields: */
 	double		factor;			/* [0..1] fraction of rows in GROUP BY f(x)
diff --git a/src/include/optimizer/plancat.h b/src/include/optimizer/plancat.h
index cbe5386..c13d40d 100644
--- a/src/include/optimizer/plancat.h
+++ b/src/include/optimizer/plancat.h
@@ -70,7 +70,7 @@ extern void add_function_cost(PlannerInfo *root, Oid funcid, Node *node,
 
 extern double get_function_rows(PlannerInfo *root, Oid funcid, Node *node);
 
-extern double get_function_groupby(PlannerInfo *root, Oid funcid, Node *node, Node *var);
+extern double get_function_groupby(PlannerInfo *root, Oid funcid, Node *node, Node *var, double ndistinct, Datum *minmax);
 
 extern bool has_row_triggers(PlannerInfo *root, Index rti, CmdType event);
 
-- 
2.7.4

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#3)
Re: planner support functions: handle GROUP BY estimates ?

Justin Pryzby <pryzby@telsasoft.com> writes:

On Sun, Dec 22, 2019 at 06:16:48PM -0600, Justin Pryzby wrote:

On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:

Tom implemented "Planner support functions":
https://www.postgresql.org/docs/12/xfunc-optimization.html
I wondered whether there was any consideration to extend that to allow
providing improved estimates for GROUP BY. Currently that requires manually
creating an expression index, and only works if the function is IMMUTABLE
(which is not true for e.g. date_trunc of timestamptz).

I didn't hear back, so I tried implementing this for date_trunc(). Currently the
...
If the input timestamps have (say) hourly granularity, rowcount will be
*underestimated* by 3600x, which is worse than the behavior in master of
overestimating by (for "day") 24x.

While I don't have any objection in principle to extending the set of
things planner support functions can do, it doesn't seem like the idea is
giving you all that much traction for this problem. There isn't that much
knowledge that's specific to date_trunc in this, and instead you've got a
bunch of generic problems (that would have to be solved again in every
other function's planner support).

Another issue is that it seems like this doesn't compose nicely ---
if the GROUP BY expression is "f(g(x))", how do f's support function
and g's support function interact?

The direction that I've been wanting to go in for this kind of problem
is to allow CREATE STATISTICS on an expression, ie if you were concerned
about the estimation accuracy for GROUP BY or anything else, you could do
something like

CREATE STATISTICS foo ON date_trunc('day', mod_time) FROM my_table;

This would have the effect of cueing ANALYZE to gather stats on the
value of that expression, which the planner could then use, very much
as if you'd created an index on the expression. The advantages of
doing this rather than making an index are

(1) you don't have to pay the maintenance costs for an index,

(2) we don't have to restrict it to immutable expressions. (Volatile
expressions would have to be disallowed, if only because of fear of
side-effects; but I think we could allow stable expressions just fine.
Worst case problem is that the stats are stale, but so what?)

With a solution like this, we don't have to solve any of the difficult
problems of how the pieces of the expression interact with each other
or with the statistics of the underlying column(s). We just use the
stats if available, and the estimate will be as good as it'd be for
a plain column reference.
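
So, schematically (the CREATE STATISTICS form above is proposed syntax, not
something that works today):

CREATE STATISTICS foo ON date_trunc('day', mod_time) FROM my_table;
ANALYZE my_table;
-- the planner could then estimate this as if grouping by a plain column:
EXPLAIN SELECT date_trunc('day', mod_time) FROM my_table GROUP BY 1;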

I'm not sure how much new infrastructure would have to be built
for this. We designed the CREATE STATISTICS syntax to support
this (partly at my insistence IIRC) but I do not think any of the
existing plumbing is ready for it. I don't think it'd be very
hard to plug this into ANALYZE or the planner, but there might be
quite some work to be done on the catalog infrastructure, pg_dump,
etc.

cc'ing Tomas in case he has any thoughts about it.

regards, tom lane

#5 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#4)
Re: planner support functions: handle GROUP BY estimates ?

On Tue, Jan 14, 2020 at 03:12:21PM -0500, Tom Lane wrote:

Justin Pryzby <pryzby@telsasoft.com> writes:

On Sun, Dec 22, 2019 at 06:16:48PM -0600, Justin Pryzby wrote:

On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:

Tom implemented "Planner support functions":
https://www.postgresql.org/docs/12/xfunc-optimization.html
I wondered whether there was any consideration to extend that to allow
providing improved estimates for GROUP BY. Currently that requires manually
creating an expression index, and only works if the function is IMMUTABLE
(which is not true for e.g. date_trunc of timestamptz).

I didn't hear back, so I tried implementing this for date_trunc(). Currently the
...
If the input timestamps have (say) hourly granularity, rowcount will be
*underestimated* by 3600x, which is worse than the behavior in master of
overestimating by (for "day") 24x.

While I don't have any objection in principle to extending the set of
things planner support functions can do, it doesn't seem like the idea is
giving you all that much traction for this problem. There isn't that much
knowledge that's specific to date_trunc in this, and instead you've got a
bunch of generic problems (that would have to be solved again in every
other function's planner support).

Another issue is that it seems like this doesn't compose nicely ---
if the GROUP BY expression is "f(g(x))", how do f's support function
and g's support function interact?

The direction that I've been wanting to go in for this kind of problem
is to allow CREATE STATISTICS on an expression, ie if you were concerned
about the estimation accuracy for GROUP BY or anything else, you could do
something like

CREATE STATISTICS foo ON date_trunc('day', mod_time) FROM my_table;

This would have the effect of cueing ANALYZE to gather stats on the
value of that expression, which the planner could then use, very much
as if you'd created an index on the expression. The advantages of
doing this rather than making an index are

(1) you don't have to pay the maintenance costs for an index,

(2) we don't have to restrict it to immutable expressions. (Volatile
expressions would have to be disallowed, if only because of fear of
side-effects; but I think we could allow stable expressions just fine.
Worst case problem is that the stats are stale, but so what?)

With a solution like this, we don't have to solve any of the difficult
problems of how the pieces of the expression interact with each other
or with the statistics of the underlying column(s). We just use the
stats if available, and the estimate will be as good as it'd be for
a plain column reference.

I'm not sure how much new infrastructure would have to be built
for this. We designed the CREATE STATISTICS syntax to support
this (partly at my insistence IIRC) but I do not think any of the
existing plumbing is ready for it. I don't think it'd be very
hard to plug this into ANALYZE or the planner, but there might be
quite some work to be done on the catalog infrastructure, pg_dump,
etc.

cc'ing Tomas in case he has any thoughts about it.

Well, I certainly do have thoughts about this - it's pretty much exactly what
I proposed yesterday in this thread:

/messages/by-id/20200113230008.g67iyk4cs3xbnjju@development

The third part of that patch series is exactly about supporting extended
statistics on expressions, much the way you described here. The current
status of the WIP patch is that grammar + ANALYZE mostly works, but
there is no support in the planner. It's obviously still very hackish.

The main thing I'm not sure about is how to represent this in catalogs,
whether to have two fields (like for indexes) or maybe a single list of
expressions.

I'm also wondering if we could/should 100% rely on extended statistics,
because those are really meant to track correlations between columns,
which means we currently require at least two attributes in CREATE
STATISTICS and so on. So maybe what we want is to collect "regular"
per-column stats just like we do for indexes, but without the index
maintenance overhead?

The advantage would be we'd get exactly the same stats as for indexes,
and we could use them in the same places out of the box. While with
extended stats we'll have to tweak those places.

Now, the trouble is we can't store stuff in pg_statistic without having
a relation (i.e. table / index / ...) but maybe we could invent a new
relation type for this purpose. Of course, it'd require some catalog
work to represent this ...

Ultimately I think we'd want both things; it's not one or the other.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#5)
Re: planner support functions: handle GROUP BY estimates ?

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Tue, Jan 14, 2020 at 03:12:21PM -0500, Tom Lane wrote:

cc'ing Tomas in case he has any thoughts about it.

Well, I certainly do have thoughts about this - it's pretty much exactly what
I proposed yesterday in this thread:
/messages/by-id/20200113230008.g67iyk4cs3xbnjju@development
The third part of that patch series is exactly about supporting extended
statistics on expressions, much the way you described here. The current
status of the WIP patch is that grammar + ANALYZE mostly works, but
there is no support in the planner. It's obviously still very hackish.

Cool. We should probably take the discussion to that thread, then.

I'm also wondering if we could/should 100% rely on extended statistics,
because those are really meant to track correlations between columns,

Yeah, it seems likely to me that the infrastructure for this would be
somewhat different --- the user-facing syntax could be basically the
same, but ultimately we want to generate entries in pg_statistic not
pg_statistic_ext_data. Or at least entries that look the same as what
you could find in pg_statistic.

regards, tom lane

#7 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#6)
Re: planner support functions: handle GROUP BY estimates ?

On Tue, Jan 14, 2020 at 04:21:57PM -0500, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Tue, Jan 14, 2020 at 03:12:21PM -0500, Tom Lane wrote:

cc'ing Tomas in case he has any thoughts about it.

Well, I certainly do have thoughts about this - it's pretty much exactly what
I proposed yesterday in this thread:
/messages/by-id/20200113230008.g67iyk4cs3xbnjju@development
The third part of that patch series is exactly about supporting extended
statistics on expressions, much the way you described here. The current
status of the WIP patch is that grammar + ANALYZE mostly works, but
there is no support in the planner. It's obviously still very hackish.

Cool. We should probably take the discussion to that thread, then.

I'm also wondering if we could/should 100% rely on extended statistics,
because those are really meant to track correlations between columns,

Yeah, it seems likely to me that the infrastructure for this would be
somewhat different --- the user-facing syntax could be basically the
same, but ultimately we want to generate entries in pg_statistic not
pg_statistic_ext_data. Or at least entries that look the same as what
you could find in pg_statistic.

Yeah. I think we could invent a new type of statistics, "expressions",
which would simply build these per-column stats. So for example

CREATE STATISTICS s (expressions) ON (a*b), sqrt(c) FROM t;

would build per-column stats stored in pg_statistics, while

CREATE STATISTICS s (mcv) ON (a*b), sqrt(c) FROM t;

would build the multi-column MCV list on expressions.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#7)
Re: planner support functions: handle GROUP BY estimates ?

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Tue, Jan 14, 2020 at 04:21:57PM -0500, Tom Lane wrote:

Yeah, it seems likely to me that the infrastructure for this would be
somewhat different --- the user-facing syntax could be basically the
same, but ultimately we want to generate entries in pg_statistic not
pg_statistic_ext_data. Or at least entries that look the same as what
you could find in pg_statistic.

Yeah. I think we could invent a new type of statistics, "expressions",
which would simply build these per-column stats. So for example
CREATE STATISTICS s (expressions) ON (a*b), sqrt(c) FROM t;

I was imagining the type keyword as being "standard" or something
like that, since what it's going to build are the "standard" kinds
of stats for the expression's datatype. But yeah, it has to be some
keyword other than the existing ones.

The main issue for sticking the results into pg_statistic is that
the primary key there is (starelid, staattnum), and we haven't got
a suitable attnum. I wouldn't much object to putting the data into
pg_statistic_ext_data, but it doesn't really have a suitable
rowtype ...

regards, tom lane

#9 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#8)
Re: planner support functions: handle GROUP BY estimates ?

On Tue, Jan 14, 2020 at 04:52:44PM -0500, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Tue, Jan 14, 2020 at 04:21:57PM -0500, Tom Lane wrote:

Yeah, it seems likely to me that the infrastructure for this would be
somewhat different --- the user-facing syntax could be basically the
same, but ultimately we want to generate entries in pg_statistic not
pg_statistic_ext_data. Or at least entries that look the same as what
you could find in pg_statistic.

Yeah. I think we could invent a new type of statistics, "expressions",
which would simply build these per-column stats. So for example
CREATE STATISTICS s (expressions) ON (a*b), sqrt(c) FROM t;

I was imagining the type keyword as being "standard" or something
like that, since what it's going to build are the "standard" kinds
of stats for the expression's datatype. But yeah, it has to be some
keyword other than the existing ones.

The main issue for sticking the results into pg_statistic is that
the primary key there is (starelid, staattnum), and we haven't got
a suitable attnum. I wouldn't much object to putting the data into
pg_statistic_ext_data, but it doesn't really have a suitable
rowtype ...

Well, that's why I proposed to essentially build a fake "relation" just
for this purpose. So we'd have a pg_class entry with a special relkind,
attnums and all that. And the expressions would be stored either in
pg_statistic_ext or in a new catalog. But maybe that's nonsense.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#9)
Re: planner support functions: handle GROUP BY estimates ?

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Tue, Jan 14, 2020 at 04:52:44PM -0500, Tom Lane wrote:

The main issue for sticking the results into pg_statistic is that
the primary key there is (starelid, staattnum), and we haven't got
a suitable attnum. I wouldn't much object to putting the data into
pg_statistic_ext_data, but it doesn't really have a suitable
rowtype ...

Well, that's why I proposed to essentially build a fake "relation" just
for this purpose. So we'd have a pg_class entry with a special relkind,
attnums and all that. And the expressions would be stored either in
pg_statistic_ext or in a new catalog. But maybe that's nonsense.

Seems pretty yucky. I realize we've already got "fake relations" like
foreign tables and composite types, but the number of special cases
those create is very annoying. And you still don't have anyplace to
put the expressions themselves in such a structure --- I hope you
weren't going to propose fake pg_index rows for that.

I wonder just how messy it would be to add a column to pg_statistic_ext
whose type is the composite type "pg_statistic", and drop the required
data into that. We've not yet used any composite types in the system
catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
it seems like we might be able to get away with it.

regards, tom lane

#11 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#10)
Re: planner support functions: handle GROUP BY estimates ?

On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Tue, Jan 14, 2020 at 04:52:44PM -0500, Tom Lane wrote:

The main issue for sticking the results into pg_statistic is that
the primary key there is (starelid, staattnum), and we haven't got
a suitable attnum. I wouldn't much object to putting the data into
pg_statistic_ext_data, but it doesn't really have a suitable
rowtype ...

Well, that's why I proposed to essentially build a fake "relation" just
for this purpose. So we'd have a pg_class entry with a special relkind,
attnums and all that. And the expressions would be stored either in
pg_statistic_ext or in a new catalog. But maybe that's nonsense.

Seems pretty yucky. I realize we've already got "fake relations" like
foreign tables and composite types, but the number of special cases
those create is very annoying. And you still don't have anyplace to
put the expressions themselves in such a structure --- I hope you
weren't going to propose fake pg_index rows for that.

No, I wasn't going to propose fake pg_index rows - I actually
wrote "stored either in pg_statistic_ext or in a new catalog", so I was
thinking about a new catalog (a dedicated and simplified copy of
pg_index).

I wonder just how messy it would be to add a column to pg_statistic_ext
whose type is the composite type "pg_statistic", and drop the required
data into that. We've not yet used any composite types in the system
catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
it seems like we might be able to get away with it.

I don't know, but feels a bit awkward to store this type of stats into
pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
work fine, not sure.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#11)
Re: planner support functions: handle GROUP BY estimates ?

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:

I wonder just how messy it would be to add a column to pg_statistic_ext
whose type is the composite type "pg_statistic", and drop the required
data into that. We've not yet used any composite types in the system
catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
it seems like we might be able to get away with it.

[ I meant pg_statistic_ext_data, obviously ]

I don't know, but feels a bit awkward to store this type of stats into
pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
work fine, not sure.

If we wanted to allow a single statistics object to contain data for
multiple expressions, we'd actually need that to be array-of-pg_statistic
not just pg_statistic. Seems do-able, but on the other hand we could
just prohibit having more than one output column in the "query" for this
type of extended statistic. Either way, this seems far less invasive
than either a new catalog or a new relation relkind (to say nothing of
needing both, which is where you seemed to be headed).
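
A concrete sketch of the array-of-composite shape, with a stand-in type
(pg_statistic itself can't be used as a column type outside bootstrap, as
comes up below); all names here are made up for illustration:

test=# CREATE TYPE stats_stub AS (stanullfrac real, stadistinct real);
test=# CREATE TABLE ext_data_demo (stxdexprs stats_stub[]);
test=# INSERT INTO ext_data_demo
         VALUES (ARRAY[ROW(0, -1)::stats_stub, ROW(0.1, 42)::stats_stub]);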

regards, tom lane

#13Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tom Lane (#12)
Re: planner support functions: handle GROUP BY estimates ?

On 1/15/20 12:44 AM, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:

I wonder just how messy it would be to add a column to pg_statistic_ext
whose type is the composite type "pg_statistic", and drop the required
data into that. We've not yet used any composite types in the system
catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
it seems like we might be able to get away with it.

[ I meant pg_statistic_ext_data, obviously ]

I don't know, but feels a bit awkward to store this type of stats into
pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
work fine, not sure.

If we wanted to allow a single statistics object to contain data for
multiple expressions, we'd actually need that to be array-of-pg_statistic
not just pg_statistic. Seems do-able, but on the other hand we could
just prohibit having more than one output column in the "query" for this
type of extended statistic. Either way, this seems far less invasive
than either a new catalog or a new relation relkind (to say nothing of
needing both, which is where you seemed to be headed).

I've started looking at statistics on expressions too, mostly because it
seems the extended stats improvements (as discussed in [1]) need that.

The "stash pg_statistic records into pg_statistics_ext_data" approach
seems simple, but it's not clear to me how to make it work, so I'd
appreciate some guidance.

1) We don't have any composite types in any catalog yet, and naive
attempts to just use something like

pg_statistic stxdexprs[1];

did not work. So I suppose this will require changes to genbki.pl, but
honestly, my Perl-fu is non-existent :-(

2) Won't it be an issue that pg_statistic contains pseudo-types? That
is, this does not work, for example:

test=# create table t (a pg_statistic[]);
ERROR: column "stavalues1" has pseudo-type anyarray

and it seems unlikely that just using this in a catalog would make it work.
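
(For reference, the offending pseudo-type columns can be listed with:

test=# SELECT attname, atttypid::regtype FROM pg_attribute
       WHERE attrelid = 'pg_statistic'::regclass
       AND atttypid = 'anyarray'::regtype;

which shows stavalues1 through stavalues5.)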

regards

[1]: /messages/by-id/ad7891d2-e90c-b446-9fe2-7419143847d7@enterprisedb.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#13)
2 attachment(s)
Re: planner support functions: handle GROUP BY estimates ?

On Mon, Nov 16, 2020 at 06:24:41PM +0100, Tomas Vondra wrote:

On 1/15/20 12:44 AM, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:

I wonder just how messy it would be to add a column to pg_statistic_ext
whose type is the composite type "pg_statistic", and drop the required
data into that. We've not yet used any composite types in the system
catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
it seems like we might be able to get away with it.

[ I meant pg_statistic_ext_data, obviously ]

I don't know, but feels a bit awkward to store this type of stats into
pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
work fine, not sure.

I've started looking at statistics on expressions too, mostly because it
seems the extended stats improvements (as discussed in [1]) need that.

The "stash pg_statistic records into pg_statistics_ext_data" approach
seems simple, but it's not clear to me how to make it work, so I'd
appreciate some guidance.

1) We don't have any composite types in any catalog yet, and
naive attempts to just use something like

pg_statistic stxdexprs[1];

did not work. So I suppose this will require changes to genbki.pl, but
honestly, my Perl-fu is non-existent :-(

In the attached, I didn't need to mess with perl.

2) Won't it be an issue that pg_statistic contains pseudo-types? That
is, this does not work, for example:

test=# create table t (a pg_statistic[]);
ERROR: column "stavalues1" has pseudo-type anyarray

It works during initdb for the same reason that it's allowed for pg_statistic.
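
FWIW, once something populates the new column, the per-expression stats
could be inspected with a query like this (a sketch; field access depends
on the final representation, and with this patch alone the column is
always null):

test=# SELECT stxoid, (stxdexpr).stanullfrac, (stxdexpr).stadistinct
       FROM pg_statistic_ext_data;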

--
Justin

Attachments:

0001-Allow-composite-types-in-bootstrap.patch (text/x-diff; charset=us-ascii)
From b9dabd3b773b077e55bb5ea23b89eb3d650029ee Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Tue, 17 Nov 2020 09:28:33 -0600
Subject: [PATCH 1/2] Allow composite types in bootstrap

---
 src/backend/bootstrap/bootstrap.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index a7ed93fdc1..ed6b3906ee 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -937,6 +937,21 @@ gettype(char *type)
 				return (*app)->am_oid;
 			}
 		}
+
+		/* The type wasn't known; check again to handle composite
+		 * types, added since first populating the array. */
+		Typ = NULL;
+		populate_typ_array();
+
+		/* Need to avoid infinite recursion... */
+		for (app = Typ; *app != NULL; app++)
+		{
+			if (strncmp(NameStr((*app)->am_typ.typname), type, NAMEDATALEN) == 0)
+			{
+				Ap = *app;
+				return (*app)->am_oid;
+			}
+		}
 	}
 	else
 	{
-- 
2.17.0

0002-Add-pg_statistic_ext_data.stxdexpr.patch (text/x-diff; charset=us-ascii)
From d7246706640d9a7e40805f1861d269cccb3f05ef Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Tue, 17 Nov 2020 09:47:32 -0600
Subject: [PATCH 2/2] Add pg_statistic_ext_data.stxdexpr

---
 src/backend/catalog/Makefile                | 8 ++++----
 src/backend/commands/statscmds.c            | 2 ++
 src/include/catalog/pg_statistic_ext_data.h | 1 +
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index 2519771210..91db93f64d 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -49,15 +49,15 @@ include $(top_srcdir)/src/backend/common.mk
 
 # Note: the order of this list determines the order in which the catalog
 # header files are assembled into postgres.bki.  BKI_BOOTSTRAP catalogs
-# must appear first, and there are reputedly other, undocumented ordering
-# dependencies.
+# must appear first, and pg_statistic before pg_statistic_ext_data, and there
+# are reputedly other, undocumented ordering dependencies.
 CATALOG_HEADERS := \
 	pg_proc.h pg_type.h pg_attribute.h pg_class.h \
 	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
 	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
 	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
-	pg_statistic_ext.h pg_statistic_ext_data.h \
-	pg_statistic.h pg_rewrite.h pg_trigger.h pg_event_trigger.h pg_description.h \
+	pg_statistic.h pg_statistic_ext.h pg_statistic_ext_data.h \
+	pg_rewrite.h pg_trigger.h pg_event_trigger.h pg_description.h \
 	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
 	pg_database.h pg_db_role_setting.h pg_tablespace.h \
 	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index 3057d89d50..476a6314a6 100644
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -366,6 +366,7 @@ CreateStatistics(CreateStatsStmt *stmt)
 	datanulls[Anum_pg_statistic_ext_data_stxdndistinct - 1] = true;
 	datanulls[Anum_pg_statistic_ext_data_stxddependencies - 1] = true;
 	datanulls[Anum_pg_statistic_ext_data_stxdmcv - 1] = true;
+	datanulls[Anum_pg_statistic_ext_data_stxdexpr - 1] = true;
 
 	/* insert it into pg_statistic_ext_data */
 	htup = heap_form_tuple(datarel->rd_att, datavalues, datanulls);
@@ -638,6 +639,7 @@ UpdateStatisticsForTypeChange(Oid statsOid, Oid relationOid, int attnum,
 
 	replaces[Anum_pg_statistic_ext_data_stxdmcv - 1] = true;
 	nulls[Anum_pg_statistic_ext_data_stxdmcv - 1] = true;
+	nulls[Anum_pg_statistic_ext_data_stxdexpr - 1] = true;
 
 	rel = table_open(StatisticExtDataRelationId, RowExclusiveLock);
 
diff --git a/src/include/catalog/pg_statistic_ext_data.h b/src/include/catalog/pg_statistic_ext_data.h
index c9515df117..c494d159a0 100644
--- a/src/include/catalog/pg_statistic_ext_data.h
+++ b/src/include/catalog/pg_statistic_ext_data.h
@@ -37,6 +37,7 @@ CATALOG(pg_statistic_ext_data,3429,StatisticExtDataRelationId)
 	pg_ndistinct stxdndistinct; /* ndistinct coefficients (serialized) */
 	pg_dependencies stxddependencies;	/* dependencies (serialized) */
 	pg_mcv_list stxdmcv;		/* MCV (serialized) */
+	pg_statistic stxdexpr;		/* expressions */
 
 #endif
 
-- 
2.17.0

#15Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Justin Pryzby (#14)
Re: planner support functions: handle GROUP BY estimates ?

On 11/17/20 5:18 PM, Justin Pryzby wrote:

On Mon, Nov 16, 2020 at 06:24:41PM +0100, Tomas Vondra wrote:

On 1/15/20 12:44 AM, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:

I wonder just how messy it would be to add a column to pg_statistic_ext
whose type is the composite type "pg_statistic", and drop the required
data into that. We've not yet used any composite types in the system
catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
it seems like we might be able to get away with it.

[ I meant pg_statistic_ext_data, obviously ]

I don't know, but feels a bit awkward to store this type of stats into
pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
work fine, not sure.

I've started looking at statistics on expressions too, mostly because it
seems the extended stats improvements (as discussed in [1]) need that.

The "stash pg_statistic records into pg_statistics_ext_data" approach
seems simple, but it's not clear to me how to make it work, so I'd
appreciate some guidance.

1) We don't have any composite types in any catalog yet, and
naive attempts to just use something like

pg_statistic stxdexprs[1];

did not work. So I suppose this will require changes to genbki.pl, but
honestly, my Perl-fu is non-existent :-(

In the attached, I didn't need to mess with perl.

2) Won't it be an issue that pg_statistic contains pseudo-types? That
is, this does not work, for example:

test=# create table t (a pg_statistic[]);
ERROR: column "stavalues1" has pseudo-type anyarray

It works during initdb for the same reason that it's allowed for pg_statistic.

Oh, wow! I hadn't expected a patch implementing this, that's great. I
owe you a beer or a drink of your choice.

Thanks!

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company