Multi-column distinctness.

Started by Kyotaro HORIGUCHIover 10 years ago26 messages
#1Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
6 attachment(s)

Hello, this patch enables planner to be couscious of inter-column
correlation.

Sometimes two or more columns in a table has some correlation
which brings underestimate, which leads to wrong join method and
ends with slow execution.

Tomas Vondra is now working on heavily-equipped multivariate
statistics for OLAP usage. In contrast, this is a lightly
implemented solution which calculates only the ratio between a
rows estimated by current method and a actual row number. I think
this doesn't conflict with his work except the grammar part.

This would apply fewer cases but I suppose still in many cases
the correlated colums would be in simple proportional
relationship, so this can help the cases. The previous discussion
is

https://wiki.postgresql.org/wiki/Cross_Columns_Stats
/messages/by-id/4D0BA4D5.8080707@fuzzy.cz

This patch is covers only the type A (Discrete values and
equality conditions) but I think it is usable in many cases seen
in the field. So I'd like to repropose for the latest version of
PostgreSQL.

- design outline

Provide new system catalog pg_mvcoefficient to store the
information required to do this.

A user can instruct planner to correct the wrong estimation
caused by inter-column correlation by registering the columns in
pg_mvcoefficient using new DDL ALTER TABLE... ADD STATISTICS.

Analyzing of the target table also stores the 'multivariate
coefficient' calculated by using the following formula into
pg_mvcoefficient.

mv_coef(c1, c2, ..) =
ndistinct(c1 * c2 * ...) / (ndistinct(c1) * ndistinct(c2) * ...)

In clauselist_selectivity, planner corrects the estimate if
given clauselist has equivalence-classes-compatible clauses for
required columns at the top-level.

- Example

The attached perl script gentbl.pl generates test data resembles
some tables in DBT-3 benchmark.

$ perl gentbl.pl | psql postgres

=# EXPLAIN ANALYZE SELECT * FROM t1 WHERE a = 1 AND b = 2501;

...

Seq Scan on t1 (cost=0.00..653.00 rows=1 width=12) (actual time=0.021..6.348 rows=8 loops=1)

This doesn't have no harm but in a join case,

=# EXPLAIN ANALYZE SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b;
Hash Join (cost=122.00..855.32 rows=32 width=24)
(actual time=2.009..29.208 rows=32000 loops=1)

The correlation between a and b makes the estimate too
small. Then register correlation setting.

=# ALTER TABLE t1 ADD STATISTICS (mvndistinct) ON (a, b);
=# ANALYZE t1;

Then the estimate will be corrected.

=# EXPLAIN ANALYZE SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b;
Hash Join (cost=122.00..855.32 rows=32000 width=24)
(actual time=1.907..29.025 rows=32000 loops=1)

- Known limitations

The coefficient calculated by this feature is applicble only for
conjunctions of simple var-exprs on merge-joinable operator.

The coefficient is applied regardless of whether the base
estimate has been calculated using MCV, so estimates for
non-join cases on the columns which has MCV can rather become
inaccurate.

Uniform correlation is assumed so some extent of correlation
ununiformity would lead to wrong estimation.

This patch set doesn't contain any document yet.

- Patche Files

This patch consists of the following files.

- 0001-New-system-catalog-pg_mvcoefficient.patch
Adds new system catalog pg_mvcoefficient.

- 0002-Analyze-part-for-multivariate-coefficient.patch
Analyze part of multivariate coefficient.

- 0003-Make-use-of-multivariate-coefficeient-in-estimation-.patch
Planner part to make it use the multivariate coefficient.

- 0004-Syntactical-part-of-multivariate-coefficient.patch
Add new DDL to define mv coefficient columns.

The four files above are essential. The two following files are
experimental patch to add mvcattrs to index columns. One of them
adds a new opclass for int2vector of btree but it would be
overkill.

- 0005-Add-btree-operator-class-for-int2vector.patch
Add btree operator class for int2vector.

- 0006-Use-modified-index-of-pg_mvcoefficient.patch
Use modified index of pg_mvcoefficient.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-New-system-catalog-pg_mvcoefficient.patchtext/x-patch; charset=us-asciiDownload
>From ce2beeba56b96af4f9289be53646f83e807637dc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 6 Aug 2015 16:49:04 +0900
Subject: [PATCH 1/6] New system catalog pg_mvcoefficient.

---
 src/backend/catalog/Makefile               |  3 +-
 src/backend/utils/cache/syscache.c         | 12 +++++
 src/include/catalog/indexing.h             |  3 ++
 src/include/catalog/pg_mvcoefficient.h     | 74 ++++++++++++++++++++++++++++++
 src/include/utils/syscache.h               |  1 +
 src/test/regress/expected/sanity_check.out |  1 +
 6 files changed, 93 insertions(+), 1 deletion(-)
 create mode 100644 src/include/catalog/pg_mvcoefficient.h

diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index 25130ec..4ce1653 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -33,7 +33,8 @@ POSTGRES_BKI_SRCS = $(addprefix $(top_srcdir)/src/include/catalog/,\
 	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
 	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
 	pg_statistic.h pg_rewrite.h pg_trigger.h pg_event_trigger.h pg_description.h \
-	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
+	pg_cast.h pg_enum.h pg_mvcoefficient.h pg_namespace.h pg_conversion.h \
+	pg_depend.h \
 	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
 	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
 	pg_ts_config.h pg_ts_config_map.h pg_ts_dict.h \
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index efce7b9..a895405 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -43,6 +43,7 @@
 #include "catalog/pg_foreign_server.h"
 #include "catalog/pg_foreign_table.h"
 #include "catalog/pg_language.h"
+#include "catalog/pg_mvcoefficient.h"
 #include "catalog/pg_namespace.h"
 #include "catalog/pg_opclass.h"
 #include "catalog/pg_operator.h"
@@ -501,6 +502,17 @@ static const struct cachedesc cacheinfo[] = {
 		},
 		4
 	},
+	{MvCoefficientRelationId,		/* MVCOEFFICIENT */
+		MvCoefficientIndexId,
+		1,
+		{
+			Anum_pg_mvcoefficient_mvcreloid,
+			0,
+			0,
+			0
+		},
+		4
+	},
 	{NamespaceRelationId,		/* NAMESPACENAME */
 		NamespaceNameIndexId,
 		1,
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index c38958d..fb28747 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -173,6 +173,9 @@ DECLARE_UNIQUE_INDEX(pg_largeobject_loid_pn_index, 2683, on pg_largeobject using
 DECLARE_UNIQUE_INDEX(pg_largeobject_metadata_oid_index, 2996, on pg_largeobject_metadata using btree(oid oid_ops));
 #define LargeObjectMetadataOidIndexId	2996
 
+DECLARE_UNIQUE_INDEX(pg_mvcoefficient_index, 3578, on pg_mvcoefficient using btree(mvcreloid oid_ops));
+#define MvCoefficientIndexId  3578
+
 DECLARE_UNIQUE_INDEX(pg_namespace_nspname_index, 2684, on pg_namespace using btree(nspname name_ops));
 #define NamespaceNameIndexId  2684
 DECLARE_UNIQUE_INDEX(pg_namespace_oid_index, 2685, on pg_namespace using btree(oid oid_ops));
diff --git a/src/include/catalog/pg_mvcoefficient.h b/src/include/catalog/pg_mvcoefficient.h
new file mode 100644
index 0000000..5936070
--- /dev/null
+++ b/src/include/catalog/pg_mvcoefficient.h
@@ -0,0 +1,74 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_mvcoefficient.h
+ *	  definition of the system multivariate coefficient relation
+ *	  (pg_mvcoefficient) along with the relation's initial contents.
+ *
+ * Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ * src/include/catalog/pg_mvcoefficient.h
+ *
+ * NOTES
+ *	  the genbki.pl script reads this file and generates .bki
+ *	  information from the DATA() statements.
+ *
+ *	  XXX do NOT break up DATA() statements into multiple lines!
+ *		  the scripts are not as smart as you might think...
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_MVCOEFFICIENT_H
+#define PG_MVCOEFFICIENT_H
+
+#include "catalog/genbki.h"
+#include "nodes/pg_list.h"
+
+/* ----------------
+ *		pg_mvcoefficient definition.  cpp turns this into
+ *		typedef struct FormData_pg_mvcoefficient
+ * ----------------
+ */
+#define MvCoefficientRelationId	3577
+
+/* This values is arbitrary number, but 6 seems already too much. */
+#define MVCOEF_MAX_COLS 6
+
+CATALOG(pg_mvcoefficient,3577) BKI_WITHOUT_OIDS
+{
+	Oid			mvcreloid;		/* OID of target relation */
+	int16		mvcnattrs;		/* number of mv columns */
+	float8	mvccoefficient;		/* multivariate distinctness coefficient */
+
+	/*
+	 * variable-length fields start here, but we allow direct access to
+	 * mvcattrs.
+	 */
+	int2vector	mvcattrs;		/* Column numbers */
+} FormData_pg_mvcoefficient;
+
+/* ----------------
+ *		Form_pg_mvcoefficient corresponds to a pointer to a tuple with the
+ *		format of pg_mvcoefficient relation.
+ * ----------------
+ */
+typedef FormData_pg_mvcoefficient *Form_pg_mvcoefficient;
+
+/* ----------------
+ *		compiler constants for pg_mvcoefficient
+ * ----------------
+< */
+#define Natts_pg_mvcoefficient				4
+#define Anum_pg_mvcoefficient_mvcreloid		1
+#define Anum_pg_mvcoefficient_mvcnattrs			2
+#define Anum_pg_mvcoefficient_mvccoefficient	3
+#define Anum_pg_mvcoefficient_mvcattrs			4
+
+/* ----------------
+ *		pg_mvcoefficient has no initial contents
+ * ----------------
+ */
+
+/*
+ * prototypes for functions in pg_enum.c
+ */
+#endif   /* PG_MVCOEFFICIENT_H */
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 18404e2..1f48d61 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -66,6 +66,7 @@ enum SysCacheIdentifier
 	INDEXRELID,
 	LANGNAME,
 	LANGOID,
+	MVCOEFFICIENT,
 	NAMESPACENAME,
 	NAMESPACEOID,
 	OPERNAMENSP,
diff --git a/src/test/regress/expected/sanity_check.out b/src/test/regress/expected/sanity_check.out
index eb0bc88..7c77796 100644
--- a/src/test/regress/expected/sanity_check.out
+++ b/src/test/regress/expected/sanity_check.out
@@ -113,6 +113,7 @@ pg_inherits|t
 pg_language|t
 pg_largeobject|t
 pg_largeobject_metadata|t
+pg_mvcoefficient|t
 pg_namespace|t
 pg_opclass|t
 pg_operator|t
-- 
1.8.3.1

0002-Analyze-part-for-multivariate-coefficient.patchtext/x-patch; charset=us-asciiDownload
>From 04207bbe8a5800fcc2c76cdacb77798639113b80 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 6 Aug 2015 17:12:28 +0900
Subject: [PATCH 2/6] Analyze part for multivariate coefficient.

Calculates multivarite coefficients on ANALYZE and stores them into
pg_mvcoefficient.  This code is considering multiple definitions for
one relation. (However definition part registers no more than one
definition per relation.)

It calculates the following number

 ndistinct(A * B * ...)
----------------------------------
 ndistint(A) * ndistinct(B) * ...
---
 src/backend/commands/analyze.c | 402 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 401 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..45f7cf5 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -27,6 +27,7 @@
 #include "catalog/indexing.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_inherits_fn.h"
+#include "catalog/pg_mvcoefficient.h"
 #include "catalog/pg_namespace.h"
 #include "commands/dbcommands.h"
 #include "commands/tablecmds.h"
@@ -46,6 +47,7 @@
 #include "utils/acl.h"
 #include "utils/attoptcache.h"
 #include "utils/datum.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -56,6 +58,10 @@
 #include "utils/timestamp.h"
 #include "utils/tqual.h"
 
+#ifndef CHAR_BIT
+#include <limits.h>
+#endif
+
 
 /* Per-index data for ANALYZE */
 typedef struct AnlIndexData
@@ -96,7 +102,11 @@ static void update_attstats(Oid relid, bool inh,
 				int natts, VacAttrStats **vacattrstats);
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
-
+static float4 compute_mv_ndistinct(int nattrs,
+								   VacAttrStats **colstats,
+								   AnalyzeAttrFetchFunc fetchfunc,
+								   int samplerows,
+								   double totalrows);
 
 /*
  *	analyze_rel() -- analyze one relation
@@ -281,6 +291,128 @@ analyze_rel(Oid relid, RangeVar *relation, int options,
 	LWLockRelease(ProcArrayLock);
 }
 
+/* Compute and store multivariate distinctness */
+static void
+update_mv_distinct(Oid reloid, VacAttrStats **vacattrstats,
+				   int attr_cnt, int numrows, double totalrows)
+{
+	ScanKeyData	scankey;
+	SysScanDesc	sysscan;
+	Relation	mvcrel;
+	HeapTuple	oldtup, newtup;
+
+	mvcrel = heap_open(MvCoefficientRelationId, RowExclusiveLock);
+
+	ScanKeyInit(&scankey,
+				Anum_pg_mvcoefficient_mvcreloid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(reloid));
+	sysscan = systable_beginscan(mvcrel, MvCoefficientIndexId, true,
+								 NULL, 1, &scankey);
+	oldtup = systable_getnext(sysscan);
+
+	while (HeapTupleIsValid(oldtup))
+	{
+		VacAttrStats *colstats[MVCOEF_MAX_COLS];
+		int		ncols;
+		double	nd_cartesian, nd, ncol_totalrows;
+		Datum	values[Natts_pg_mvcoefficient];
+		bool	nulls[Natts_pg_mvcoefficient];
+		bool	replaces[Natts_pg_mvcoefficient];
+		int		i;
+		float4	nd_coef;
+				
+		Form_pg_mvcoefficient mvc =
+			(Form_pg_mvcoefficient) GETSTRUCT (oldtup);
+
+		Assert(mvc->mvcnattrs > 1 && mvc->mvcnattrs <= MVCOEF_MAX_COLS);
+		ncols = mvc->mvcnattrs;
+
+		/* find vacattrstats entry for required columns */
+		for (i = 0 ; i < ncols ; i++)
+		{
+			int j;
+
+			for(j = 0 ; j < attr_cnt ; j++)
+			{
+				if (mvc->mvcattrs.values[i] == vacattrstats[j]->tupattnum)
+				{
+					colstats[i] = vacattrstats[j];
+					break;
+				}
+			}
+
+			/* required column is not in vacattrstats, give up */
+			if (j == attr_cnt)
+				break;
+		}
+
+		/* This mvdistinct is not computable with this vacattrstats. */
+		if (i != ncols)
+		{
+			oldtup = systable_getnext(sysscan);
+			continue;
+		}
+
+		/*
+		 * compute cartesian ndistinct =
+		 *      ndistinct(col0) * ndistinct(col1) ...
+		 *
+		 * This calculation is performed with distinct ratio to preserve
+		 * precision.
+		 */
+		nd_cartesian = 1.0;
+		for (i = 0 ; i < ncols ; i++)
+		{
+			double t = -colstats[i]->stadistinct;
+
+			/* ignore colums where stadistinct is "unknown" */
+			if (t == 0.0) continue;
+
+			if (t < 0)
+				t = -t / totalrows;
+
+			nd_cartesian *= t;
+		}
+
+		nd_coef = 1.0;
+
+		/* compute ndistinct(col0 * col1 * ... ) */
+		nd = compute_mv_ndistinct(i, colstats,
+								  std_fetch_func, numrows, totalrows);
+
+		ncol_totalrows = pow(totalrows, ncols);
+		if (!isinf(ncol_totalrows))	/* This won't be happen? */
+			nd_coef = nd / nd_cartesian / ncol_totalrows;
+		else
+			nd_coef = 0.0;
+
+		/* Update statistics record */
+		for (i = 0; i < Natts_pg_mvcoefficient ; ++i)
+		{
+			nulls[i] = false;
+			replaces[i] = false;
+		}
+		values[Anum_pg_mvcoefficient_mvccoefficient - 1] =
+			Float8GetDatum(nd_coef);
+		replaces[Anum_pg_mvcoefficient_mvccoefficient - 1] = true;
+		newtup = heap_modify_tuple(oldtup,
+								   RelationGetDescr(mvcrel),
+								   values,
+								   nulls,
+								   replaces);
+		simple_heap_update(mvcrel, &oldtup->t_self, newtup);
+				
+		CatalogUpdateIndexes(mvcrel, newtup);
+				
+		oldtup = systable_getnext(sysscan);
+	}
+
+	systable_endscan(sysscan);
+	heap_close(mvcrel, RowExclusiveLock);
+}
+
+
 /*
  *	do_analyze_rel() -- analyze one relation, recursively or not
  *
@@ -538,6 +670,10 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
 			MemoryContextResetAndDeleteChildren(col_context);
 		}
 
+		/* update multivariate distinctness if requested */
+		update_mv_distinct(onerel->rd_id, vacattrstats, attr_cnt,
+						   numrows, totalrows);
+
 		if (hasindex)
 			compute_index_stats(onerel, totalrows,
 								indexdata, nindexes,
@@ -1676,6 +1812,16 @@ typedef struct
 	int			tupno;			/* position index for tuple it came from */
 } ScalarItem;
 
+#define VI_UBITS	(CHAR_BIT * sizeof(unsigned))
+#define VI_BSET(bm,an) ((bm)[(an)/VI_UBITS] |= (unsigned)1 << ((an)%VI_UBITS))
+#define VI_ISBSET(bm,an) ((bm)[(an)/VI_UBITS] & ((unsigned)1 << ((an)%VI_UBITS)))
+typedef struct
+{
+	unsigned	nulls[MVCOEF_MAX_COLS / VI_UBITS + 1]; /* null bitmap */
+	Datum	   *datums;			/* Values list */
+	int			tupno;			/* position index for tuple it came from */
+} VectorItem;
+
 typedef struct
 {
 	int			count;			/* # of duplicates */
@@ -1698,6 +1844,7 @@ static void compute_scalar_stats(VacAttrStatsP stats,
 					 int samplerows,
 					 double totalrows);
 static int	compare_scalars(const void *a, const void *b, void *arg);
+static int	compare_mv_scalars(const void *a, const void *b, void *arg);
 static int	compare_mcvs(const void *a, const void *b);
 
 
@@ -2628,6 +2775,220 @@ compute_scalar_stats(VacAttrStatsP stats,
 }
 
 /*
+ *	compute_mv_ndistinct() -- compute multicolumn distinctness
+ */ 
+static float4
+compute_mv_ndistinct(int nattrs,
+					 VacAttrStats **colstats,
+					 AnalyzeAttrFetchFunc fetchfunc,
+					 int samplerows,
+					 double totalrows)
+{
+	int			i;
+	int			null_cnt = 0;
+	int			nonnull_cnt = 0;
+	int			toowide_cnt = 0;
+	bool		is_varlena[MVCOEF_MAX_COLS];
+	SortSupportData ssup[MVCOEF_MAX_COLS];
+	Datum	   *datums;
+	VectorItem *values;
+	int			values_cnt = 0;
+	int		   *tupnoLink;
+	StdAnalyzeData *mystats[MVCOEF_MAX_COLS];
+	float4		fndistinct;
+
+	Assert (nattrs > 1 && nattrs <= MVCOEF_MAX_COLS);
+
+	/* set up row storage for sorting */
+	datums = (Datum*) palloc(samplerows * nattrs * sizeof(Datum));
+	values = (VectorItem *) palloc0(samplerows * sizeof(VectorItem));
+	tupnoLink = (int *) palloc(samplerows * sizeof(int));
+
+	for (i = 0 ; i < samplerows ; i++)
+		values[i].datums = &datums[i * nattrs];
+
+	memset(ssup, 0, sizeof(ssup));
+	for (i = 0 ; i < nattrs ; i++)
+	{
+		VacAttrStats *vas = colstats[i];
+
+		is_varlena[i] =
+			!vas->attrtype->typbyval && vas->attrtype->typlen == -1;
+		mystats[i] =
+			(StdAnalyzeData*) vas->extra_data;
+
+		ssup[i].ssup_cxt = CurrentMemoryContext;
+		/* We always use the default collation for statistics */
+		ssup[i].ssup_collation = DEFAULT_COLLATION_OID;
+		/* Don't use abbreviated key conversion for now.  See
+		   compute_scalar_stats. */
+		/* either will do for sort order and nulls_first */
+		PrepareSortSupportFromOrderingOp(mystats[i]->ltopr, &ssup[i]);
+	}
+	ssup[nattrs].ssup_cxt = NULL;
+
+	/* Initial scan to find sortable values */
+	for (i = 0; i < samplerows; i++)
+	{
+		Datum	colvalues[nattrs];
+		bool	colnulls[nattrs];
+		bool	all_is_null = true;
+		bool	toowide = false;
+		int 	j;
+
+		vacuum_delay_point();
+
+		for (j = 0 ; j < nattrs ; j++)
+		{
+			colnulls[nattrs] = false;
+
+			colvalues[j] = fetchfunc(colstats[j], i, &colnulls[j]);
+
+			/*
+			 * Check for null/nonnull. Don't abandon this row if null is
+			 * contained
+			 */
+			if (colnulls[j])
+				continue;
+
+			all_is_null = false;
+
+			/*
+			 * If it's a variable-width field, detoast it now to avoid
+			 * repeated detoasting during the comparisons. If at least one of
+			 * the columns is too long, exclude this row from accurate
+			 * counting.
+			 */
+			if (is_varlena[j])
+			{
+				if (toast_raw_datum_size(colvalues[j]) > WIDTH_THRESHOLD)
+				{
+					toowide = true;
+					break;
+				}
+				colvalues[j] = PointerGetDatum(PG_DETOAST_DATUM(colvalues[j]));
+			}
+		}
+		if (all_is_null)
+		{
+			null_cnt++;
+			continue;
+		}
+		nonnull_cnt++;
+
+		if (toowide)
+		{
+			toowide_cnt++;
+			continue;
+		}
+
+		/* Add it to the list to be sorted */
+		for (j = 0 ; j < nattrs ; j++)
+		{
+			if (colnulls[j])
+				VI_BSET(values[values_cnt].nulls, j);
+			else
+				values[values_cnt].datums[j] = colvalues[j];
+		}
+
+		values[values_cnt].tupno = values_cnt;
+		tupnoLink[values_cnt] = values_cnt;
+		values_cnt++;
+	}
+
+	/* We can only compute real stats if we found some sortable values. */
+	if (values_cnt > 0)
+	{
+		int			ndistinct,	/* # distinct values in sample */
+					nmultiple,	/* # that appear multiple times */
+					dups_cnt;
+		CompareScalarsContext cxt;
+
+		/* Sort the collected values */
+		cxt.ssup = ssup;
+		cxt.tupnoLink = tupnoLink;
+		qsort_arg((void *) values, values_cnt, sizeof(VectorItem),
+				  compare_mv_scalars, (void *) &cxt);
+
+		ndistinct = 0;
+		nmultiple = 0;
+		dups_cnt = 0;
+		for (i = 0; i < values_cnt; i++)
+		{
+			int			tupno = values[i].tupno;
+
+			dups_cnt++;
+			if (tupnoLink[tupno] == tupno)
+			{
+				/* Reached end of duplicates of this value */
+				ndistinct++;
+				if (dups_cnt > 1)
+					nmultiple++;
+
+				dups_cnt = 0;
+			}
+		}
+
+		if (nmultiple == 0)
+		{
+			/* If we found no repeated values, assume it's a unique column */
+			fndistinct = totalrows;
+		}
+		else if (toowide_cnt == 0 && nmultiple == ndistinct)
+		{
+			/*
+			 * Every value in the sample appeared more than once.  Assume the
+			 * columns have just these values.
+			 */
+			fndistinct = (float4)ndistinct;
+		}
+		else
+		{
+			/*----------
+			 * Estimate the number of distinct values using the estimator.
+			 * See compute_scalar_stats for the algorithm.
+			 *----------
+			 */
+			int			f1 = ndistinct - nmultiple + toowide_cnt;
+			int			d = f1 + nmultiple;
+			double		numer,
+						denom,
+						stadistinct;
+
+			numer = (double) samplerows *(double) d;
+
+			denom = (double) (samplerows - f1) +
+				(double) f1 *(double) samplerows / totalrows;
+
+			stadistinct = numer / denom;
+			/* Clamp to sane range in case of roundoff error */
+			if (stadistinct < (double) d)
+				stadistinct = (double) d;
+			if (stadistinct > totalrows)
+				stadistinct = totalrows;
+			fndistinct = floor(stadistinct + 0.5);
+		}
+	}
+	else if (nonnull_cnt > 0)
+	{
+		/* We found some non-null values, but they were all too wide */
+		Assert(nonnull_cnt == toowide_cnt);
+
+		/* Assume all too-wide values are distinct, so all rows are unique */
+		fndistinct = totalrows;
+	}
+	else if (null_cnt > 0)
+	{
+		/* We found only all-null rows; assume the rows are entirely null */
+		fndistinct =  0.0;		/* "unknown" */
+	}
+
+	/* We don't need to bother cleaning up any of our temporary palloc's */
+	return fndistinct;
+}
+
+
+/*
  * qsort_arg comparator for sorting ScalarItems
  *
  * Aside from sorting the items, we update the tupnoLink[] array
@@ -2664,6 +3025,45 @@ compare_scalars(const void *a, const void *b, void *arg)
 	return ta - tb;
 }
 
+static int
+compare_mv_scalars(const void *a, const void *b, void *arg)
+{
+	CompareScalarsContext *cxt = (CompareScalarsContext *) arg;
+	VectorItem *va = (VectorItem*)a;
+	VectorItem *vb = (VectorItem*)b;
+	Datum		da, db;
+	int			ta, tb;
+	int			compare;
+	int i;
+
+	for (i = 0 ; cxt->ssup[i].ssup_cxt ; i++)
+	{
+		bool da_isnull = VI_ISBSET(va->nulls, i);
+		bool db_isnull = VI_ISBSET(vb->nulls, i);
+		da = va->datums[i];
+		db = vb->datums[i];
+
+		compare = ApplySortComparator(da, da_isnull, db, db_isnull, &cxt->ssup[i]);
+		if (compare != 0)
+			return compare;
+	}
+
+	/*
+	 * The two datums are equal, so update cxt->tupnoLink[].
+	 */
+	ta = va->tupno;
+	tb = vb->tupno;
+	if (cxt->tupnoLink[ta] < tb)
+		cxt->tupnoLink[ta] = tb;
+	if (cxt->tupnoLink[tb] < ta)
+		cxt->tupnoLink[tb] = ta;
+
+	/*
+	 * For equal datums, sort by tupno
+	 */
+	return ta - tb;
+}
+
 /*
  * qsort comparator for sorting ScalarMCVItems by position
  */
-- 
1.8.3.1

0003-Make-use-of-multivariate-coefficeient-in-estimation-.patchtext/x-patch; charset=us-asciiDownload
>From 19f5733988c75732a6fd4612c822fa99f05ed9ab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 6 Aug 2015 17:31:20 +0900
Subject: [PATCH 3/6] Make use of multivariate coefficeient in estimation of
 clauselist.

This is rather simple but walks on whole clauselist to collect column
information..
---
 src/backend/optimizer/path/clausesel.c | 100 +++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index dcac1c1..b28c271 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -14,8 +14,15 @@
  */
 #include "postgres.h"
 
+#include "access/genam.h"
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "catalog/indexing.h"
+#include "catalog/pg_class.h"
 #include "catalog/pg_operator.h"
+#include "catalog/pg_mvcoefficient.h"
 #include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
@@ -43,6 +50,96 @@ static void addRangeClause(RangeQueryClause **rqlist, Node *clause,
 			   bool varonleft, bool isLTsel, Selectivity s2);
 
 
+static bool
+collect_collist_walker(Node *node, Bitmapset **colsetlist)
+{
+	if (node == NULL)
+		return false;
+	if (IsA(node, Var))
+	{
+		Var *var = (Var*)node;
+
+		if (AttrNumberIsForUserDefinedAttr(var->varattno))
+			colsetlist[var->varno] = 
+				bms_add_member(colsetlist[var->varno], var->varattno);
+	}
+	return expression_tree_walker(node, collect_collist_walker,
+								  (void*)colsetlist);
+}
+
+/* Find multivariate distinctness coefficient for clauselist */
+static double
+find_mv_coeffeicient(PlannerInfo *root, List *clauses)
+{
+	int relid;
+	ListCell   *l;
+	Bitmapset **colsetlist = NULL;
+	double mv_coef = 1.0;
+
+	/* Collect columns this clauselist on */
+	colsetlist = (Bitmapset**)
+		palloc0(root->simple_rel_array_size * sizeof(Bitmapset*));
+
+	foreach(l, clauses)
+	{
+		RestrictInfo *rti = (RestrictInfo *) lfirst(l);
+
+		/* Consider only EC-related clauses */
+		if (rti->left_ec && (rti->left_ec == rti->right_ec))
+		{
+			if (IsA(rti, RestrictInfo))
+				collect_collist_walker((Node*)rti->clause, colsetlist);
+		}
+	}
+
+	/* Find pg_mv_coefficient entries match this columlist */
+	for (relid = 1 ; relid < root->simple_rel_array_size ; relid++)
+	{
+		Relation mvcrel;
+		SysScanDesc sscan;
+		ScanKeyData skeys[1];
+		HeapTuple tuple;
+		
+		/* at least two colums required for on rel*/
+		if (bms_num_members(colsetlist[relid]) < 2) continue;
+
+		/* tables other than ordinary ones have no mv statistics */
+		if (root->simple_rte_array[relid]->rtekind != RTE_RELATION ||
+			root->simple_rte_array[relid]->relkind != RELKIND_RELATION)
+			continue;
+
+		ScanKeyInit(&skeys[0],
+					Anum_pg_mvcoefficient_mvcreloid,
+					BTEqualStrategyNumber, F_OIDEQ,
+					ObjectIdGetDatum(root->simple_rte_array[relid]->relid));
+		
+		mvcrel = heap_open(MvCoefficientRelationId, AccessShareLock);
+		sscan = systable_beginscan(mvcrel, MvCoefficientIndexId, true,
+								   NULL, 1, skeys);
+		while (HeapTupleIsValid(tuple = systable_getnext(sscan)))
+		{
+			int i;
+			Bitmapset *mvccols = NULL;
+			Form_pg_mvcoefficient mvc =
+				(Form_pg_mvcoefficient) GETSTRUCT (tuple);
+			
+			for (i = 0 ; i < mvc->mvcnattrs ; i++)
+				mvccols = bms_add_member(mvccols, mvc->mvcattrs.values[i]);
+
+			if (!bms_is_subset(mvccols, colsetlist[relid]))
+				continue;
+
+			/* Prefer smaller one */
+			if (mvc->mvccoefficient > 0 && mvc->mvccoefficient < mv_coef)
+				mv_coef = mvc->mvccoefficient;
+		}
+		systable_endscan(sscan);
+		heap_close(mvcrel, AccessShareLock);
+	}
+
+	return mv_coef;
+}
+
 /****************************************************************************
  *		ROUTINES TO COMPUTE SELECTIVITIES
  ****************************************************************************/
@@ -200,6 +297,9 @@ clauselist_selectivity(PlannerInfo *root,
 		s1 = s1 * s2;
 	}
 
+	/* Try multivariate distinctness correction for clauses */
+	s1 /= find_mv_coeffeicient(root, clauses);
+
 	/*
 	 * Now scan the rangequery pair list.
 	 */
-- 
1.8.3.1

0004-Syntactical-part-of-multivariate-coefficient.patchtext/x-patch; charset=us-asciiDownload
>From 693c9b0f67a0fabc38212e436453a29f9ae275c5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 6 Aug 2015 16:59:47 +0900
Subject: [PATCH 4/6] Syntactical part of multivariate coefficient.

This adds new two syntax

 ALTER TABLE <name> (ADD|DROP) STATISTICS (<type>,..) ON (<col>,..)

To avoid shift/reduce conflict caused by adding these syntaxes, two
existing rules are splitted into two rules each. They simply add or
remove one definition tuple into/from pg_mvcoefficeint and currently
only one definition for one relation is allowed. The tuples are
removed on deletion of the target relation or covering columns.
---
 src/backend/catalog/heap.c       |  49 ++++++++++
 src/backend/commands/tablecmds.c | 191 +++++++++++++++++++++++++++++++++++++++
 src/backend/parser/gram.y        |  46 +++++++++-
 src/include/catalog/heap.h       |   1 +
 src/include/nodes/parsenodes.h   |   2 +
 5 files changed, 287 insertions(+), 2 deletions(-)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index d04e94d..32fafb7 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -46,6 +46,7 @@
 #include "catalog/pg_constraint.h"
 #include "catalog/pg_foreign_table.h"
 #include "catalog/pg_inherits.h"
+#include "catalog/pg_mvcoefficient.h"
 #include "catalog/pg_namespace.h"
 #include "catalog/pg_statistic.h"
 #include "catalog/pg_tablespace.h"
@@ -2647,6 +2648,51 @@ cookConstraint(ParseState *pstate,
 }
 
 
+void
+RemoveMvStatistics(Oid relid, AttrNumber attnum)
+{
+	Relation pgmvcoef;
+	SysScanDesc scan;
+	ScanKeyData key[1];
+	HeapTuple	tuple;
+
+	pgmvcoef = heap_open(MvCoefficientRelationId, RowExclusiveLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_mvcoefficient_mvcreloid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(relid));
+
+	scan = systable_beginscan(pgmvcoef, MvCoefficientIndexId, true,
+							  NULL, 1, key);
+
+	while (HeapTupleIsValid(tuple = systable_getnext(scan)))
+	{
+		if (attnum != 0)
+		{
+			/* Check if this attnum is contained in this mvc entry */
+			int i;
+			Form_pg_mvcoefficient mvc =
+				(Form_pg_mvcoefficient) GETSTRUCT (tuple);
+
+			for (i = 0 ; i < mvc->mvcnattrs ; i++)
+			{
+				if (mvc->mvcattrs.values[i] == attnum)
+					break;
+			}
+
+			/* This mvcoef entry has no relation with this attribute set */
+			if (i == mvc->mvcnattrs)
+				continue;
+		}
+
+		simple_heap_delete(pgmvcoef, &tuple->t_self);
+	}
+
+	systable_endscan(scan);
+
+	heap_close(pgmvcoef, RowExclusiveLock);
+}
 /*
  * RemoveStatistics --- remove entries in pg_statistic for a rel or column
  *
@@ -2690,6 +2736,9 @@ RemoveStatistics(Oid relid, AttrNumber attnum)
 	systable_endscan(scan);
 
 	heap_close(pgstatistic, RowExclusiveLock);
+
+	/* Remove mutivariate coefficient entry */
+	RemoveMvStatistics(relid, attnum);
 }
 
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 126b119..843a14e 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -36,6 +36,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_inherits_fn.h"
 #include "catalog/pg_namespace.h"
+#include "catalog/pg_mvcoefficient.h"
 #include "catalog/pg_opclass.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/pg_trigger.h"
@@ -402,6 +403,7 @@ static bool ATPrepChangePersistence(Relation rel, bool toLogged);
 static void ATPrepSetTableSpace(AlteredTableInfo *tab, Relation rel,
 					char *tablespacename, LOCKMODE lockmode);
 static void ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode);
+static void	ATExecAddDropMvStatistics(Relation rel, List *param, bool drop);
 static void ATExecSetRelOptions(Relation rel, List *defList,
 					AlterTableType operation,
 					LOCKMODE lockmode);
@@ -3015,6 +3017,8 @@ AlterTableGetLockLevel(List *cmds)
 				 * applies: we don't currently allow concurrent catalog
 				 * updates.
 				 */
+			case AT_AddMvStatistics:
+			case AT_DropMvStatistics:
 			case AT_SetStatistics:		/* Uses MVCC in getTableAttrs() */
 			case AT_ClusterOn:	/* Uses MVCC in getIndexes() */
 			case AT_DropCluster:		/* Uses MVCC in getIndexes() */
@@ -3175,6 +3179,8 @@ ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
 			break;
 		case AT_SetOptions:		/* ALTER COLUMN SET ( options ) */
 		case AT_ResetOptions:	/* ALTER COLUMN RESET ( options ) */
+		case AT_AddMvStatistics:
+		case AT_DropMvStatistics:
 			ATSimplePermissions(rel, ATT_TABLE | ATT_MATVIEW | ATT_INDEX | ATT_FOREIGN_TABLE);
 			/* This command never recurses */
 			pass = AT_PASS_MISC;
@@ -3591,6 +3597,12 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
 			 * Nothing to do here; Phase 3 does the work
 			 */
 			break;
+		case AT_AddMvStatistics: /* ADD STATISTICS */
+			ATExecAddDropMvStatistics(rel, (List *)cmd->def, false);
+			break;
+		case AT_DropMvStatistics: /* DROP STATISTICS */
+			ATExecAddDropMvStatistics(rel, (List *)cmd->def, true);
+			break;
 		case AT_SetRelOptions:	/* SET (...) */
 		case AT_ResetRelOptions:		/* RESET (...) */
 		case AT_ReplaceRelOptions:		/* replace entire option list */
@@ -9300,6 +9312,185 @@ ATPrepSetTableSpace(AlteredTableInfo *tab, Relation rel, char *tablespacename, L
 	tab->newTableSpace = tablespaceId;
 }
 
+static int
+cmpint16(const void *p1, const void *p2)
+{
+	int a = *(int16 *)p1;
+	int b = *(int16 *)p2;
+
+	return a - b;
+}
+
+static void
+ATExecAddDropMvStatistics(Relation rel, List * params, bool drop)
+{
+	Oid			relid = RelationGetRelid(rel);
+	int			ncols = 0;
+	int16		colids[MVCOEF_MAX_COLS];
+	Relation 	mvcrel;
+	HeapTuple	oldtup, newtup;
+	ScanKeyData	scankey;
+	SysScanDesc	sysscan;
+	Datum	values[Natts_pg_mvcoefficient];
+	bool	nulls[Natts_pg_mvcoefficient];
+	bool	replaces[Natts_pg_mvcoefficient];
+	List	   *stattypes = (List*) linitial(params);
+	List	   *colnames  = (List*) lsecond(params);
+	char	   *stattype = strVal(linitial(stattypes));
+	ListCell   *lc;
+	int i;
+
+	if (list_length(stattypes) > 1)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("multiple stat types specfied")));
+	
+	if(strcasecmp(stattype, "mvndistinct") != 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("unknown multivariate stat type: %s", stattype)));
+
+	if (colnames && list_length(colnames) < 2)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("at least two columns needed")));
+
+	if (list_length(colnames) > MVCOEF_MAX_COLS)
+		ereport(ERROR,
+				(errcode(ERRCODE_TOO_MANY_COLUMNS),
+				 errmsg("number of columns (%d) exceeds limit (%d)",
+						list_length(colnames), MVCOEF_MAX_COLS)));
+
+	foreach (lc, colnames)
+	{
+		char	   *attname = strVal(lfirst (lc));
+		HeapTuple	atttup;
+		int16		colid;
+
+		atttup = SearchSysCacheAttName(relid, attname);
+		if (!HeapTupleIsValid(atttup))
+			ereport(ERROR,
+					(errcode(ERRCODE_UNDEFINED_COLUMN),
+					 errmsg("column \"%s\" of relation \"%s\" does not exist",
+							attname, get_rel_name(relid))));
+		colid = ((Form_pg_attribute) GETSTRUCT(atttup))->attnum;
+		ReleaseSysCache(atttup);
+
+		if (!AttrNumberIsForUserDefinedAttr(colid))
+			ereport(ERROR,
+					(errcode(ERRCODE_UNDEFINED_COLUMN),
+					 errmsg("system columns cannot be specified")));
+			
+		for (i = 0 ; i < ncols ; i++)
+		{
+			if (colids[i] == colid)
+				ereport(ERROR,
+						(errcode(ERRCODE_DUPLICATE_COLUMN),
+						 errmsg("column \"%s\" specified more than once",
+							 attname)));
+		}
+
+		colids[ncols++] = colid;
+	}
+
+	mvcrel = heap_open(MvCoefficientRelationId, RowExclusiveLock);
+
+	ScanKeyInit(&scankey,
+				Anum_pg_mvcoefficient_mvcreloid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(relid));
+	sysscan = systable_beginscan(mvcrel, MvCoefficientIndexId, true,
+								 NULL, 1, &scankey);
+
+	/* colum list is required to be sorted in comparison */
+	qsort(colids, ncols, sizeof(int16), cmpint16);
+
+	if (drop)
+	{
+		bool dropped = false;
+
+		while (HeapTupleIsValid(oldtup = systable_getnext(sysscan)))
+		{
+			Form_pg_mvcoefficient mvc =
+				(Form_pg_mvcoefficient) GETSTRUCT (oldtup);
+			int i;
+
+			/* Fast exit, number of columns don't match  */
+			if (mvc->mvcnattrs != ncols)
+				continue;
+
+			/*
+			 * Check for individual columns. This assumes that both of the
+			 * column vector in mvc and colids are sorted.
+			 */
+			for (i = 0 ; i < ncols ; i++)
+			{
+				if (mvc->mvcattrs.values[i] != colids[i])
+					break;
+			}
+
+			/*
+			 * Remove it if match. Unconditionally do this for the case of
+			 * ncols == 0.
+			 */
+			if (i == ncols)
+			{
+				simple_heap_delete(mvcrel, &oldtup->t_self);
+				dropped = true;
+			}
+		}
+
+		if (!dropped)
+			ereport(ERROR,
+					(errcode(ERRCODE_NO_DATA_FOUND),
+					 errmsg("no matching statistics found")));
+	}
+	else
+	{
+		int2vector *colvec;
+
+		oldtup = systable_getnext(sysscan);
+
+		for (i = 0; i < Natts_pg_mvcoefficient ; ++i)
+		{
+			nulls[i] = false;
+			values[i] = false;
+			replaces[i] = false;
+		}
+
+		colvec = buildint2vector(colids, ncols);
+
+		values[Anum_pg_mvcoefficient_mvcreloid - 1] = ObjectIdGetDatum(relid);
+		values[Anum_pg_mvcoefficient_mvcnattrs - 1] = Int16GetDatum(ncols);
+		values[Anum_pg_mvcoefficient_mvccoefficient - 1]  = Float8GetDatum(1.0);
+		values[Anum_pg_mvcoefficient_mvcattrs - 1]  = PointerGetDatum(colvec);
+
+		if (HeapTupleIsValid(oldtup))
+		{
+			replaces[Anum_pg_mvcoefficient_mvcnattrs - 1] = true;
+			replaces[Anum_pg_mvcoefficient_mvccoefficient - 1] = true;
+			replaces[Anum_pg_mvcoefficient_mvcattrs - 1] = true;
+			newtup = heap_modify_tuple(oldtup, RelationGetDescr(mvcrel),
+									   values, nulls, replaces);
+			simple_heap_update(mvcrel, &oldtup->t_self, newtup);
+
+			elog(NOTICE,
+				 "multivariate distinctness for relation %s is cleard",
+				 get_rel_name(relid));
+		}
+		else
+		{
+			newtup = heap_form_tuple(RelationGetDescr(mvcrel), values, nulls);
+			simple_heap_insert(mvcrel, newtup);
+		}
+		CatalogUpdateIndexes(mvcrel, newtup);
+	}
+
+	systable_endscan(sysscan);
+
+	heap_close(mvcrel, RowExclusiveLock);
+}
+
 /*
  * Set, reset, or replace reloptions.
  */
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 1efc6d6..a637df0 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2033,7 +2033,7 @@ alter_table_cmd:
 					$$ = (Node *)n;
 				}
 			/* ALTER TABLE <name> DROP [COLUMN] IF EXISTS <colname> [RESTRICT|CASCADE] */
-			| DROP opt_column IF_P EXISTS ColId opt_drop_behavior
+			| DROP COLUMN IF_P EXISTS ColId opt_drop_behavior
 				{
 					AlterTableCmd *n = makeNode(AlterTableCmd);
 					n->subtype = AT_DropColumn;
@@ -2042,8 +2042,17 @@ alter_table_cmd:
 					n->missing_ok = TRUE;
 					$$ = (Node *)n;
 				}
+			| DROP IF_P EXISTS ColId opt_drop_behavior
+				{
+					AlterTableCmd *n = makeNode(AlterTableCmd);
+					n->subtype = AT_DropColumn;
+					n->name = $4;
+					n->behavior = $5;
+					n->missing_ok = TRUE;
+					$$ = (Node *)n;
+				}
 			/* ALTER TABLE <name> DROP [COLUMN] <colname> [RESTRICT|CASCADE] */
-			| DROP opt_column ColId opt_drop_behavior
+			| DROP COLUMN ColId opt_drop_behavior
 				{
 					AlterTableCmd *n = makeNode(AlterTableCmd);
 					n->subtype = AT_DropColumn;
@@ -2052,6 +2061,15 @@ alter_table_cmd:
 					n->missing_ok = FALSE;
 					$$ = (Node *)n;
 				}
+			| DROP ColId opt_drop_behavior
+				{
+					AlterTableCmd *n = makeNode(AlterTableCmd);
+					n->subtype = AT_DropColumn;
+					n->name = $2;
+					n->behavior = $3;
+					n->missing_ok = FALSE;
+					$$ = (Node *)n;
+				}
 			/*
 			 * ALTER TABLE <name> ALTER [COLUMN] <colname> [SET DATA] TYPE <typename>
 			 *		[ USING <expression> ]
@@ -2087,6 +2105,14 @@ alter_table_cmd:
 					n->def = $2;
 					$$ = (Node *)n;
 				}
+			/* ALTER TABLE <name> ADD STATISTICS (...) ON ( <cols> )  */
+			| ADD_P STATISTICS '(' name_list ')' ON '(' columnList ')'
+				{
+					AlterTableCmd *n = makeNode(AlterTableCmd);
+					n->subtype = AT_AddMvStatistics;
+					n->def = (Node*) list_make2((Node*) $4, (Node*) $8);
+					$$ = (Node *)n;
+				}
 			/* ALTER TABLE <name> ALTER CONSTRAINT ... */
 			| ALTER CONSTRAINT name ConstraintAttributeSpec
 				{
@@ -2130,6 +2156,22 @@ alter_table_cmd:
 					n->missing_ok = FALSE;
 					$$ = (Node *)n;
 				}
+			/* ALTER TABLE <name> DROP STATISTICS (...) ON ( <cols> )  */
+			| DROP STATISTICS '(' name_list ')' ON '(' columnList ')'
+				{
+					AlterTableCmd *n = makeNode(AlterTableCmd);
+					n->subtype = AT_DropMvStatistics;
+					n->def = (Node*) list_make2((Node*) $4, (Node*) $8);
+					$$ = (Node *)n;
+				}
+			/* ALTER TABLE <name> DROP STATISTICS (...) */
+			| DROP STATISTICS '(' name_list ')'
+				{
+					AlterTableCmd *n = makeNode(AlterTableCmd);
+					n->subtype = AT_DropMvStatistics;
+					n->def = (Node*) list_make2((Node*) $4, NULL);
+					$$ = (Node *)n;
+				}
 			/* ALTER TABLE <name> SET WITH OIDS  */
 			| SET WITH OIDS
 				{
diff --git a/src/include/catalog/heap.h b/src/include/catalog/heap.h
index e6ac394..30d2acd 100644
--- a/src/include/catalog/heap.h
+++ b/src/include/catalog/heap.h
@@ -119,6 +119,7 @@ extern void RemoveAttrDefault(Oid relid, AttrNumber attnum,
 				  DropBehavior behavior, bool complain, bool internal);
 extern void RemoveAttrDefaultById(Oid attrdefId);
 extern void RemoveStatistics(Oid relid, AttrNumber attnum);
+extern void RemoveMvStatistics(Oid relid, AttrNumber attnum);
 
 extern Form_pg_attribute SystemAttributeDefinition(AttrNumber attno,
 						  bool relhasoids);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index f0dcd2f..479b8bc 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1513,6 +1513,8 @@ typedef enum AlterTableType
 	AT_ReplicaIdentity,			/* REPLICA IDENTITY */
 	AT_EnableRowSecurity,		/* ENABLE ROW SECURITY */
 	AT_DisableRowSecurity,		/* DISABLE ROW SECURITY */
+	AT_AddMvStatistics,			/* ADD STATISTICS (...) ON (...)  */
+	AT_DropMvStatistics,		/* DROP STATISTICS (...) ON (...)  */
 	AT_GenericOptions			/* OPTIONS (...) */
 } AlterTableType;
 
-- 
1.8.3.1

0005-Add-btree-operator-class-for-int2vector.patchtext/x-patch; charset=us-asciiDownload
>From 6e71473e8ea615bd5f047d48c7a99e3f7e47940a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 6 Aug 2015 21:11:55 +0900
Subject: [PATCH 5/6] Add btree operator class for int2vector.

Add new opclass for use for pg_mvcoefficient.
---
 src/backend/access/nbtree/nbtcompare.c | 24 ++++++++++++++++++++++
 src/backend/utils/adt/int.c            | 37 ++++++++++++++++++++++++++++++++++
 src/backend/utils/cache/syscache.c     | 10 ++++-----
 src/include/catalog/pg_amop.h          | 10 +++++++++
 src/include/catalog/pg_amproc.h        |  1 +
 src/include/catalog/pg_opclass.h       |  1 +
 src/include/catalog/pg_operator.h      | 12 ++++++++++-
 src/include/catalog/pg_opfamily.h      |  1 +
 src/include/catalog/pg_proc.h          |  8 ++++++++
 src/include/utils/builtins.h           |  6 ++++++
 src/include/utils/syscache.h           |  2 +-
 11 files changed, 105 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c
index 8ce8009..baa28c0 100644
--- a/src/backend/access/nbtree/nbtcompare.c
+++ b/src/backend/access/nbtree/nbtcompare.c
@@ -308,6 +308,30 @@ btoidvectorcmp(PG_FUNCTION_ARGS)
 }
 
 Datum
+btint2vectorcmp(PG_FUNCTION_ARGS)
+{
+	int2vector  *a = (int2vector *) PG_GETARG_POINTER(0);
+	int2vector  *b = (int2vector *) PG_GETARG_POINTER(1);
+	int			i;
+
+	/* We arbitrarily choose to sort first by vector length */
+	if (a->dim1 != b->dim1)
+		PG_RETURN_INT32(a->dim1 - b->dim1);
+
+	for (i = 0; i < a->dim1; i++)
+	{
+		if (a->values[i] != b->values[i])
+		{
+			if (a->values[i] > b->values[i])
+				PG_RETURN_INT32(1);
+			else
+				PG_RETURN_INT32(-1);
+		}
+	}
+	PG_RETURN_INT32(0);
+}
+
+Datum
 btcharcmp(PG_FUNCTION_ARGS)
 {
 	char		a = PG_GETARG_CHAR(0);
diff --git a/src/backend/utils/adt/int.c b/src/backend/utils/adt/int.c
index 1a91b29..063d85a 100644
--- a/src/backend/utils/adt/int.c
+++ b/src/backend/utils/adt/int.c
@@ -269,6 +269,43 @@ int2vectoreq(PG_FUNCTION_ARGS)
 	PG_RETURN_BOOL(memcmp(a->values, b->values, a->dim1 * sizeof(int16)) == 0);
 }
 
+Datum
+int2vectorne(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_BOOL(!int2vectoreq(fcinfo));
+}
+
+Datum
+int2vectorlt(PG_FUNCTION_ARGS)
+{
+	int32	cmp = DatumGetInt32(btint2vectorcmp(fcinfo));
+
+	PG_RETURN_BOOL(cmp < 0);
+}
+
+Datum
+int2vectorle(PG_FUNCTION_ARGS)
+{
+	int32	cmp = DatumGetInt32(btint2vectorcmp(fcinfo));
+
+	PG_RETURN_BOOL(cmp <= 0);
+}
+
+Datum
+int2vectorge(PG_FUNCTION_ARGS)
+{
+	int32	cmp = DatumGetInt32(btint2vectorcmp(fcinfo));
+
+	PG_RETURN_BOOL(cmp >= 0);
+}
+
+Datum
+int2vectorgt(PG_FUNCTION_ARGS)
+{
+	int32	cmp = DatumGetInt32(btint2vectorcmp(fcinfo));
+
+	PG_RETURN_BOOL(cmp > 0);
+}
 
 /*****************************************************************************
  *	 PUBLIC ROUTINES														 *
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index a895405..e41268e 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -502,16 +502,16 @@ static const struct cachedesc cacheinfo[] = {
 		},
 		4
 	},
-	{MvCoefficientRelationId,		/* MVCOEFFICIENT */
-		MvCoefficientIndexId,
-		1,
+	{MvCoefficientRelationId,		/* MVCOEFRELIDATTRS */
+	 	MvCoefficientIndexId,
+		2,
 		{
 			Anum_pg_mvcoefficient_mvcreloid,
-			0,
+			Anum_pg_mvcoefficient_mvcattrs,
 			0,
 			0
 		},
-		4
+		8
 	},
 	{NamespaceRelationId,		/* NAMESPACENAME */
 		NamespaceNameIndexId,
diff --git a/src/include/catalog/pg_amop.h b/src/include/catalog/pg_amop.h
index da5fe9d..23eb849 100644
--- a/src/include/catalog/pg_amop.h
+++ b/src/include/catalog/pg_amop.h
@@ -185,6 +185,16 @@ DATA(insert (	1991   30 30 4 s	648 403 0 ));
 DATA(insert (	1991   30 30 5 s	646 403 0 ));
 
 /*
+ *	btree int2vector_ops
+ */
+
+DATA(insert (	3321   22 22 1 s	3317 403 0 ));
+DATA(insert (	3321   22 22 2 s	3319 403 0 ));
+DATA(insert (	3321   22 22 3 s	386  403 0 ));
+DATA(insert (	3321   22 22 4 s	3320 403 0 ));
+DATA(insert (	3321   22 22 5 s	3318 403 0 ));
+
+/*
  *	btree float_ops
  */
 
diff --git a/src/include/catalog/pg_amproc.h b/src/include/catalog/pg_amproc.h
index b57d6e6..72e3557 100644
--- a/src/include/catalog/pg_amproc.h
+++ b/src/include/catalog/pg_amproc.h
@@ -142,6 +142,7 @@ DATA(insert (	3626   3614 3614 1 3622 ));
 DATA(insert (	3683   3615 3615 1 3668 ));
 DATA(insert (	3901   3831 3831 1 3870 ));
 DATA(insert (	4033   3802 3802 1 4044 ));
+DATA(insert (	3321   22 22 1 3327 ));
 
 
 /* hash */
diff --git a/src/include/catalog/pg_opclass.h b/src/include/catalog/pg_opclass.h
index e7b3148..37075ff 100644
--- a/src/include/catalog/pg_opclass.h
+++ b/src/include/catalog/pg_opclass.h
@@ -167,6 +167,7 @@ DATA(insert (	403		bpchar_pattern_ops	PGNSP PGUID 2097 1042 f 0 ));
 DATA(insert (	403		money_ops			PGNSP PGUID 2099  790 t 0 ));
 DATA(insert (	405		bool_ops			PGNSP PGUID 2222   16 t 0 ));
 DATA(insert (	405		bytea_ops			PGNSP PGUID 2223   17 t 0 ));
+DATA(insert (	403		int2vector_ops		PGNSP PGUID 3321   22 t 0 ));
 DATA(insert (	405		int2vector_ops		PGNSP PGUID 2224   22 t 0 ));
 DATA(insert (	403		tid_ops				PGNSP PGUID 2789   27 t 0 ));
 DATA(insert (	405		xid_ops				PGNSP PGUID 2225   28 t 0 ));
diff --git a/src/include/catalog/pg_operator.h b/src/include/catalog/pg_operator.h
index 26c9d4e..7b11942 100644
--- a/src/include/catalog/pg_operator.h
+++ b/src/include/catalog/pg_operator.h
@@ -154,8 +154,18 @@ DATA(insert OID = 389 (  "!!"	   PGNSP PGUID l f f	 0	20	1700  0  0 numeric_fac
 DESCR("deprecated, use ! instead");
 DATA(insert OID = 385 (  "="	   PGNSP PGUID b f t	29	29	16 385	 0 cideq eqsel eqjoinsel ));
 DESCR("equal");
-DATA(insert OID = 386 (  "="	   PGNSP PGUID b f t	22	22	16 386	 0 int2vectoreq eqsel eqjoinsel ));
+DATA(insert OID = 386 (  "="	   PGNSP PGUID b t t	22	22	16 386 3316 int2vectoreq eqsel eqjoinsel ));
 DESCR("equal");
+DATA(insert OID = 3316 (  "<>"	   PGNSP PGUID b f f	22	22	16 3316	386 int2vectorne neqsel neqjoinsel ));
+DESCR("not equal");
+DATA(insert OID = 3317 (  "<"	   PGNSP PGUID b f f	22	22	16 3318	3320 int2vectorlt scalarltsel scalarltjoinsel ));
+DESCR("less than");
+DATA(insert OID = 3318 (  ">"	   PGNSP PGUID b f f	22	22	16 3317	3319 int2vectorgt scalargtsel scalargtjoinsel ));
+DESCR("greater than");
+DATA(insert OID = 3319 (  "<="	   PGNSP PGUID b f f	22	22	16 3320	3318 int2vectorle scalarltsel scalarltjoinsel ));
+DESCR("less than or equal");
+DATA(insert OID = 3320 (  ">="	   PGNSP PGUID b f f	22	22	16 3319	3317 int2vectorge scalargtsel scalargtjoinsel ));
+DESCR("greater than or equal");
 
 DATA(insert OID = 387 (  "="	   PGNSP PGUID b t f	27	27	16 387 402 tideq eqsel eqjoinsel ));
 DESCR("equal");
diff --git a/src/include/catalog/pg_opfamily.h b/src/include/catalog/pg_opfamily.h
index acbc100..eb9e62b 100644
--- a/src/include/catalog/pg_opfamily.h
+++ b/src/include/catalog/pg_opfamily.h
@@ -87,6 +87,7 @@ DATA(insert OID = 1983 (	405		interval_ops	PGNSP PGUID ));
 DATA(insert OID = 1984 (	403		macaddr_ops		PGNSP PGUID ));
 DATA(insert OID = 1985 (	405		macaddr_ops		PGNSP PGUID ));
 DATA(insert OID = 1986 (	403		name_ops		PGNSP PGUID ));
+DATA(insert OID = 3321 (	403		int2vector_ops	PGNSP PGUID ));
 #define NAME_BTREE_FAM_OID 1986
 DATA(insert OID = 1987 (	405		name_ops		PGNSP PGUID ));
 DATA(insert OID = 1988 (	403		numeric_ops		PGNSP PGUID ));
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index ddf7c67..f3da6d6 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -536,6 +536,12 @@ DESCR("convert int2 to int4");
 DATA(insert OID = 314 (  int2			   PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 21	"23" _null_ _null_ _null_ _null_ _null_ i4toi2 _null_ _null_ _null_ ));
 DESCR("convert int4 to int2");
 DATA(insert OID = 315 (  int2vectoreq	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "22 22" _null_ _null_ _null_ _null_ _null_ int2vectoreq _null_ _null_ _null_ ));
+DATA(insert OID = 3322 (  int2vectorne	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "22 22" _null_ _null_ _null_ _null_ _null_ int2vectorne _null_ _null_ _null_ ));
+DATA(insert OID = 3323 (  int2vectorlt	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "22 22" _null_ _null_ _null_ _null_ _null_ int2vectorlt _null_ _null_ _null_ ));
+DATA(insert OID = 3324 (  int2vectorle	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "22 22" _null_ _null_ _null_ _null_ _null_ int2vectorle _null_ _null_ _null_ ));
+DATA(insert OID = 3325 (  int2vectorge	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "22 22" _null_ _null_ _null_ _null_ _null_ int2vectorge _null_ _null_ _null_ ));
+DATA(insert OID = 3326 (  int2vectorgt	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "22 22" _null_ _null_ _null_ _null_ _null_ int2vectorgt _null_ _null_ _null_ ));
+
 DATA(insert OID = 316 (  float8			   PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 701  "23" _null_ _null_ _null_ _null_ _null_	i4tod _null_ _null_ _null_ ));
 DESCR("convert int4 to float8");
 DATA(insert OID = 317 (  int4			   PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 23 "701" _null_ _null_ _null_ _null_ _null_ dtoi4 _null_ _null_ _null_ ));
@@ -644,6 +650,8 @@ DATA(insert OID = 3134 ( btoidsortsupport PGNSP PGUID 12 1 0 0 0 f f f f t f i 1
 DESCR("sort support");
 DATA(insert OID = 404 (  btoidvectorcmp    PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 23 "30 30" _null_ _null_ _null_ _null_ _null_ btoidvectorcmp _null_ _null_ _null_ ));
 DESCR("less-equal-greater");
+DATA(insert OID = 3327 (  btint2vectorcmp  PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 23 "22 22" _null_ _null_ _null_ _null_ _null_ btint2vectorcmp _null_ _null_ _null_ ));
+DESCR("less-equal-greater");
 DATA(insert OID = 357 (  btabstimecmp	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 23 "702 702" _null_ _null_ _null_ _null_ _null_ btabstimecmp _null_ _null_ _null_ ));
 DESCR("less-equal-greater");
 DATA(insert OID = 358 (  btcharcmp		   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 23 "18 18" _null_ _null_ _null_ _null_ _null_ btcharcmp _null_ _null_ _null_ ));
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index fc1679e..89f231a 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -187,6 +187,11 @@ extern Datum int2vectorout(PG_FUNCTION_ARGS);
 extern Datum int2vectorrecv(PG_FUNCTION_ARGS);
 extern Datum int2vectorsend(PG_FUNCTION_ARGS);
 extern Datum int2vectoreq(PG_FUNCTION_ARGS);
+extern Datum int2vectorne(PG_FUNCTION_ARGS);
+extern Datum int2vectorlt(PG_FUNCTION_ARGS);
+extern Datum int2vectorle(PG_FUNCTION_ARGS);
+extern Datum int2vectorge(PG_FUNCTION_ARGS);
+extern Datum int2vectorgt(PG_FUNCTION_ARGS);
 extern Datum int4in(PG_FUNCTION_ARGS);
 extern Datum int4out(PG_FUNCTION_ARGS);
 extern Datum int4recv(PG_FUNCTION_ARGS);
@@ -310,6 +315,7 @@ extern Datum btfloat48cmp(PG_FUNCTION_ARGS);
 extern Datum btfloat84cmp(PG_FUNCTION_ARGS);
 extern Datum btoidcmp(PG_FUNCTION_ARGS);
 extern Datum btoidvectorcmp(PG_FUNCTION_ARGS);
+extern Datum btint2vectorcmp(PG_FUNCTION_ARGS);
 extern Datum btabstimecmp(PG_FUNCTION_ARGS);
 extern Datum btreltimecmp(PG_FUNCTION_ARGS);
 extern Datum bttintervalcmp(PG_FUNCTION_ARGS);
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 1f48d61..f5ce307 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -66,7 +66,7 @@ enum SysCacheIdentifier
 	INDEXRELID,
 	LANGNAME,
 	LANGOID,
-	MVCOEFFICIENT,
+	MVCOEFRELIDATTRS,
 	NAMESPACENAME,
 	NAMESPACEOID,
 	OPERNAMENSP,
-- 
1.8.3.1

0006-Use-modified-index-of-pg_mvcoefficient.patchtext/x-patch; charset=us-asciiDownload
>From 86d867084e6c44f9d7c732f2e315b4dcd34fd711 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 6 Aug 2015 21:15:08 +0900
Subject: [PATCH 6/6] Use modified index of pg_mvcoefficient.

Add column vector to the index of pg_mvcoefficient and make use of it
on DDL operation.
---
 src/backend/commands/tablecmds.c | 74 ++++++++++++++--------------------------
 src/include/catalog/indexing.h   |  2 +-
 2 files changed, 27 insertions(+), 49 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 843a14e..dd37aa8 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -9327,9 +9327,11 @@ ATExecAddDropMvStatistics(Relation rel, List * params, bool drop)
 	Oid			relid = RelationGetRelid(rel);
 	int			ncols = 0;
 	int16		colids[MVCOEF_MAX_COLS];
+	int2vector *colvec = NULL;
 	Relation 	mvcrel;
 	HeapTuple	oldtup, newtup;
-	ScanKeyData	scankey;
+	int			nscankeys;
+	ScanKeyData	scankey[2];
 	SysScanDesc	sysscan;
 	Datum	values[Natts_pg_mvcoefficient];
 	bool	nulls[Natts_pg_mvcoefficient];
@@ -9395,62 +9397,38 @@ ATExecAddDropMvStatistics(Relation rel, List * params, bool drop)
 
 	mvcrel = heap_open(MvCoefficientRelationId, RowExclusiveLock);
 
-	ScanKeyInit(&scankey,
+	nscankeys = 1;
+	ScanKeyInit(&scankey[0],
 				Anum_pg_mvcoefficient_mvcreloid,
 				BTEqualStrategyNumber, F_OIDEQ,
 				ObjectIdGetDatum(relid));
-	sysscan = systable_beginscan(mvcrel, MvCoefficientIndexId, true,
-								 NULL, 1, &scankey);
-
-	/* colum list is required to be sorted in comparison */
-	qsort(colids, ncols, sizeof(int16), cmpint16);
+	if (ncols > 0)
+	{
+		qsort(colids, ncols, sizeof(int16), cmpint16);
+		colvec = buildint2vector(colids, ncols);
 
+		ScanKeyInit(&scankey[1],
+					Anum_pg_mvcoefficient_mvcattrs,
+					BTEqualStrategyNumber, F_INT2VECTOREQ,
+					PointerGetDatum(colvec));
+		nscankeys = 2;
+	}
+	sysscan = systable_beginscan(mvcrel, MvCoefficientIndexId, true,
+								 NULL, nscankeys, scankey);
 	if (drop)
 	{
-		bool dropped = false;
-
+		/* Two or more tuples may be returned for the drop case */
 		while (HeapTupleIsValid(oldtup = systable_getnext(sysscan)))
 		{
-			Form_pg_mvcoefficient mvc =
-				(Form_pg_mvcoefficient) GETSTRUCT (oldtup);
-			int i;
-
-			/* Fast exit, number of columns don't match  */
-			if (mvc->mvcnattrs != ncols)
-				continue;
-
-			/*
-			 * Check for individual columns. This assumes that both of the
-			 * column vector in mvc and colids are sorted.
-			 */
-			for (i = 0 ; i < ncols ; i++)
-			{
-				if (mvc->mvcattrs.values[i] != colids[i])
-					break;
-			}
-
-			/*
-			 * Remove it if match. Unconditionally do this for the case of
-			 * ncols == 0.
-			 */
-			if (i == ncols)
-			{
-				simple_heap_delete(mvcrel, &oldtup->t_self);
-				dropped = true;
-			}
+			simple_heap_delete(mvcrel, &oldtup->t_self);
 		}
-
-		if (!dropped)
-			ereport(ERROR,
-					(errcode(ERRCODE_NO_DATA_FOUND),
-					 errmsg("no matching statistics found")));
 	}
 	else
 	{
-		int2vector *colvec;
-
+		/* At most one tuple is fetched for the add case */
 		oldtup = systable_getnext(sysscan);
 
+		/* Create data for the new tuple */
 		for (i = 0; i < Natts_pg_mvcoefficient ; ++i)
 		{
 			nulls[i] = false;
@@ -9458,8 +9436,7 @@ ATExecAddDropMvStatistics(Relation rel, List * params, bool drop)
 			replaces[i] = false;
 		}
 
-		colvec = buildint2vector(colids, ncols);
-
+		Assert(colvec != NULL);
 		values[Anum_pg_mvcoefficient_mvcreloid - 1] = ObjectIdGetDatum(relid);
 		values[Anum_pg_mvcoefficient_mvcnattrs - 1] = Int16GetDatum(ncols);
 		values[Anum_pg_mvcoefficient_mvccoefficient - 1]  = Float8GetDatum(1.0);
@@ -9467,6 +9444,7 @@ ATExecAddDropMvStatistics(Relation rel, List * params, bool drop)
 
 		if (HeapTupleIsValid(oldtup))
 		{
+			/* Update existing tuple */
 			replaces[Anum_pg_mvcoefficient_mvcnattrs - 1] = true;
 			replaces[Anum_pg_mvcoefficient_mvccoefficient - 1] = true;
 			replaces[Anum_pg_mvcoefficient_mvcattrs - 1] = true;
@@ -9480,14 +9458,14 @@ ATExecAddDropMvStatistics(Relation rel, List * params, bool drop)
 		}
 		else
 		{
-			newtup = heap_form_tuple(RelationGetDescr(mvcrel), values, nulls);
+			/* Insert new tuple  */
+			newtup = heap_form_tuple(RelationGetDescr(mvcrel),
+									 values, nulls);
 			simple_heap_insert(mvcrel, newtup);
 		}
 		CatalogUpdateIndexes(mvcrel, newtup);
 	}
-
 	systable_endscan(sysscan);
-
 	heap_close(mvcrel, RowExclusiveLock);
 }
 
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index fb28747..16ad43d 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -173,7 +173,7 @@ DECLARE_UNIQUE_INDEX(pg_largeobject_loid_pn_index, 2683, on pg_largeobject using
 DECLARE_UNIQUE_INDEX(pg_largeobject_metadata_oid_index, 2996, on pg_largeobject_metadata using btree(oid oid_ops));
 #define LargeObjectMetadataOidIndexId	2996
 
-DECLARE_UNIQUE_INDEX(pg_mvcoefficient_index, 3578, on pg_mvcoefficient using btree(mvcreloid oid_ops));
+DECLARE_UNIQUE_INDEX(pg_mvcoefficient_index, 3578, on pg_mvcoefficient using btree(mvcreloid oid_ops, mvcattrs int2vector_ops));
 #define MvCoefficientIndexId  3578
 
 DECLARE_UNIQUE_INDEX(pg_namespace_nspname_index, 2684, on pg_namespace using btree(nspname name_ops));
-- 
1.8.3.1

#2Bruce Momjian
bruce@momjian.us
In reply to: Kyotaro HORIGUCHI (#1)
Re: Multi-column distinctness.

On Fri, Aug 28, 2015 at 05:33:34PM +0900, Kyotaro HORIGUCHI wrote:

Hello, this patch enables planner to be couscious of inter-column
correlation.

Sometimes two or more columns in a table has some correlation
which brings underestimate, which leads to wrong join method and
ends with slow execution.

Tomas Vondra is now working on heavily-equipped multivariate
statistics for OLAP usage. In contrast, this is a lightly
implemented solution which calculates only the ratio between a
rows estimated by current method and a actual row number. I think
this doesn't conflict with his work except the grammar part.

I am very glad multi-variate statistics are being addressed. I think
this is our most common cause of optimizer failures.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Simon Riggs
simon@2ndQuadrant.com
In reply to: Kyotaro HORIGUCHI (#1)
Re: Multi-column distinctness.

On 28 August 2015 at 09:33, Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Tomas Vondra is now working on heavily-equipped multivariate
statistics for OLAP usage. In contrast, this is a lightly
implemented solution which calculates only the ratio between a
rows estimated by current method and a actual row number. I think
this doesn't conflict with his work except the grammar part.

I think it very obviously does conflict, so I don't see this patch as
appropriate.

If you think a cut version of Tomas' patch is appropriate, then the usual
response is to give a review that says "Tomas, I think a cut down version
is appropriate here, can we reduce the scope of this patch for now?". If
you have done that and he refuses to listen, then a separate patch version
is appropriate. Otherwise we should just reject this second patchset to
avoid confusion and to avoid encouraging people to take this approach.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#2)
Re: Multi-column distinctness.

On 5 September 2015 at 20:46, Bruce Momjian <bruce@momjian.us> wrote:

On Fri, Aug 28, 2015 at 05:33:34PM +0900, Kyotaro HORIGUCHI wrote:

Hello, this patch enables planner to be couscious of inter-column
correlation.

Sometimes two or more columns in a table has some correlation
which brings underestimate, which leads to wrong join method and
ends with slow execution.

Tomas Vondra is now working on heavily-equipped multivariate
statistics for OLAP usage. In contrast, this is a lightly
implemented solution which calculates only the ratio between a
rows estimated by current method and a actual row number. I think
this doesn't conflict with his work except the grammar part.

I am very glad multi-variate statistics are being addressed. I think
this is our most common cause of optimizer failures.

Glad to see you agree.

Why have you said this on *this* patch, when Tomas' patch has been very
obviously addressing this problem since January, with very little traction.

Please add your backing to Tomas' patch.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kyotaro HORIGUCHI (#1)
Re: Multi-column distinctness.

Hello Horiguchi-san,

On 08/28/2015 10:33 AM, Kyotaro HORIGUCHI wrote:

Hello, this patch enables planner to be couscious of inter-column
correlation.

Sometimes two or more columns in a table has some correlation
which brings underestimate, which leads to wrong join method and
ends with slow execution.

Tomas Vondra is now working on heavily-equipped multivariate
statistics for OLAP usage. In contrast, this is a lightly
implemented solution which calculates only the ratio between a
rows estimated by current method and a actual row number. I think
this doesn't conflict with his work except the grammar part.

I'd like to sign up as a reviewer of this patch, if you're OK with that.
I don't see the patch in any of the commitfests, though :-(

Now, a few comments based on reading the patches (will try testing them
in the next few days, hopefully).

1) catalog design
-----------------
I see you've invented a catalog with a slightly different structure than
I use in the multivariate stats patch, storing the coefficient for each
combination of columns in a separate row. The idea in the multivariate
patch is to generate all possible combinations of columns and store the
coefficients packed in a single row, which requires implementing a more
complicated data structure. But your patch does not do that (only
computes coefficient for the one combination).

This is another potential incompatibility with the multivariate patch,
although it's rather in the background and should not be difficult to
rework if needed.

2) find_mv_coeffeicient
---------------------
I don't quite see why this is the right thing

/* Prefer smaller one */
if (mvc->mvccoefficient > 0 && mvc->mvccoefficient < mv_coef)
mv_coef = mvc->mvccoefficient;

i.e. why it's correct to choose the record with the lower coefficient
and not the one with most columns (the largest subset). Maybe there's
some rule that those are in fact the same thing (seems like adding a
column can only lower the coefficient), but it's not quite obvious and
would deserve a comment.

3) single statistics limitation
-------------------------------
What annoys me a bit is that this patch only applies a single statistics
- it won't try "combining" multiple smaller ones (even if they don't
overlap, which shouldn't be that difficult).

4) no sub-combinations
----------------------
Another slightly annoying thing is that the patch computes statistics
only for the specified combination of columns, and no subsets. So if you
have enabled stats on [a,b,c] you won't be able to use that for
conditions on [a,b], for example. Sure, the user can create the stats
manually, but IMO the less burden we place on the user the better. It's
enough that we force users to create the stats at all, we shouldn't
really force them co create all possible combinations.

5) syntax
---------
The syntax might be one of the pain points if we eventually decide to
commit the multivariate stats patch. I have no intention in blocking
this patch for that reasons, but if we might design the syntax to make
it compatible with the multivariate patch, that'd be nice. But it seems
to me the syntax is pretty much the same, no?

I.e. it uses

ADD STATISTICS (options) ON (columns)

just like the multivariate patch, no? Well, it doesn't really check the
stattypes in ATExecAddDropMvStatistics except for checking there's a
single entry, but the syntax seems OK.

BTW mixing ADD and DROP in ATExecAddDropMvStatistics seems a bit
confusing. Maybe two separate methods would be better?

6) GROUP BY
-----------
One thing I'd really love to see is improvement of GROUP BY clauses.
That's one of the main sources of pain in analytical queries, IMHO,
resulting in either OOM issues in Hash Aggregate, or choice of sort
paths where hash would be more efficient.

regards
Tomas

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Simon Riggs (#3)
Re: Multi-column distinctness.

Hello,

On 09/06/2015 10:24 AM, Simon Riggs wrote:

On 28 August 2015 at 09:33, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp
<mailto:horiguchi.kyotaro@lab.ntt.co.jp>> wrote:

Tomas Vondra is now working on heavily-equipped multivariate
statistics for OLAP usage. In contrast, this is a lightly
implemented solution which calculates only the ratio between a
rows estimated by current method and a actual row number. I think
this doesn't conflict with his work except the grammar part.

I think it very obviously does conflict, so I don't see this patch as
appropriate.

If you think a cut version of Tomas' patch is appropriate, then the
usual response is to give a review that says "Tomas, I think a cut down
version is appropriate here, can we reduce the scope of this patch for
now?". If you have done that and he refuses to listen, then a separate
patch version is appropriate. Otherwise we should just reject this
second patchset to avoid confusion and to avoid encouraging people to
take this approach.

FWIW Horiguchi-san is one of the few people who actually took time to
review the multivariate stats patch, and I don't quite see this patch as
conflicting with the multivariate one.

It implements a small subset of the (much larger) multivariate stats
patch, and reusing it within my patch should not be a big deal.
Actually, this type of statistics was proposed by Horiguchi-san himself,
and the multivariate patch does not implement it yet (although I intend
to address that soon).

So no conflict here. Of course, we need to be a bit careful to make it
compatible (especially the syntax part).

kind regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#5)
Re: Multi-column distinctness.

On Sep 6, 2015 10:31, "Tomas Vondra" <tomas.vondra@2ndquadrant.com> wrote:

5) syntax
---------
The syntax might be one of the pain points if we eventually decide to

commit the multivariate stats patch. I have no intention in blocking this
patch for that reasons, but if we might design the syntax to make it
compatible with the multivariate patch, that'd be nice. But it seems to me
the syntax is pretty much the same, no?

I.e. it uses

ADD STATISTICS (options) ON (columns)

just like the multivariate patch, no? Well, it doesn't really check the

stattypes in ATExecAddDropMvStatistics except for checking there's a single
entry, but the syntax seems OK.

BTW mixing ADD and DROP in ATExecAddDropMvStatistics seems a bit

confusing. Maybe two separate methods would be better?

BTW one more comment about the syntax - you ran into the same conflict
between "ADD [COLUMN] column" and "ADD STATISTICS" like I did, but you
solved it by making the COLUMN required while I made STATISTICS keyword.

I'm not enthusiastic about the keyword thing, but making the COLUMN
required is certainly much worse as it breaks many existing scripts. The
keyword inky breaks cases that manipulate "statistics" column.

If any of this is unacceptable, then we probably need to come up with a
different syntax.

Regards

--
Tomas Vondra http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#8Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Simon Riggs (#3)
Re: Multi-column distinctness.

Hello,

Thank you for pointing that. It is one crucial point of this
patch. Sorry for not mentioning on the point.

At Sun, 6 Sep 2015 09:24:48 +0100, Simon Riggs <simon@2ndQuadrant.com> wrote in <CANP8+j+F+DzrzCW1cjK1Up009TAytYN=P_DNsJ4OZEUJEXywjA@mail.gmail.com>

Tomas Vondra is now working on heavily-equipped multivariate
statistics for OLAP usage. In contrast, this is a lightly
implemented solution which calculates only the ratio between a
rows estimated by current method and a actual row number. I think
this doesn't conflict with his work except the grammar part.

I think it very obviously does conflict, so I don't see this patch as
appropriate.

If you think a cut version of Tomas' patch is appropriate, then the usual
response is to give a review that says "Tomas, I think a cut down version
is appropriate here, can we reduce the scope of this patch for now?". If
you have done that and he refuses to listen, then a separate patch version
is appropriate. Otherwise we should just reject this second patchset to
avoid confusion and to avoid encouraging people to take this approach.

You are absolutely right generally and I agree if this is 'a cut
version' of Tomas's patch. I might have wrong concept about size
of a piece of work.

I will discontinue this patch if Tomas and/or Simon, or many
think this as inappropriate to be brought up now (or ever after)
after reading the following explanation.

======
I already asked Tomas to *add* this feature in his patch and got
a reply that it will be after the completion of undergoing work.

Tomas's patch and mine are generally aiming similar objective but
as discussed with Tomas I understood there's some crucial
differences between them.

He is considering more precise and widely-applicable estimation
baesd on a firm theoretical basis, and as described in the
ciation above, it is aiming OLAP usage and allowing rather
complex calculation. It will be reduced through future
discussions but the priority to do so is not so high for now.

Although Tomas's patch is very complex and needs more labor to
complete, resolving the wrong prediction caused by multicolumn
correlation (especially on OLTP usage) is demanded. So I tried a
patch that suit the objective. It has only rough distinctness
coefficient and doesn't have MV-MCV, MV-HISTOGRAM and strict
functional dependency but needs quire small storage and less
calculation.

The two are so different in concrete objective and
characteristics and does not have common functional piece exept
grammer part so I concluded that this patch doesn't break the
foothold (standpoint?) of Toamas's patch and we can continue to
work on it as we did until now.

This is why I think this is not a cut-down version of Tomas's and
dosn't break or conflict with it. But if many don't think so, I
should dismiss this, of course.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Tomas Vondra (#6)
Re: Multi-column distinctness.

Hello,

FWIW Horiguchi-san is one of the few people who actually took time to
review

I personally think such kind of things should not to be counted
in judging this issue:)

the multivariate stats patch, and I don't quite see this patch
as conflicting with the multivariate one.

It implements a small subset of the (much larger) multivariate stats
patch, and reusing it within my patch should not be a big
deal. Actually, this type of statistics was proposed by Horiguchi-san
himself, and the multivariate patch does not implement it yet
(although I intend to address that soon).

Year. I think we can deal Tomas's patch as natural enhancement of
this patch and it is not so bad as a vangard(?).

So no conflict here. Of course, we need to be a bit careful to make it
compatible (especially the syntax part).

I agree with it. They don't make bad conflict but we might stuc
by wrong decision in API, like syntax will.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Tomas Vondra (#7)
Re: Multi-column distinctness.

Hello,

5) syntax
---------
The syntax might be one of the pain points if we eventually decide to

commit the multivariate stats patch. I have no intention in blocking this
patch for that reasons, but if we might design the syntax to make it
compatible with the multivariate patch, that'd be nice. But it seems to me
the syntax is pretty much the same, no?

I.e. it uses

ADD STATISTICS (options) ON (columns)

just like the multivariate patch, no? Well, it doesn't really check the

stattypes in ATExecAddDropMvStatistics except for checking there's a single
entry, but the syntax seems OK.

BTW mixing ADD and DROP in ATExecAddDropMvStatistics seems a bit

confusing. Maybe two separate methods would be better?

No problem.

BTW one more comment about the syntax - you ran into the same conflict
between "ADD [COLUMN] column" and "ADD STATISTICS" like I did, but you
solved it by making the COLUMN required while I made STATISTICS keyword.
I'm not enthusiastic about the keyword thing,

I don't have firm idea that it has the priority than duplication
of syntax, but I followed the policy to reduce keywords. Which do
you think is more acceptable between the additional definition
caused by new keywords and syntax splitting caused by avoiding
them for this case?

but making the COLUMN
required is certainly much worse as it breaks many existing scripts. The
keyword inky breaks cases that manipulate "statistics" column.

Ouch! It is simply by accident, or my lack of carefulness. I will
come up with fixed syntax later..

If any of this is unacceptable, then we probably need to come up with a
different syntax.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kyotaro HORIGUCHI (#10)
Re: Multi-column distinctness.

On 09/07/2015 05:25 AM, Kyotaro HORIGUCHI wrote:

Hello,

but making the COLUMN
required is certainly much worse as it breaks many existing scripts. The
keyword inky breaks cases that manipulate "statistics" column.

Ouch! It is simply by accident, or my lack of carefulness. I will
come up with fixed syntax later..

If any of this is unacceptable, then we probably need to come up with a
different syntax.

I've been thinking about the syntax, and I think both options (making
COLUMN required or making STATISTICS a reserved keyword) will break
something no matter what we do, forcing the users to either always use
ADD COLUMN or quote all the existing uses of "statistics" (as column
names, for example).

Maybe the best solution is to abandon the ALTER TABLE approach entirely,
and instead invent a new set of commands

CREATE STATISTICS
DROP STATISTICS

(ALTER STATISTICS seems a bit excessive at this point).

Another thing is that perhaps we should add names for statistics, just
like we do for constraints, for example. Otherwise the DROP STATISTICS
handling is rather awkward - for example if the user creates stats twice
by mistake, he's unable to drop just one of them.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#11)
Re: Multi-column distinctness.

Hello Kyotaro-san,

On 09/11/2015 06:58 PM, Tomas Vondra wrote:

Maybe the best solution is to abandon the ALTER TABLE approach
entirely, and instead invent a new set of commands

CREATE STATISTICS
DROP STATISTICS

(ALTER STATISTICS seems a bit excessive at this point).

Another thing is that perhaps we should add names for statistics,
just like we do for constraints, for example. Otherwise the DROP
STATISTICS handling is rather awkward - for example if the user
creates stats twice by mistake, he's unable to drop just one of them.

Do you think this modified syntax makes sense? I'll have time to hack on
this over the next few days.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Simon Riggs
simon@2ndQuadrant.com
In reply to: Tomas Vondra (#12)
Re: Multi-column distinctness.

On 19 October 2015 at 20:16, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

Hello Kyotaro-san,

On 09/11/2015 06:58 PM, Tomas Vondra wrote:

Maybe the best solution is to abandon the ALTER TABLE approach
entirely, and instead invent a new set of commands

CREATE STATISTICS
DROP STATISTICS

(ALTER STATISTICS seems a bit excessive at this point).

Another thing is that perhaps we should add names for statistics,
just like we do for constraints, for example. Otherwise the DROP
STATISTICS handling is rather awkward - for example if the user
creates stats twice by mistake, he's unable to drop just one of them.

Do you think this modified syntax makes sense? I'll have time to hack on
this over the next few days.

Koyotaro's changes to force COLUMN to be required won't get through.

ISTM that we could use COLLECT STATISTICS instead of ADD STATISTICS, and
use REMOVE STATISTICS instead of DROP STATISTICS. That way we can use ALTER
TABLE rather than inventing a new command. 5 minute change...

Unless there is some better reason for a whole new command?

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#14Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Simon Riggs (#13)
Re: Multi-column distinctness.

Hi,

On 10/20/2015 01:14 PM, Simon Riggs wrote:

On 19 October 2015 at 20:16, Tomas Vondra <tomas.vondra@2ndquadrant.com
<mailto:tomas.vondra@2ndquadrant.com>> wrote:

Hello Kyotaro-san,

On 09/11/2015 06:58 PM, Tomas Vondra wrote:

Maybe the best solution is to abandon the ALTER TABLE approach
entirely, and instead invent a new set of commands

CREATE STATISTICS
DROP STATISTICS

(ALTER STATISTICS seems a bit excessive at this point).

Another thing is that perhaps we should add names for statistics,
just like we do for constraints, for example. Otherwise the DROP
STATISTICS handling is rather awkward - for example if the user
creates stats twice by mistake, he's unable to drop just one of
them.

Do you think this modified syntax makes sense? I'll have time to
hack on this over the next few days.

Koyotaro's changes to force COLUMN to be required won't get through.

ISTM that we could use COLLECT STATISTICS instead of ADD STATISTICS, and
use REMOVE STATISTICS instead of DROP STATISTICS. That way we can use
ALTER TABLE rather than inventing a new command. 5 minute change...

That seems like a neat idea, actually. I'm not sure COLLECT is a good
choice as it suggest the statistics is actually built, but that only
happens during ANALYZE. But otherwise this seems to solve the issues
with keywords and it's quite simple.

Unless there is some better reason for a whole new command?

Not really. The other proposal (adding names for statistics) does not
require new command. The one thing that would require new command is
building statistics on multiple tables (for join estimation), but I
don't have any idea of how that would actually work.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#14)
Re: Multi-column distinctness.

On Tue, Oct 20, 2015 at 10:51 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Koyotaro's changes to force COLUMN to be required won't get through.

ISTM that we could use COLLECT STATISTICS instead of ADD STATISTICS, and
use REMOVE STATISTICS instead of DROP STATISTICS. That way we can use
ALTER TABLE rather than inventing a new command. 5 minute change...

That seems like a neat idea, actually. I'm not sure COLLECT is a good choice
as it suggest the statistics is actually built, but that only happens during
ANALYZE. But otherwise this seems to solve the issues with keywords and it's
quite simple.

But ADD is no better there. I think ALTER TABLE .. COLLECT STATISTICS
isn't any worse than ALTER TABLE ... CLUSTER ON index_name. In both
cases, it means, when you do this operation, do it this way.

I would suggest that instead of DROP or REMOVE, the opposite should be
ALTER TABLE .. NO COLLECT STATISTICS.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#15)
Re: Multi-column distinctness.

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Oct 20, 2015 at 10:51 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

ISTM that we could use COLLECT STATISTICS instead of ADD STATISTICS, and
use REMOVE STATISTICS instead of DROP STATISTICS. That way we can use
ALTER TABLE rather than inventing a new command. 5 minute change...

That seems like a neat idea, actually. I'm not sure COLLECT is a good choice
as it suggest the statistics is actually built, but that only happens during
ANALYZE. But otherwise this seems to solve the issues with keywords and it's
quite simple.

But ADD is no better there. I think ALTER TABLE .. COLLECT STATISTICS
isn't any worse than ALTER TABLE ... CLUSTER ON index_name. In both
cases, it means, when you do this operation, do it this way.

I would suggest that instead of DROP or REMOVE, the opposite should be
ALTER TABLE .. NO COLLECT STATISTICS.

Why is this an improvement over using already-existing keywords?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#16)
Re: Multi-column distinctness.

On 20 October 2015 at 11:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Oct 20, 2015 at 10:51 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

ISTM that we could use COLLECT STATISTICS instead of ADD STATISTICS,

and

use REMOVE STATISTICS instead of DROP STATISTICS. That way we can use
ALTER TABLE rather than inventing a new command. 5 minute change...

That seems like a neat idea, actually. I'm not sure COLLECT is a good

choice

as it suggest the statistics is actually built, but that only happens

during

ANALYZE. But otherwise this seems to solve the issues with keywords and

it's

quite simple.

But ADD is no better there. I think ALTER TABLE .. COLLECT STATISTICS
isn't any worse than ALTER TABLE ... CLUSTER ON index_name. In both
cases, it means, when you do this operation, do it this way.

I would suggest that instead of DROP or REMOVE, the opposite should be
ALTER TABLE .. NO COLLECT STATISTICS.

Why is this an improvement over using already-existing keywords?

The earlier patch changed the grammar for the DROP (column) subcommand,
which I am saying is not acceptable.

So by using an alternate keyword we are able to keep the existing syntax
untouched.

I suggested the word COLLECT since that is another word commonly used in
conjunction with STATISTICS (at least in DB2 and Teradata).

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#16)
Re: Multi-column distinctness.

On Tue, Oct 20, 2015 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Oct 20, 2015 at 10:51 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

ISTM that we could use COLLECT STATISTICS instead of ADD STATISTICS, and
use REMOVE STATISTICS instead of DROP STATISTICS. That way we can use
ALTER TABLE rather than inventing a new command. 5 minute change...

That seems like a neat idea, actually. I'm not sure COLLECT is a good choice
as it suggest the statistics is actually built, but that only happens during
ANALYZE. But otherwise this seems to solve the issues with keywords and it's
quite simple.

But ADD is no better there. I think ALTER TABLE .. COLLECT STATISTICS
isn't any worse than ALTER TABLE ... CLUSTER ON index_name. In both
cases, it means, when you do this operation, do it this way.

I would suggest that instead of DROP or REMOVE, the opposite should be
ALTER TABLE .. NO COLLECT STATISTICS.

Why is this an improvement over using already-existing keywords?

Well, if we can use existing keywords, that's better still.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#16)
Re: Multi-column distinctness.

On 10/20/2015 05:59 PM, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Oct 20, 2015 at 10:51 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

ISTM that we could use COLLECT STATISTICS instead of ADD STATISTICS, and
use REMOVE STATISTICS instead of DROP STATISTICS. That way we can use
ALTER TABLE rather than inventing a new command. 5 minute change...

That seems like a neat idea, actually. I'm not sure COLLECT is a good choice
as it suggest the statistics is actually built, but that only happens during
ANALYZE. But otherwise this seems to solve the issues with keywords and it's
quite simple.

But ADD is no better there. I think ALTER TABLE .. COLLECT STATISTICS
isn't any worse than ALTER TABLE ... CLUSTER ON index_name. In both
cases, it means, when you do this operation, do it this way.

I would suggest that instead of DROP or REMOVE, the opposite should be
ALTER TABLE .. NO COLLECT STATISTICS.

Why is this an improvement over using already-existing keywords?

The problem is that the actual syntax is ADD [COLUMN], so we can't
simply use ADD STATISTICS as that would mean a conflict in the grammar.
Resolving it means either making COLUMN non-optional, or adding
STATISTICS to reserved keywords - both options break existing code.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#19)
Re: Multi-column distinctness.

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On 10/20/2015 05:59 PM, Tom Lane wrote:

Why is this an improvement over using already-existing keywords?

The problem is that the actual syntax is ADD [COLUMN], so we can't
simply use ADD STATISTICS as that would mean a conflict in the grammar.
Resolving it means either making COLUMN non-optional, or adding
STATISTICS to reserved keywords - both options break existing code.

I'm unconvinced that it cannot be made to work. The proposal was
something like
ALTER TABLE t ADD STATISTICS ON column-names ...
no? ON is already fully reserved, which means that this is not ambiguous.
Or if you would rather not use ON, another way of making it not ambiguous
would be to put the column-names list in parentheses.

It's entirely possible that some refactoring of the grammar would
be needed to make it work, of course.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#20)
Re: Multi-column distinctness.

I wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

The problem is that the actual syntax is ADD [COLUMN], so we can't
simply use ADD STATISTICS as that would mean a conflict in the grammar.
Resolving it means either making COLUMN non-optional, or adding
STATISTICS to reserved keywords - both options break existing code.

I'm unconvinced that it cannot be made to work.

Specifically, this works just fine for me as a variant of alter_table_cmd:

| ADD_P STATISTICS ON '(' name_list ')'

and I can get this to work too

| DROP STATISTICS ON '(' name_list ')'

if I expand out the "DROP opt_column" productions to be two separate
productions with and without COLUMN, as was done long ago for ADD COLUMN.

These also work without ON, actually, though it does not work to leave out
the parens. (I think that has to do with the conflict against multiple
ALTER TABLE subcommands, not against ADD COLUMN.)

It doesn't look like we could easily stick a "name" in between STATISTICS
and ON, but I'm unconvinced that that's necessary. Can't we just insist
that there be only one statistics declaration for a given column list?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#21)
Re: Multi-column distinctness.

Hi,

On 10/20/2015 09:11 PM, Tom Lane wrote:

I wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

The problem is that the actual syntax is ADD [COLUMN], so we
can't simply use ADD STATISTICS as that would mean a conflict in
the grammar. Resolving it means either making COLUMN
non-optional, or adding STATISTICS to reserved keywords - both
options break existing code.

I'm unconvinced that it cannot be made to work.

Specifically, this works just fine for me as a variant of
alter_table_cmd:

| ADD_P STATISTICS ON '(' name_list ')'

and I can get this to work too

| DROP STATISTICS ON '(' name_list ')'

if I expand out the "DROP opt_column" productions to be two separate
productions with and without COLUMN, as was done long ago for ADD
COLUMN.

These also work without ON, actually, though it does not work to
leave out the parens. (I think that has to do with the conflict
against multiple ALTER TABLE subcommands, not against ADD COLUMN.)

It doesn't look like we could easily stick a "name" in between
STATISTICS and ON, but I'm unconvinced that that's necessary.

Well, it's definitely easier to reference the statistics by name (e.g.
in the DROP command).

Can't we just insist that there be only one statistics declaration
for a given column list?

I would rather not, because I envision preferring different types of
statistics for different types of queries. For example the statistics
discussed in this particular thread only supports equality clauses, so
this restriction would mean we can't also define histogram, we'll be
unable to estimate queries with ranges.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23Simon Riggs
simon@2ndQuadrant.com
In reply to: Tomas Vondra (#22)
Re: Multi-column distinctness.

On 20 October 2015 at 16:48, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

and I can get this to work too

| DROP STATISTICS ON '(' name_list ')'

if I expand out the "DROP opt_column" productions to be two separate
productions with and without COLUMN, as was done long ago for ADD
COLUMN.

These also work without ON, actually, though it does not work to
leave out the parens. (I think that has to do with the conflict
against multiple ALTER TABLE subcommands, not against ADD COLUMN.)

It doesn't look like we could easily stick a "name" in between
STATISTICS and ON, but I'm unconvinced that that's necessary.

Well, it's definitely easier to reference the statistics by name (e.g. in
the DROP command).

Can't we just insist that there be only one statistics declaration

for a given column list?

I would rather not, because I envision preferring different types of
statistics for different types of queries. For example the statistics
discussed in this particular thread only supports equality clauses, so this
restriction would mean we can't also define histogram, we'll be unable to
estimate queries with ranges.

Can we do something like this...

ADD STATISTICS ON (col list) USING (histogram, MFV)

so we have Types/Methods of statistic, rather than specific names for the
statistic entry?

Since this command doesn't actually ADD the statistics, it just creates a
specification used by the next ANALYZE, it would seem better to use a
different keyword than ADD, perhaps DECLARE STATISTICS ON... and DROP
STATISTICS ON

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#24Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Simon Riggs (#23)
Re: Multi-column distinctness.

On 10/20/2015 11:28 PM, Simon Riggs wrote:

On 20 October 2015 at 16:48, Tomas Vondra <tomas.vondra@2ndquadrant.com
<mailto:tomas.vondra@2ndquadrant.com>> wrote:

On 10/20/2015 09:11 PM, Tom Lane wrote:

Can't we just insist that there be only one statistics declaration
for a given column list? >>

I would rather not, because I envision preferring different types of
statistics for different types of queries. For example the statistics
discussed in this particular thread only supports equality clauses,
so this restriction would mean we can't also define histogram, we'll
be unable to estimate queries with ranges.

Can we do something like this...

ADD STATISTICS ON (col list) USING (histogram, MFV)
so we have Types/Methods of statistic, rather than specific names for
the statistic entry?

That's how it works now (comparing columns and types of stats), but I
find it awkward. That's why I proposed adding the name.

Since this command doesn't actually ADD the statistics, it just creates
a specification used by the next ANALYZE, it would seem better to use a
different keyword than ADD, perhaps DECLARE STATISTICS ON... and DROP
STATISTICS ON

Maybe, although we should not use DROP with DECLARE. Not only DROP has
the same issue with name as ADD, but I think it's a good practice to
commands "paired" ADD-DROP and DECLARE-X?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#24)
Re: Multi-column distinctness.

Tomas Vondra <tomas.vondra@2ndQuadrant.com> writes:

On 10/20/2015 11:28 PM, Simon Riggs wrote:

Since this command doesn't actually ADD the statistics, it just creates
a specification used by the next ANALYZE, it would seem better to use a
different keyword than ADD, perhaps DECLARE STATISTICS ON... and DROP
STATISTICS ON

Maybe, although we should not use DROP with DECLARE. Not only DROP has
the same issue with name as ADD, but I think it's a good practice to
commands "paired" ADD-DROP and DECLARE-X?

If we spelled the drop case as "DROP STATISTICS name", I think the
conflict could be avoided. I agree though that pairing DECLARE and DROP
seems rather unintuitive.

I am getting more attracted to your suggestion of making these things
stand-alone commands "CREATE STATISTICS" and "DROP STATISTICS". Not only
does that fix the syntactic problems of shoehorning them into ALTER TABLE,
but it would be possible to design the syntax to allow for straightforward
extension to multiple-table cases. Even if we don't know what we'd do
with that just yet, it seems to me like a possible future extension that
we should keep in mind. And anything based on ALTER TABLE just cannot do
that reasonably.

So consider

CREATE STATISTICS name ON table(columnlist) [, table(columnlist) ...]
[ WITH options ]

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#25)
Re: Multi-column distinctness.

On 20 October 2015 at 21:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I am getting more attracted to your suggestion of making these things
stand-alone commands "CREATE STATISTICS" and "DROP STATISTICS". Not only
does that fix the syntactic problems of shoehorning them into ALTER TABLE,
but it would be possible to design the syntax to allow for straightforward
extension to multiple-table cases. Even if we don't know what we'd do
with that just yet, it seems to me like a possible future extension that
we should keep in mind. And anything based on ALTER TABLE just cannot do
that reasonably.

+1

Good argument, so now I understand and agree with Tomas' original suggestion

So consider

CREATE STATISTICS name ON table(columnlist) [, table(columnlist) ...]
[ WITH options ]

Seems good

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services