Collect frequency statistics for arrays

Started by Alexander Korotkovover 14 years ago49 messageshackers
Jump to latest
#1Alexander Korotkov
aekorotkov@gmail.com

Hi!

There is updated version of patch. General list of changes since reviewed
version:
1) Distinct slot is used for length histogram.
2) Standard statistics is collected for arrays.
3) Most common values and most common elements are mapped to distinct
columns of pg_stats view, because both of them are calculated for arrays.
4) Description of lossy counting algorithm was copied from
compute_tsvector_stats with corresponding changes in it.
5) In estimation functions comments about assumtions were added.

Accuracy testing

Following files are attached.
datasets.sql - sql script which generates test datasets
arrayanalyze.php - php script which does accuracy testing
results.sql - dump of table with tests results

As we can see from testing results, estimates seem to be quite accurate in
most part of test cases. When length of constant array exceeds 30, estimate
of "column <@ const" is very inaccurate for arrat_test3 table. It's related
with skipping of length histogram usage because of high CPU usage during
estimate (see array_sel.c:888).

------
With best regards,
Alexander Korotkov.

Attachments:

arrayanalyze-0.6.patch.gzapplication/x-gzip; name=arrayanalyze-0.6.patch.gzDownload
datasets.sqltext/x-sql; charset=US-ASCII; name=datasets.sqlDownload
arrayanalyze.phpapplication/x-httpd-php; name=arrayanalyze.phpDownload
results.sql.gzapplication/x-gzip; name=results.sql.gzDownload+1-0
#2Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#1)
Re: Collect frequency statistics for arrays

Rebased with head.

------
With best regards,
Alexander Korotkov.

Attachments:

arrayanalyze-0.7.patch.gzapplication/x-gzip; name=arrayanalyze-0.7.patch.gzDownload
#3Nathan Boley
npboley@gmail.com
In reply to: Alexander Korotkov (#2)
Re: Collect frequency statistics for arrays

Rebased with head.

FYI, I've added myself as the reviewer for the current commitfest.

Best,
Nathan Boley

#4Alexander Korotkov
aekorotkov@gmail.com
In reply to: Nathan Boley (#3)
Re: Collect frequency statistics for arrays

Hi!

On Wed, Nov 16, 2011 at 1:43 AM, Nathan Boley <npboley@gmail.com> wrote:

FYI, I've added myself as the reviewer for the current commitfest.

How is going review now?

------
With best regards,
Alexander Korotkov.

#5Noah Misch
noah@leadboat.com
In reply to: Alexander Korotkov (#4)
Re: Collect frequency statistics for arrays

On Tue, Dec 20, 2011 at 04:37:37PM +0400, Alexander Korotkov wrote:

On Wed, Nov 16, 2011 at 1:43 AM, Nathan Boley <npboley@gmail.com> wrote:

FYI, I've added myself as the reviewer for the current commitfest.

How is going review now?

I will examine this patch within the week.

#6Noah Misch
noah@leadboat.com
In reply to: Alexander Korotkov (#2)
Re: Collect frequency statistics for arrays

On Wed, Nov 09, 2011 at 08:49:35PM +0400, Alexander Korotkov wrote:

Rebased with head.

I took a look at this patch. The basic approach seems sensible. For array
columns, ANALYZE will now determine the elements most often present in the
column's arrays. It will also calculate a histogram from the counts of
distinct elements in each array. That is to say, both newly-collected
statistics treat {1,1,2,3,1} and {3,1,2} identically. New selectivity
machinery uses the new statistics to optimize these operations:

column @> const
column <@ const
column && const
const = ANY (column)
const = ALL (column)

Concrete estimates look mostly sane, with a few corner cases I note below.

We have many implementation details to nail down.

With the patch applied, I get the attached regression.diffs from "make check".
The palloc errors indicate bugs, but rules.out just needs a refresh.

During ANALYZE, we'll now detoast all array column values regardless of size,
just as we already do for tsvector columns. That may be reasonable enough for
tsvector table/index columns, whose very presence is a hint that the user has
planned to use the value for searching. Since arrays make no such
implication, should we skip large arrays to constrain TOAST I/O? Say, skip
arrays larger than 100 KiB or 1 MiB?

I find distressing the thought of having two copies of the lossy sampling
code, each implementing the algorithm with different variable names and levels
of generality. We might someday extend this to hstore, and then we'd have yet
another copy. Tom commented[1]http://archives.postgresql.org/message-id/12406.1298055475@sss.pgh.pa.us that ts_typanalyze() and array_typanalyze()
should remain distinct, and I agree. However, they could call a shared
counting module. Is that practical? Possible API:

typedef struct LossyCountCtl;
LossyCountCtl *LossyCountStart(float s,
float epsilon,
int2 typlen,
bool typbyval,
Oid eqfunc); /* + hash func, a few others */
void LossyCountAdd(LossyCountCtl *ctl, Datum elem);
TrackItem **LossyCountGetAll(LossyCountCtl *ctl);

[1]: http://archives.postgresql.org/message-id/12406.1298055475@sss.pgh.pa.us

*** a/doc/src/sgml/catalogs.sgml
--- b/doc/src/sgml/catalogs.sgml
***************
*** 8253,8260 ****
<entry>
A list of the most common values in the column. (Null if
no values seem to be more common than any others.)
!        For some data types such as <type>tsvector</>, this is a list of
!        the most common element values rather than values of the type itself.
</entry>
</row>
--- 8253,8261 ----
<entry>
A list of the most common values in the column. (Null if
no values seem to be more common than any others.)
!        For some data types such as <type>arrays</> and <type>tsvector</>,
!        this is a list of the most common element values rather than values of
!        the type itself.
</entry>
</row>

***************
*** 8266,8274 ****
A list of the frequencies of the most common values or elements,
i.e., number of occurrences of each divided by total number of rows.
(Null when <structfield>most_common_vals</structfield> is.)
! For some data types such as <type>tsvector</>, it can also store some
! additional information, making it longer than the
! <structfield>most_common_vals</> array.
</entry>
</row>

--- 8267,8275 ----
A list of the frequencies of the most common values or elements,
i.e., number of occurrences of each divided by total number of rows.
(Null when <structfield>most_common_vals</structfield> is.)
!        For some data types such as <type>arrays</> and <type>tsvector</>,
!        it can also store some additional information, making it longer than
!        the <structfield>most_common_vals</> array.
</entry>
</row>

We're falsifying the above by splitting out that case into new columns
most_common_elems and most_common_elem_freqs.

***************
*** 8284,8289 ****
--- 8285,8291 ----
does not have a <literal>&lt;</> operator or if the
<structfield>most_common_vals</> list accounts for the entire
population.)
+        For <type>arrays</>, it holds histogram bounds of array lengths. 
</entry>
</row>

Likewise: that's now in the new column length_histogram_bounds.

We need documentation for the new pg_stats columns. Also, in particular,
let's document the special entries at the end of most_common_freqs.

*** a/src/backend/catalog/system_views.sql
--- b/src/backend/catalog/system_views.sql
***************
*** 117,145 **** CREATE VIEW pg_stats AS
stawidth AS avg_width,
stadistinct AS n_distinct,
CASE
!             WHEN stakind1 IN (1, 4) THEN stavalues1
!             WHEN stakind2 IN (1, 4) THEN stavalues2
!             WHEN stakind3 IN (1, 4) THEN stavalues3
!             WHEN stakind4 IN (1, 4) THEN stavalues4
END AS most_common_vals,
CASE
!             WHEN stakind1 IN (1, 4) THEN stanumbers1
!             WHEN stakind2 IN (1, 4) THEN stanumbers2
!             WHEN stakind3 IN (1, 4) THEN stanumbers3
!             WHEN stakind4 IN (1, 4) THEN stanumbers4
END AS most_common_freqs,
CASE
WHEN stakind1 = 2 THEN stavalues1
WHEN stakind2 = 2 THEN stavalues2
WHEN stakind3 = 2 THEN stavalues3
WHEN stakind4 = 2 THEN stavalues4
END AS histogram_bounds,
CASE
WHEN stakind1 = 3 THEN stanumbers1[1]
WHEN stakind2 = 3 THEN stanumbers2[1]
WHEN stakind3 = 3 THEN stanumbers3[1]
WHEN stakind4 = 3 THEN stanumbers4[1]
!         END AS correlation
FROM pg_statistic s JOIN pg_class c ON (c.oid = s.starelid)
JOIN pg_attribute a ON (c.oid = attrelid AND attnum = s.staattnum)
LEFT JOIN pg_namespace n ON (n.oid = c.relnamespace)
--- 117,170 ----
stawidth AS avg_width,
stadistinct AS n_distinct,
CASE
!             WHEN stakind1 = 1 THEN stavalues1
!             WHEN stakind2 = 1 THEN stavalues2
!             WHEN stakind3 = 1 THEN stavalues3
!             WHEN stakind4 = 1 THEN stavalues4
!             WHEN stakind5 = 1 THEN stavalues5
END AS most_common_vals,
CASE
!             WHEN stakind1 = 1 THEN stanumbers1
!             WHEN stakind2 = 1 THEN stanumbers2
!             WHEN stakind3 = 1 THEN stanumbers3
!             WHEN stakind4 = 1 THEN stanumbers4
!             WHEN stakind5 = 1 THEN stanumbers5
END AS most_common_freqs,
CASE
WHEN stakind1 = 2 THEN stavalues1
WHEN stakind2 = 2 THEN stavalues2
WHEN stakind3 = 2 THEN stavalues3
WHEN stakind4 = 2 THEN stavalues4
+             WHEN stakind5 = 2 THEN stavalues5
END AS histogram_bounds,
CASE
WHEN stakind1 = 3 THEN stanumbers1[1]
WHEN stakind2 = 3 THEN stanumbers2[1]
WHEN stakind3 = 3 THEN stanumbers3[1]
WHEN stakind4 = 3 THEN stanumbers4[1]
!             WHEN stakind5 = 3 THEN stanumbers5[1]
!         END AS correlation,
!         CASE
!             WHEN stakind1 = 4 THEN stavalues1
!             WHEN stakind2 = 4 THEN stavalues2
!             WHEN stakind3 = 4 THEN stavalues3
!             WHEN stakind4 = 4 THEN stavalues4
!             WHEN stakind5 = 4 THEN stavalues5
!         END AS most_common_elems,
!         CASE
!             WHEN stakind1 = 4 THEN stanumbers1
!             WHEN stakind2 = 4 THEN stanumbers2
!             WHEN stakind3 = 4 THEN stanumbers3
!             WHEN stakind4 = 4 THEN stanumbers4
!             WHEN stakind5 = 4 THEN stanumbers5
!         END AS most_common_elem_freqs,

I think this is an improvement, but some code out there may rely on the
ability to get stakind = 4 data from the most_common_vals column. We'll need
to mention this in the release notes as an incompatibility.

! CASE
! WHEN stakind1 = 5 THEN stavalues1
! WHEN stakind2 = 5 THEN stavalues2
! WHEN stakind3 = 5 THEN stavalues3
! WHEN stakind4 = 5 THEN stavalues4
! WHEN stakind5 = 5 THEN stavalues5
! END AS length_histogram_bounds
FROM pg_statistic s JOIN pg_class c ON (c.oid = s.starelid)
JOIN pg_attribute a ON (c.oid = attrelid AND attnum = s.staattnum)
LEFT JOIN pg_namespace n ON (n.oid = c.relnamespace)

*** a/src/backend/commands/typecmds.c
--- b/src/backend/commands/typecmds.c
***************
*** 610,616 **** DefineType(List *names, List *parameters)
F_ARRAY_SEND,	/* send procedure */
typmodinOid,		/* typmodin procedure */
typmodoutOid,	/* typmodout procedure */
! 			   InvalidOid,		/* analyze procedure - default */
typoid,			/* element type ID */
true,			/* yes this is an array type */
InvalidOid,		/* no further array type */
--- 610,616 ----
F_ARRAY_SEND,	/* send procedure */
typmodinOid,		/* typmodin procedure */
typmodoutOid,	/* typmodout procedure */
! 			   ArrayTypanalyzeOid,		/* special analyze procedure for arrays */
typoid,			/* element type ID */
true,			/* yes this is an array type */
InvalidOid,		/* no further array type */

The recently-added function DefineRange() needs the same change.

*** /dev/null
--- b/src/backend/utils/adt/array_sel.c

"array_selfuncs.c" would better match our informal convention.

***************
*** 0 ****
--- 1,948 ----
+ /*-------------------------------------------------------------------------
+  *
+  * array_sel.c
+  *	  Functions for selectivity estimation of array operators.
+  *
+  * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/tsearch/array_sel.c

Fix file location.

+  *
+  *-------------------------------------------------------------------------
+  */
+ 
+ #include "postgres.h"
+ #include "access/hash.h"
+ #include "catalog/pg_operator.h"
+ #include "commands/vacuum.h"
+ #include "utils/builtins.h"
+ #include "utils/typcache.h"
+ #include "utils/array.h"
+ #include "catalog/pg_am.h"
+ #include "catalog/pg_collation.h"
+ #include "commands/defrem.h"
+ #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"

Sort the includes alphabetically, except for postgres.h coming first.

+ 
+ /* Default selectivity constant */
+ #define DEFAULT_CONT_SEL 0.005
+ 
+ /* Macro for selectivity estimation to be used if we have no statistics */
+ #define array_selec_no_stats(array,nitems,op) \
+ 	mcelem_array_selec(array, nitems, typentry, NULL, 0, NULL, 0, NULL, 0, op)
+ 
+ Datum		arraysel(PG_FUNCTION_ARGS);

extern prototypes go in header files, even when it's not strictly necessary.

+ 
+ static Selectivity calc_arraysel(VariableStatData *vardata, Datum constval,
+ 			  Oid operator);
+ static Selectivity mcelem_array_selec(ArrayType *array, int nitems, TypeCacheEntry *typentry,
+ 				   Datum *mcelem, int nmcelem, float4 *numbers, int nnumbers,
+ 				   Datum *hist, int nhist, Oid operator);
+ static int	element_compare(const void *key1, const void *key2);
+ bool		find_next_mcelem(Datum *mcelem, int nmcelem, Datum value, int *index);
+ static Selectivity mcelem_array_contain_overlap_selec(Datum *mcelem,
+   int nmcelem, float4 *numbers, Datum *array_data, int nitems, Oid operator);
+ static float calc_hist(Datum *hist, int nhist, float *hist_part, int n);
+ static Selectivity mcelem_array_contained_selec(Datum *mcelem, int nmcelem,
+ 	  float4 *numbers, Datum *array_data, int nitems, Datum *hist, int nhist,
+ 							 Oid operator);
+ static float *calc_distr(float *p, int n, int m, float rest);
+ 
+ /* Compare function of element data type */
+ static FunctionCallInfoData cmpfunc;
+ 
+ /*
+  * selectivity for "const = ANY(column)" and "const = ALL(column)"
+  */
+ Selectivity
+ calc_scalararraysel(VariableStatData *vardata, Datum constval, bool orClause)

You have scalararraysel() calling this function for any operator (consider
"const < ANY(column)"), but it only handles a single operator: the "=" of the
default btree opclass used at ANALYZE time. We could start by just returning
a constant selectivity for other operators, but we may be able to do better.
If the actual operator is the equality operator we used at ANALYZE time
(staop), use binary search on the mcelem array. Otherwise, use linear search,
applying the operator to each MCE. (If the benefits justify the trouble, you
could also use this strategy to support types with no default btree opclass.)

+ {
+ 	Oid			elemtype;
+ 	Selectivity selec;
+ 	TypeCacheEntry *typentry;
+ 	Datum	   *hist;
+ 	int			nhist;
+ 
+ 	elemtype = get_base_element_type(vardata->vartype);
+ 
+ 	/* Get default comparison function */
+ 	typentry = lookup_type_cache(elemtype,
+ 							  TYPECACHE_CMP_PROC | TYPECACHE_CMP_PROC_FINFO);
+ 	InitFunctionCallInfoData(cmpfunc, &typentry->cmp_proc_finfo, 2,
+ 							 DEFAULT_COLLATION_OID, NULL, NULL);

The default btree operator class that existed at ANALYZE time may no longer
exist. If we don't find a cmp function here, punt to avoid a crash.

+ /*
+  * arraysel -- restriction selectivity for "column @> const", "column && const"
+  * and "column <@ const"
+  */
+ Datum
+ arraysel(PG_FUNCTION_ARGS)
+ {
+ 	PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+ 
+ 	Oid			operator = PG_GETARG_OID(1);
+ 	List	   *args = (List *) PG_GETARG_POINTER(2);
+ 	int			varRelid = PG_GETARG_INT32(3);
+ 	VariableStatData vardata;
+ 	Node	   *other;
+ 	bool		varonleft;
+ 	Selectivity selec;
+ 	Oid			element_typeid;
+ 
+ 	/*
+ 	 * If expression is not variable = something or something = variable, then
+ 	 * punt and return a default estimate.
+ 	 */

The operator is never "=".

+ 	if (!get_restriction_variable(root, args, varRelid,
+ 								  &vardata, &other, &varonleft))
+ 		PG_RETURN_FLOAT8(DEFAULT_CONT_SEL);
+ 
+ 	/*
+ 	 * Can't do anything useful if the something is not a constant, either.
+ 	 */
+ 	if (!IsA(other, Const))
+ 	{
+ 		ReleaseVariableStats(vardata);
+ 		PG_RETURN_FLOAT8(DEFAULT_CONT_SEL);
+ 	}

Previously, we defaulted to 0.005 for operator "&&" (via areasel()) and 0.001
for operators "@>" and "<@" (via contsel()). Surely some difference between
those cases remains appropriate.

+ 		if (res == 0)
+ 		{
+ 			*index = i;
+ 			return true;
+ 		}
+ 		else if (res < 0)
+ 		{
+ 			l = i + 1;
+ 		}
+ 		else
+ 		{
+ 			r = i - 1;
+ 		}

Throughout this patch, omit braces around single statements.

+ 	}
+ 	*index = l;
+ 	return false;
+ }
+ 
+ /*
+  * Array selectivity estimation based on most common elements statistics.
+  */
+ static Selectivity
+ mcelem_array_selec(ArrayType *array, int nitems, TypeCacheEntry *typentry,
+ 	  Datum *mcelem, int nmcelem, float4 *numbers, int nnumbers, Datum *hist,
+ 				   int nhist, Oid operator)
+ {
+ 	/* "column @> const" and "column && const" cases */
+ 	if (operator == OID_ARRAY_CONTAIN_OP || operator == OID_ARRAY_OVERLAP_OP)
+ 		return mcelem_array_contain_overlap_selec(mcelem, nmcelem, numbers,
+ 									   array_data, nonnull_nitems, operator);
+ 
+ 	/* "column <@ const" case */
+ 	if (operator == OID_ARRAY_CONTAINED_OP)
+ 		return mcelem_array_contained_selec(mcelem, nmcelem, numbers,
+ 						  array_data, nonnull_nitems, hist, nhist, operator);
+ 	return DEFAULT_CONT_SEL;

Returning a fixed selectivity when this gets attached to an unexpected
operator seems counterproductive. Let's just elog(ERROR) in that case.

+ /*
+  * Array selectivity estimation based on most common elements statistics for
+  * "column @> const" and "column && const" cases. This estimation assumes
+  * element occurences to be independent.
+  */
+ static Selectivity
+ mcelem_array_contain_overlap_selec(Datum *mcelem, int nmcelem,
+ 				float4 *numbers, Datum *array_data, int nitems, Oid operator)

We could take some advantage of the unique element count histogram for "@>".
Any column value with fewer distinct elements than the constant array cannot
match. We perhaps can't use both that fact and element frequency stats at the
same time, but we could use the lesser of the two probabilities.

For operator "&&" or operator "@>" with a nonempty constant array, no rows
having empty arrays will match. Excepting the special case of "@>" with an
empty constant array, we can multiply the MCE-derived selectivity by the
fraction, based on the histogram, of nonempty arrays in the column. (For
"<@", rows having empty arrays always match.)

I don't think it's mandatory that the initial commit implement the above, but
it's something to mention in the comments as a future direction.

+ /*
+  * Let be n independent events with probabilities p. This function calculates
+  * probabilities of exact k of events occurence for k in [0;m].
+  * Imagine matrix M of (n + 1) x (m + 1) size. Element M[i,j] denotes
+  * probability that exact j of first i events occurs. Obviously M[0,0] = 1.
+  * Each next event increase total number of occured events if it occurs and
+  * leave last value of that number if it doesn't occur. So, by the law of
+  * total probability: M[i,j] = M[i - 1, j] * (1 - p[i]) + M[i - 1, j - 1] * p[i]
+  * for i > 0, j > 0. M[i,0] = M[i - 1, 0] * (1 - p[i]) for i > 0.
+  * Also there could be some events with low probabilities. Their summary
+  * probability passed in the rest parameter.
+  */
+ static float *
+ calc_distr(float *p, int n, int m, float rest)
+ {
+ 	/* Take care about events with low probabilities. */
+ 	if (rest > 0.0f)
+ 	{
+ 		/*
+ 		 * The probability of no occurence of events which forms "rest"
+ 		 * probability have a limit of exp(-rest) when number of events fo to
+ 		 * infinity. Another simplification is to replace that events with one
+ 		 * event with (1 - exp(-rest)) probability.
+ 		 */
+ 		rest = 1.0f - exp(-rest);

What is the name of the underlying concept in probability theory?

+ /*
+  * Array selectivity estimation based on most common elements statistics for
+  * "column <@ const" case. Assumption that element occurences are independent
+  * means certain distribution of array lengths. Typically real distribution
+  * of lengths is significantly different from it. For example, if even we
+  * have set of arrays with 1 integer element in range [0;10] each, element
+  * occurences are not independent. Because in the case of independence we

Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
'{6,19,4}' cannot appear?

+  * have probabilities of length of 0, 1, 2 etc. In the "column @> const"
+  * and "column && const" cases we usually have "const" with low summary
+  * frequency of elements (otherwise we have selectivity close to 0 or 1
+  * correspondingly). That's why effect of dependence related to lengths
+  * distribution id negligible there. In the "column <@ const" case summary
+  * frequency of elements is high (otherwise we have selectivity close to 0).

What does the term "summary frequency" denote?

+  * That's why we should do correction due to array lengths distribution.
+  */
+ static Selectivity
+ mcelem_array_contained_selec(Datum *mcelem, int nmcelem, float4 *numbers,
+ 		 Datum *array_data, int nitems, Datum *hist, int nhist, Oid operator)

Break up the parameter list to avoid pgindent reverse-indenting the line.

When I tried a constant array with duplicate elements, I saw an inappropriate
report of 100% selectivity:

[local] test=# create table t1 as select array[n % 2, n] as arr from generate_series(1,100000) t(n);
SELECT 100000
[local] test=# analyze t1;
WARNING: problem in alloc set Analyze: detected write past chunk end in block 0x7f189cbbf010, chunk 0x7f189cc1d0d8
ANALYZE
[local] test=# explain select * from t1 where arr <@ '{0,45}';
QUERY PLAN
--------------------------------------------------------
Seq Scan on t1 (cost=0.00..1986.00 rows=186 width=29)
Filter: (arr <@ '{0,45}'::integer[])
(2 rows)

[local] test=# explain select * from t1 where arr <@ '{0,45,45}';
QUERY PLAN
-----------------------------------------------------------
Seq Scan on t1 (cost=0.00..1986.00 rows=100000 width=29)
Filter: (arr <@ '{0,45,45}'::integer[])
(2 rows)

By contrast, the estimate in the non-duplicate case looks sane considering
these estimates for the individual elements:

[local] test=# explain select * from t1 where 0 = any (arr);
QUERY PLAN
----------------------------------------------------------
Seq Scan on t1 (cost=0.00..2986.00 rows=49967 width=29)
Filter: (0 = ANY (arr))
(2 rows)

[local] test=# explain select * from t1 where 45 = any (arr);
QUERY PLAN
--------------------------------------------------------
Seq Scan on t1 (cost=0.00..2986.00 rows=500 width=29)
Filter: (45 = ANY (arr))
(2 rows)

Incidentally, "const = ALL (column)" should be equivalent to "column <@
array[const]". (Assuming the "=" operator chosen in the first statement is
the "=" operator of the array type's default btree opclass). However, I get
significantly different estimates, with the latter getting a better estimate:

[local] test=# explain select * from t1 where 1 = all (arr);
QUERY PLAN
----------------------------------------------------------
Seq Scan on t1 (cost=0.00..2986.00 rows=18407 width=29)
Filter: (1 = ALL (arr))
(2 rows)

[local] test=# explain select * from t1 where arr <@ array[1]http://archives.postgresql.org/message-id/12406.1298055475@sss.pgh.pa.us;
QUERY PLAN
------------------------------------------------------
Seq Scan on t1 (cost=0.00..1986.00 rows=1 width=29)
Filter: (arr <@ '{1}'::integer[])
(2 rows)

+ 	/*
+ 	 * Rest is a average length of elements which aren't present in mcelem.
+ 	 */
+ 	rest = avg_length;

You define "rest" here as an array length ...

+ 
+ 	default_freq = Min(DEFAULT_CONT_SEL, minfreq / 2);
+ 
+ 	mcelem_index = 0;
+ 
+ 	/*
+ 	 * mult is the multiplier that presents estimate of probability that each
+ 	 * mcelem which is not present in constant doesn't occur.
+ 	 */
+ 	mult = 1.0f;
+ 
+ 	for (i = 0; i < nitems; i++)
+ 	{
+ 		bool		found = false;
+ 
+ 		/* Comparison with previous value in order to guarantee uniquness */
+ 		if (i > 0)
+ 		{
+ 			if (!element_compare(&array_data[i - 1], &array_data[i]))
+ 				continue;
+ 		}
+ 
+ 		/*
+ 		 * Iterate over mcelem until find mcelem that is greater or equal to
+ 		 * element of constant. Simultaneously taking care about rest and
+ 		 * mult. If that mcelem is found then fill corresponding elem_selec.
+ 		 */
+ 		while (mcelem_index < nmcelem)
+ 		{
+ 			int			cmp = element_compare(&mcelem[mcelem_index], &array_data[i]);
+ 
+ 			if (cmp < 0)
+ 			{
+ 				mult *= (1.0f - numbers[mcelem_index]);
+ 				rest -= numbers[mcelem_index];

... But here, you're subtracting a frequency from an array length?

+ /*
+  * Comparison function for elements. Based on default comparison function for
+  * array element data type.
+  */
+ static int
+ element_compare(const void *key1, const void *key2)
+ {
+ 	const Datum *d1 = (const Datum *) key1;
+ 	const Datum *d2 = (const Datum *) key2;
+ 
+ 	cmpfunc.	arg[0] = *d1;
+ 	cmpfunc.	arg[1] = *d2;
+ 	cmpfunc.	argnull[0] = false;
+ 	cmpfunc.	argnull[1] = false;
+ 	cmpfunc.	isnull = false;
+ 
+ 	return DatumGetInt32(FunctionCallInvoke(&cmpfunc));
+ }

We could easily make this reentrant by passing the fcinfo through an argument
and using qsort_arg(). Please do so.

*** /dev/null
--- b/src/backend/utils/adt/array_typanalyze.c
***************
*** 0 ****
--- 1,834 ----
+ /*-------------------------------------------------------------------------
+  *
+  * array_typanalyze.c
+  *	  functions for gathering statistics from array columns
+  *
+  * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/tsearch/array_typanalyze.c

Fix file location.

+  *
+  *-------------------------------------------------------------------------
+  */
+ 
+ #include "postgres.h"
+ #include "access/hash.h"
+ #include "catalog/pg_operator.h"
+ #include "commands/vacuum.h"
+ #include "commands/defrem.h"
+ #include "parser/parse_oper.h"
+ #include "utils/builtins.h"
+ #include "utils/hsearch.h"
+ #include "utils/typcache.h"
+ #include "utils/array.h"
+ #include "catalog/pg_am.h"
+ #include "catalog/pg_collation.h"
+ #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"

Alphabetize the includes after "postgres.h".

+ 
+ #define ARRAY_ANALYZE_CHECK_OID(x) \
+ 	if (!OidIsValid(x)) \
+ 	{ \
+ 		elog(INFO, "Can't collect statistics on %d array type. Array \
+ 					statistics collection requires default hash and btree \
+ 					opclasses for element type.", stats->attrtypid); \
+ 		stats->stats_valid = false; \
+ 		return; \
+ 	}

No message is necessary: the absence of a btree opclass or of both opclasses
degrades normal statistics, and we make no message about it then.

The decision to abandon the stats at this point causes a minor regression for
types having only a default hash opclass. They currently get scalar minimal
stats; now they'll have none. See below for one way to avoid that.

This macro is used in just one function, and the return statement means one
wouldn't lightly use it anywhere else. Therefore, as a minor style point, I'd
place its definition right with the function that uses it rather than at the
head of the file.

+ Datum array_typanalyze(PG_FUNCTION_ARGS);

extern prototypes go in header files, even when it's not strictly necessary.

+ /*
+  *	array_typanalyze -- a custom typanalyze function for array columns
+  */
+ Datum
+ array_typanalyze(PG_FUNCTION_ARGS)
+ {
+ 	VacAttrStats *stats = (VacAttrStats *) PG_GETARG_POINTER(0);
+ 	Form_pg_attribute attr = stats->attr;
+ 	Oid			ltopr;
+ 	Oid			eqopr;
+ 	StdAnalyzeData *mystats;
+ 
+ 	/* If the attstattarget column is negative, use the default value */
+ 	/* NB: it is okay to scribble on stats->attr since it's a copy */
+ 	if (attr->attstattarget < 0)
+ 		attr->attstattarget = default_statistics_target;
+ 
+ 	/* Look for default "<" and "=" operators for column's type */
+ 	get_sort_group_operators(stats->attrtypid,
+ 							 false, false, false,
+ 							 &ltopr, &eqopr, NULL,
+ 							 NULL);
+ 
+ 	/* If column has no "=" operator, we can't do much of anything */
+ 	if (!OidIsValid(eqopr))
+ 		return false;
+ 
+ 	/* Save the operator info for compute_stats routines */
+ 	mystats = (StdAnalyzeData *) palloc(sizeof(StdAnalyzeData));
+ 	mystats->eqopr = eqopr;
+ 	mystats->eqfunc = get_opcode(eqopr);
+ 	mystats->ltopr = ltopr;
+ 	stats->extra_data = mystats;

Instead of duplicating this initialization and exporting StdAnalyzeData,
compute_scalar_stats() and compute_minimal_stats() from analyze.c, I suggest
structuring things as follows. Export only std_typanalyze() and call it here.
If it returns false, return false here, too. Otherwise, proceed with the
additional lookups you currently do in compute_array_stats(). If you don't
find everything you need (default btree operator class, for example), just
return true; the analysis will continue with the standard scalar approach as
setup by std_typanalyze(). Otherwise, replace compute_stats and extra_data
with your own materials.

+ 
+ 	stats->compute_stats = compute_array_stats;
+ 	/* see comment about the choice of minrows in commands/analyze.c */
+ 	stats->minrows = 300 * attr->attstattarget;
+ 
+ 	PG_RETURN_BOOL(true);
+ }
+ 
+ /*
+  *	compute_array_stats() -- compute statistics for a array column
+  *
+  *	This functions computes statistics that are useful for determining <@, &&,
+  *	@> operations selectivity, along with the fraction of non-null rows and
+  *	average width.
+  *
+  *	As an addition to the the most common values, as we do for most datatypes,
+  *	we're looking for the most common elements and length histogram. In the
+  *	case of relatively long arrays it might be more useful, because there most
+  *	probably won't be any two rows with the same array and thus MCV has no
+  *	much sense. With a list of the most common elements we can do a better job
+  *	at figuring out <@, &&, @> selectivity. Arrays length histogram helps to
+  *	"column <@ const" to be more precise.

The histogram addresses array distinct element counts, not array lengths.
That's exactly what we need for the selectivity calculations in question.
Please update the terminology throughout the patch, though.

+  *
+  *	The algorithm used is Lossy Counting, as proposed in the paper "Approximate
+  *	frequency counts over data streams" by G. S. Manku and R. Motwani, in
+  *	Proceedings of the 28th International Conference on Very Large Data Bases,
+  *	Hong Kong, China, August 2002, section 4.2. The paper is available at
+  *	http://www.vldb.org/conf/2002/S10P03.pdf
+  *
+  *	The Lossy Counting (aka LC) algorithm goes like this:
+  *	Let s be the threshold frequency for an item (the minimum frequency we
+  *	are interested in) and epsilon the error margin for the frequency. Let D
+  *	be a set of triples (e, f, delta), where e is an element value, f is that
+  *	element's frequency (actually, its current occurrence count) and delta is
+  *	the maximum error in f. We start with D empty and process the elements in
+  *	batches of size w. (The batch size is also known as "bucket size" and is
+  *	equal to 1/epsilon.) Let the current batch number be b_current, starting
+  *	with 1. For each element e we either increment its f count, if it's
+  *	already in D, or insert a new triple into D with values (e, 1, b_current
+  *	- 1). After processing each batch we prune D, by removing from it all
+  *	elements with f + delta <= b_current.  After the algorithm finishes we
+  *	suppress all elements from D that do not satisfy f >= (s - epsilon) * N,
+  *	where N is the total number of elements in the input.  We emit the
+  *	remaining elements with estimated frequency f/N.  The LC paper proves
+  *	that this algorithm finds all elements with true frequency at least s,
+  *	and that no frequency is overestimated or is underestimated by more than
+  *	epsilon.  Furthermore, given reasonable assumptions about the input
+  *	distribution, the required table size is no more than about 7 times w.
+  *
+  *	We set s to be the estimated frequency of the K'th element in a natural
+  *	language's frequency table, where K is the target number of entries in
+  *	the MCELEM array. We assume that the distribution of element frequencies
+  *	follows Zipf's law with an exponent of 1.
+  *
+  *	Assuming Zipfian distribution, the frequency of the K'th element is equal
+  *	to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of
+  *	elements in the language.	Putting W as one million, we get roughly
+  *	0.07/K. This gives s = 0.07/K.	We set epsilon = s/10, which gives bucket
+  *	width w = K/0.007 and maximum expected hashtable size of about 1000 * K.

These last two paragraphs, adapted from ts_typanalyze.c, assume natural
language documents. To what extent do these parameter choices remain sensible
for arbitrary data such as users may place in arrays? In any event, we need a
different justification, even if it's just a hand-wavy justification.

If I'm following this correctly, this choice of "s" makes the algorithm
guaranteed to find only elements constituting >= 7% of the input elements.
Incidentally, isn't that far too high for natural language documents? If the
English word "the" only constitutes 7% of typical documents, then this "s"
value would permit us to discard practically every word; we'd be left with
words read while filling the last bucket and words that happened to repeat
exceedingly often in the column. I haven't tried to make a test case to
observe this problem; am I missing something? (This question is largely
orthogonal to your patch.)

+  *
+  *	Note: in the above discussion, s, epsilon, and f/N are in terms of a
+  *	element's frequency as a fraction of all elements seen in the input.
+  *	However, what we actually want to store in the finished pg_statistic
+  *	entry is each element's frequency as a fraction of all rows that it occurs
+  *	in. Elements might be repeated in the same array. Since operators
+  *	<@, &&, @> takes care only about element occurence itself and not about
+  *	occurence count, function takes additional care about uniqueness of
+  *	counting. Also we need to change the divisor from N to nonnull_cnt to get
+  *	the number we want.

On the same tangent, why does ts_typanalyze() not deduplicate the same way?
The @@ operator has the same property.

+  */
+ static void
+ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
+ 					int samplerows, double totalrows)
+ {
+ 	int			num_mcelem;
+ 	int			null_cnt = 0;
+ 
+ 	/*
+ 	 * We should count not only null array values, but also null array
+ 	 * elements
+ 	 */
+ 	int			null_elem_cnt = 0.0;
+ 
+ 	double		total_width = 0.0;
+ 	double		total_length = 0.0;
+ 
+ 	/* This is D from the LC algorithm. */
+ 	HTAB	   *elements_tab;
+ 	HASHCTL		elem_hash_ctl;
+ 	HASH_SEQ_STATUS scan_status;
+ 
+ 	/* This is the current bucket number from the LC algorithm */
+ 	int			b_current;
+ 
+ 	/* This is 'w' from the LC algorithm */
+ 	int			bucket_width;
+ 	int			array_no,
+ 				element_no;
+ 	Datum		hash_key;
+ 	TrackItem  *item;
+ 
+ 	int			lengths_count;
+ 	int			length_index;
+ 	int			slot_idx = 0;
+ 	HTAB	   *length_tab;
+ 	HASHCTL		length_hash_ctl;
+ 	LengthItem *length_item;
+ 	LengthItem *sorted_length_tab;
+ 
+ 	/*
+ 	 * Most part of array operators, which selectivity is estimated by this
+ 	 * statistics, takes care only one occurence of element in array. That's
+ 	 * why we should take care about count element occurence only once per
+ 	 * array. To clean occurence flag for each array by iterating over all
+ 	 * hash table can be too expensive. That's why we store pointers to hash
+ 	 * items contain elements which occur in last array.
+ 	 */
+ 	TrackItem **occurences = NULL;
+ 	int			occurences_count = 0,
+ 				occurence_index;
+ 
+ 	TypeCacheEntry *typentry;
+ 	Oid			hash_opclass,
+ 				hash_opfamily,
+ 				element_typeid,
+ 				hash_oroc;
+ 	FmgrInfo	hash_func_info;
+ 
+ 	StdAnalyzeData *mystats = (StdAnalyzeData *) stats->extra_data;
+ 
+ 	/* Compute standard statistics */
+ 	if (OidIsValid(mystats->ltopr))
+ 		compute_scalar_stats(stats, fetchfunc, samplerows, totalrows);
+ 	else
+ 		compute_minimal_stats(stats, fetchfunc, samplerows, totalrows);
+ 
+ 
+ 	/* Gathering all necessary information about element data type. */
+ 
+ 	element_typeid = stats->attrtype->typelem;
+ 
+ 	if (!OidIsValid(element_typeid))
+ 		elog(ERROR, "array_typanalyze was invoked with %d non-array type",
+ 			 stats->attrtypid);
+ 
+ 	typentry = lookup_type_cache(element_typeid, TYPECACHE_EQ_OPR |
+ 	 TYPECACHE_CMP_PROC | TYPECACHE_EQ_OPR_FINFO | TYPECACHE_CMP_PROC_FINFO);
+ 	ARRAY_ANALYZE_CHECK_OID(typentry->cmp_proc);
+ 	ARRAY_ANALYZE_CHECK_OID(typentry->eq_opr);
+ 
+ 	hash_opclass = GetDefaultOpClass(element_typeid, HASH_AM_OID);
+ 	ARRAY_ANALYZE_CHECK_OID(hash_opclass);
+ 
+ 	hash_opfamily = get_opclass_family(hash_opclass);
+ 	ARRAY_ANALYZE_CHECK_OID(hash_opfamily);
+ 
+ 	hash_oroc = get_opfamily_proc(hash_opfamily, element_typeid,
+ 								  element_typeid, HASHPROC);
+ 	ARRAY_ANALYZE_CHECK_OID(hash_oroc);
+ 
+ 	fmgr_info(hash_oroc, &hash_func_info);
+ 
+ 	InitFunctionCallInfoData(element_type_info.cmp, &typentry->cmp_proc_finfo,
+ 							 2, DEFAULT_COLLATION_OID, NULL, NULL);
+ 	InitFunctionCallInfoData(element_type_info.eq, &typentry->eq_opr_finfo,
+ 							 2, DEFAULT_COLLATION_OID, NULL, NULL);
+ 	InitFunctionCallInfoData(element_type_info.hash, &hash_func_info,
+ 							 1, DEFAULT_COLLATION_OID, NULL, NULL);
+ 	element_type_info.typbyval = typentry->typbyval;

As I mentioned above in passing, all of the above setup only needs to happen
once per analyzed column, not once per tuple. It belongs in array_typanalyze;
define a struct to hold all the looked-up state, including a pointer to any
state for std_typanalyze(), and store that in stats->extra_data.

This code should probably get around to using the new SortSupport
infrastructure like analyze.c now uses. Not sure whether that's mandatory for
initial commit.

typanalyze functions that call arbitrary user code must be reentrant. It only
matters in corner cases, but crashing in those corner cases is not acceptable.
See our use of CurTupleHashTable in execGrouping.c and defend this global
state in a similar fashion.

+ 
+ 	/*
+ 	 * We want statistics_target * 10 elements in the MCELEM array. This
+ 	 * multiplier is pretty arbitrary, but is meant to reflect the fact that
+ 	 * the number of individual elements tracked in pg_statistic ought to be
+ 	 * more than the number of values for a simple scalar column.
+ 	 */
+ 	num_mcelem = stats->attr->attstattarget * 10;
+ 
+ 	/*
+ 	 * We set bucket width equal to (num_mcelem + 10) / 0.007 as per the
+ 	 * comment above.
+ 	 */
+ 	bucket_width = num_mcelem * 1000 / 7;

The addend mentioned is not present in the code or discussed in "the comment
above". (I see the comment is copied verbatim from ts_typanalyze(), where the
addend *is* present, though again the preceding comment says nothing of it.)

+ 
+ 	/*
+ 	 * Create the hashtable. It will be in local memory, so we don't need to
+ 	 * worry about overflowing the initial size. Also we don't need to pay any
+ 	 * attention to locking and memory management.
+ 	 */
+ 	MemSet(&elem_hash_ctl, 0, sizeof(elem_hash_ctl));
+ 	elem_hash_ctl.keysize = sizeof(Datum);
+ 	elem_hash_ctl.entrysize = sizeof(TrackItem);
+ 	elem_hash_ctl.hash = element_hash;
+ 	elem_hash_ctl.match = element_match;
+ 	elem_hash_ctl.hcxt = CurrentMemoryContext;
+ 	elements_tab = hash_create("Analyzed elements table",
+ 							   bucket_width * 7,

Though it's copied from compute_tsvector_stats(), why "7"?

+ 							   &elem_hash_ctl,
+ 					HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
+ 
+ 	/*
+ 	 * hashtable for arrays lengths.
+ 	 */
+ 	MemSet(&length_hash_ctl, 0, sizeof(length_hash_ctl));
+ 	length_hash_ctl.keysize = sizeof(int);
+ 	length_hash_ctl.entrysize = sizeof(LengthItem);
+ 	length_hash_ctl.hash = tag_hash;

You need to pass the HASH_FUNCTION flag for this setting to take effect.

+ length_hash_ctl.match = memcmp;

Omit this. You would need to pass HASH_COMPARE for it to take effect, but
it's also implicit once you override the hash function.

+ 	length_hash_ctl.hcxt = CurrentMemoryContext;
+ 	length_tab = hash_create("Array length table",
+ 							 64,
+ 							 &length_hash_ctl,
+ 							 HASH_ELEM | HASH_CONTEXT);
+ 
+ 	/* Initialize counters. */
+ 	b_current = 1;
+ 	element_no = 0;
+ 
+ 	/* Loop over the arrays. */
+ 	for (array_no = 0; array_no < samplerows; array_no++)
+ 	{
+ 		Datum		value;
+ 		bool		isnull;
+ 		bool		null_present;
+ 		ArrayType  *array;
+ 		char	   *ptr;
+ 		bits8	   *bitmap;
+ 		int			bitmask;
+ 		int			j;
+ 		int			ndims;
+ 		int		   *dims;
+ 		int			nitems;
+ 		bool		length_found;
+ 
+ 		vacuum_delay_point();
+ 
+ 		value = fetchfunc(stats, array_no, &isnull);
+ 
+ 		/*
+ 		 * Check for null/nonnull.
+ 		 */
+ 		if (isnull)
+ 		{
+ 			null_cnt++;
+ 			continue;
+ 		}
+ 
+ 		/*
+ 		 * Add up widths for average-width calculation.  Since it's a array,
+ 		 * we know it's varlena.  As in the regular compute_minimal_stats
+ 		 * function, we use the toasted width for this calculation.
+ 		 */
+ 		total_width += VARSIZE_ANY(DatumGetPointer(value));
+ 
+ 		/*
+ 		 * Now detoast the array if needed.
+ 		 */
+ 		array = DatumGetArrayTypeP(value);
+ 		ptr = ARR_DATA_PTR(array);
+ 		bitmap = ARR_NULLBITMAP(array);
+ 		bitmask = 1;
+ 		ndims = ARR_NDIM(array);
+ 		dims = ARR_DIMS(array);
+ 		nitems = ArrayGetNItems(ndims, dims);
+ 
+ 		/*
+ 		 * Check if we have enough of memory to store element occurences in
+ 		 * one array.
+ 		 */
+ 		if (nitems > occurences_count)
+ 		{
+ 			occurences_count = 2 * nitems;
+ 			if (occurences)
+ 				occurences = (TrackItem **) repalloc(occurences,
+ 									 sizeof(TrackItem *) * occurences_count);
+ 			else
+ 				occurences = (TrackItem **) palloc(
+ 									 sizeof(TrackItem *) * occurences_count);
+ 		}
+ 		occurence_index = 0;
+ 
+ 		null_present = false;
+ 
+ 		/*
+ 		 * We loop through the elements in the array and add them to our
+ 		 * tracking hashtable.	Note: the hashtable entries will point into
+ 		 * the (detoasted) array value, therefore we cannot free that storage
+ 		 * until we're done.
+ 		 */

The second sentence of this comment is obsolete.

+ 		for (j = 0; j < nitems; j++)
+ 		{
+ 			bool		found;
+ 			bool		isnull;
+ 
+ 			/* Get elements, checking for NULL */
+ 			if (bitmap && (*bitmap & bitmask) == 0)
+ 			{
+ 				hash_key = (Datum) 0;
+ 				isnull = true;
+ 				null_present = true;
+ 			}
+ 			else
+ 			{
+ 				/* Get element value */
+ 				hash_key = fetch_att(ptr, typentry->typbyval, typentry->typlen);
+ 				isnull = false;
+ 
+ 				/*
+ 				 * We should allocate memory for element if it isn't passing
+ 				 * by value, because array will be freed after that loop.
+ 				 */
+ 				if (!typentry->typbyval)
+ 				{
+ 					Datum		tmp;
+ 					int			length;
+ 
+ 					length = (typentry->typlen == -1) ?
+ 						VARSIZE(hash_key) : typentry->typlen;
+ 					tmp = (Datum) MemoryContextAlloc(stats->anl_context, length);
+ 					memcpy((void *) tmp, (void *) hash_key, length);
+ 					hash_key = tmp;
+ 				}

Please use datumCopy() or add a comment about why it's insufficient.

+ 				ptr = att_addlength_pointer(ptr, typentry->typlen, ptr);
+ 				ptr = (char *) att_align_nominal(ptr, typentry->typalign);
+ 			}
+ 
+ 			/* Advance bitmap pointers if any */
+ 			bitmask <<= 1;
+ 			if (bitmask == 0x100)
+ 			{
+ 				if (bitmap)
+ 					bitmap++;
+ 				bitmask = 1;
+ 			}
+ 
+ 			/* No null element processing other then flag setting here */
+ 			if (isnull)
+ 				continue;
+ 
+ 			/* Lookup current element in hashtable, adding it if new */
+ 			item = (TrackItem *) hash_search(elements_tab,
+ 											 (const void *) &hash_key,
+ 											 HASH_ENTER, &found);
+ 
+ 			if (found)
+ 			{
+ 				/*
+ 				 * The element is already on the tracking list. If it is the
+ 				 * first occurence in array then update element frequency.
+ 				 */
+ 				if (!item->occurence)
+ 				{
+ 					item->frequency++;
+ 					item->occurence = true;
+ 					occurences[occurence_index++] = item;
+ 				}
+ 			}
+ 			else
+ 			{
+ 				/* Initialize new tracking list element */
+ 				item->frequency = 1;
+ 				item->delta = b_current - 1;
+ 				item->occurence = true;
+ 				occurences[occurence_index++] = item;
+ 			}
+ 
+ 			/* element_no is the number of elements processed (ie N) */
+ 			element_no++;

We should not bump element_no when we skipped the element as a duplicate
within the same array.

+ 
+ 			/* We prune the D structure after processing each bucket */
+ 			if (element_no % bucket_width == 0)
+ 			{
+ 				prune_element_hashtable(elements_tab, b_current);
+ 				b_current++;
+ 			}
+ 		}
+ 		/* Count null element only once per array */
+ 		if (null_present)
+ 			null_elem_cnt++;
+ 
+ 		/* Update frequency of particular array length. */
+ 		length_item = (LengthItem *) hash_search(length_tab,
+ 												 &occurence_index,
+ 												 HASH_ENTER, &length_found);
+ 		if (length_found)
+ 		{
+ 			length_item->frequency++;
+ 		}
+ 		else
+ 		{
+ 			length_item->length = occurence_index;
+ 			length_item->frequency = 1;
+ 		}
+ 		total_length += occurence_index;
+ 
+ 		/*
+ 		 * When we end processing of particular array we should clean the
+ 		 * occurence flag.
+ 		 */
+ 		for (j = 0; j < occurence_index; j++)
+ 			occurences[j]->occurence = false;

Could you usefully simplify this away by storing the last-containing array_no,
instead of a bool, in each hash entry?

+ 
+ 		/* We should free memory from array if it was copied during detoast. */
+ 		if ((Datum) array != value)
+ 			pfree((void *) array);

No need to for casts to "void *".

+ 	}
+ 
+ 	/* Skip slots occupied by standard statistics */
+ 	while (OidIsValid(stats->stakind[slot_idx]))
+ 		slot_idx++;
+ 
+ 	/* Fill histogram of arrays lengths. */
+ 	lengths_count = hash_get_num_entries(length_tab);
+ 	if (lengths_count > 0)
+ 	{
+ 		int			num_hist = stats->attr->attstattarget;
+ 		int			delta;
+ 		int			frac;
+ 		int			i;
+ 		Datum	   *hist_values;
+ 
+ 		/* Copy lengths statistics from hashtab to array and sort them. */
+ 		length_index = 0;
+ 		sorted_length_tab = (LengthItem *) palloc(sizeof(LengthItem) * lengths_count);
+ 		hash_seq_init(&scan_status, length_tab);
+ 		while ((length_item = (LengthItem *) hash_seq_search(&scan_status)) != NULL)
+ 		{
+ 			memcpy(&sorted_length_tab[length_index], length_item,
+ 				   sizeof(LengthItem));
+ 			length_index++;
+ 		}
+ 		qsort(sorted_length_tab, lengths_count, sizeof(LengthItem),
+ 			  lengthitem_compare_element);
+ 
+ 		/* Histogram should be stored in anl_context. */
+ 		hist_values = (Datum *) MemoryContextAlloc(stats->anl_context,
+ 												   sizeof(Datum) * num_hist);
+ 		/* Fill histogram by hashtab. */
+ 		delta = samplerows - null_cnt - 1;
+ 		length_index = 0;
+ 		frac = sorted_length_tab[0].frequency * (num_hist - 1);
+ 		for (i = 0; i < num_hist; i++)
+ 		{
+ 			hist_values[i] =
+ 				Int32GetDatum(sorted_length_tab[length_index].length);
+ 			frac -= delta;
+ 			while (frac <= 0)
+ 			{
+ 				length_index++;
+ 				frac += sorted_length_tab[length_index].frequency *
+ 					(num_hist - 1);
+ 			}
+ 		}
+ 
+ 		stats->stakind[slot_idx] = STATISTIC_KIND_LENGTH_HISTOGRAM;
+ 		stats->staop[slot_idx] = Int4EqualOperator;
+ 		stats->stavalues[slot_idx] = hist_values;
+ 		stats->numvalues[slot_idx] = num_hist;
+ 		/* We are storing values of element type */
+ 		stats->statypid[slot_idx] = INT4OID;
+ 		stats->statyplen[slot_idx] = 4;
+ 		stats->statypbyval[slot_idx] = true;
+ 		stats->statypalign[slot_idx] = 'i';
+ 		slot_idx++;
+ 	}
+ 
+ 	/* We can only compute real stats if we found some non-null values. */
+ 	if (null_cnt < samplerows)
+ 	{
+ 		int			nonnull_cnt = samplerows - null_cnt;
+ 		int			i;
+ 		TrackItem **sort_table;
+ 		int			track_len;
+ 		int			cutoff_freq;
+ 		int			minfreq,
+ 					maxfreq;
+ 
+ 		stats->stats_valid = true;
+ 		/* Do the simple null-frac and average width stats */
+ 		stats->stanullfrac = (double) null_cnt / (double) samplerows;
+ 		stats->stawidth = total_width / (double) nonnull_cnt;

Isn't this redundant with the calculations made in compute_scalar_stats()?

+ 
+ 		/* Assume it's a unique column (see notes above) */
+ 		stats->stadistinct = -1.0;

The comment, copied from ts_typanalyze(), refers to notes not likewise copied.
We should probably instead leave whatever compute_scalar_stats() calculated.

*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
***************
*** 1705,1710 **** scalararraysel(PlannerInfo *root,
--- 1705,1736 ----
RegProcedure oprsel;
FmgrInfo	oprselproc;
Selectivity s1;
+ 	bool		varonleft;
+ 	Node	   *other;
+ 	VariableStatData vardata;
+ 	
+ 	/*
+ 	 * Handle "const = qual(column)" case using array column statistics.
+ 	 */
+ 	if (get_restriction_variable(root, clause->args, varRelid,
+ 								  &vardata, &other, &varonleft))
+ 	{
+ 		Oid elemtype;
+ 		elemtype = get_base_element_type(vardata.vartype);
+ 		if (elemtype != InvalidOid && IsA(other, Const))
+ 		{
+ 			if (((Const *) other)->constisnull)
+ 			{
+ 				/* qual can't succeed if null array */
+ 				ReleaseVariableStats(vardata);
+ 				return (Selectivity) 0.0;
+ 			}
+ 			s1 = calc_scalararraysel(&vardata, ((Const *) other)->constvalue, useOr);
+ 			ReleaseVariableStats(vardata);
+ 			return s1;
+ 		}
+ 		ReleaseVariableStats(vardata);
+ 	}

If we're going to add a new file for array selectivity functions (which seems
reasonable), scalararraysel() should also move there. (To do this and still
keep the patch easy to read, you could do the move in a pre-patch.)

*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 849,854 **** DATA(insert OID = 2334 (  array_agg_finalfn   PGNSP PGUID 12 1 0 0 0 f f f f f i
--- 849,859 ----
DESCR("aggregate final function");
DATA(insert OID = 2335 (  array_agg		   PGNSP PGUID 12 1 0 0 0 t f f f f i 1 0 2277 "2283" _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
DESCR("concatenate aggregate input into an array");
+ DATA(insert OID = 3816 (  array_typanalyze PGNSP PGUID 12 1 0 0 0 f f f t f s 1 0 16 "2281" _null_ _null_ _null_ _null_ array_typanalyze _null_ _null_ _null_ ));
+ DESCR("array statistics collector");
+ #define ArrayTypanalyzeOid 3816

Use the fmgroids.h symbol, F_ARRAY_TYPANALYZE, instead.

+ DATA(insert OID = 3817 (  arraysel		   PGNSP PGUID 12 1 0 0 0 f f f t f s 4 0 701 "2281 26 2281 23" _null_ _null_ _null_ _null_ arraysel _null_ _null_ _null_ ));
+ DESCR("array selectivity estimation functions");
DATA(insert OID = 760 (  smgrin			   PGNSP PGUID 12 1 0 0 0 f f f t f s 1 0 210 "2275" _null_ _null_ _null_ _null_	smgrin _null_ _null_ _null_ ));
DESCR("I/O");
*** a/src/include/catalog/pg_statistic.h
--- b/src/include/catalog/pg_statistic.h

/*
* Currently, three statistical slot "kinds" are defined: most common values,

Here's a larger quote of the comment starting here:

/*
* Currently, three statistical slot "kinds" are defined: most common values,
* histogram, and correlation. Additional "kinds" will probably appear in
* future to help cope with non-scalar datatypes. Also, custom data types
* can define their own "kind" codes by mutual agreement between a custom
* typanalyze routine and the selectivity estimation functions of the type's
* operators.

That needs an update. (It already needs an update for STATISTIC_KIND_MCELEM,
but this patch would invalidate it yet again.)

***************
*** 260,263 **** typedef FormData_pg_statistic *Form_pg_statistic;
--- 268,274 ----
*/
#define STATISTIC_KIND_MCELEM  4
+ 
+ #define STATISTIC_KIND_LENGTH_HISTOGRAM  5

The other kinds have long explanatory comments; this one should, too.

*** a/src/include/catalog/pg_type.h
--- b/src/include/catalog/pg_type.h

! DATA(insert OID = 1021 ( _float4 PGNSP PGUID -1 f b A f t \054 0 700 0 array_in array_out array_recv array_send - - array_typanalyze i x f 0 -1 0 0 _null_ _null_ ));

With this patch, a fresh database sets typanalyze = array_typanalyze for 27
array types and leaves typanalyze = NULL for the other 38 array types. What
is the rationale for the split? For example, why does real[] use the new
typanalyze but numeric[] does not?

Thanks,
nm

Attachments:

regression.diffstext/plain; charset=us-asciiDownload+4-2
#7Alexander Korotkov
aekorotkov@gmail.com
In reply to: Noah Misch (#6)
Re: Collect frequency statistics for arrays

Hi!

Thanks for your great work on reviewing this patch. Now I'm trying to find
memory corruption bug. Unfortunately it doesn't appears on my system. Can
you check if this bug remains in attached version of patch. If so, please
provide me information about system you're running (processor, OS etc.).

------
With best regards,
Alexander Korotkov.

Attachments:

arrayanalyze-0.8.patch.gzapplication/x-gzip; name=arrayanalyze-0.8.patch.gzDownload
#8Noah Misch
noah@leadboat.com
In reply to: Alexander Korotkov (#7)
Re: Collect frequency statistics for arrays

On Wed, Jan 04, 2012 at 12:09:16AM +0400, Alexander Korotkov wrote:

Thanks for your great work on reviewing this patch. Now I'm trying to find
memory corruption bug. Unfortunately it doesn't appears on my system. Can
you check if this bug remains in attached version of patch. If so, please
provide me information about system you're running (processor, OS etc.).

I get the same diagnostic from this version. Opteron processor, operating
system is Ubuntu 8.04 (64-bit). You're using --enable-cassert, right?

#9Alexander Korotkov
aekorotkov@gmail.com
In reply to: Noah Misch (#8)
Re: Collect frequency statistics for arrays

On Wed, Jan 4, 2012 at 12:33 AM, Noah Misch <noah@leadboat.com> wrote:

On Wed, Jan 04, 2012 at 12:09:16AM +0400, Alexander Korotkov wrote:

Thanks for your great work on reviewing this patch. Now I'm trying to

find

memory corruption bug. Unfortunately it doesn't appears on my system. Can
you check if this bug remains in attached version of patch. If so, please
provide me information about system you're running (processor, OS etc.).

I get the same diagnostic from this version. Opteron processor, operating
system is Ubuntu 8.04 (64-bit). You're using --enable-cassert, right?

Oh, actually no. Thanks for point.

------
With best regards,
Alexander Korotkov.

#10Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#6)
Re: Collect frequency statistics for arrays

Corrections:

On Thu, Dec 29, 2011 at 11:35:00AM -0500, Noah Misch wrote:

On Wed, Nov 09, 2011 at 08:49:35PM +0400, Alexander Korotkov wrote:

+  *	We set s to be the estimated frequency of the K'th element in a natural
+  *	language's frequency table, where K is the target number of entries in
+  *	the MCELEM array. We assume that the distribution of element frequencies
+  *	follows Zipf's law with an exponent of 1.
+  *
+  *	Assuming Zipfian distribution, the frequency of the K'th element is equal
+  *	to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of
+  *	elements in the language.	Putting W as one million, we get roughly
+  *	0.07/K. This gives s = 0.07/K.	We set epsilon = s/10, which gives bucket
+  *	width w = K/0.007 and maximum expected hashtable size of about 1000 * K.

These last two paragraphs, adapted from ts_typanalyze.c, assume natural
language documents. To what extent do these parameter choices remain sensible
for arbitrary data such as users may place in arrays? In any event, we need a
different justification, even if it's just a hand-wavy justification.

If I'm following this correctly, this choice of "s" makes the algorithm
guaranteed to find only elements constituting >= 7% of the input elements.
Incidentally, isn't that far too high for natural language documents? If the
English word "the" only constitutes 7% of typical documents, then this "s"
value would permit us to discard practically every word; we'd be left with
words read while filling the last bucket and words that happened to repeat
exceedingly often in the column. I haven't tried to make a test case to
observe this problem; am I missing something? (This question is largely
orthogonal to your patch.)

No, we'll find elements of frequency at least 0.07/(default_statistics_target
* 10) -- in the default configuration, 0.007%. Also, ts_typanalyze() counts
the number of documents that contain one or more instances of each lexeme,
ignoring the number of appearances within each document. The word "the" may
constitute 7% of a typical document, but it will appear at least once in
nearly 100% of documents. Therefore, this "s" value is adequate even for the
pathological case of each "document" containing just one lexeme.

+  *
+  *	Note: in the above discussion, s, epsilon, and f/N are in terms of a
+  *	element's frequency as a fraction of all elements seen in the input.
+  *	However, what we actually want to store in the finished pg_statistic
+  *	entry is each element's frequency as a fraction of all rows that it occurs
+  *	in. Elements might be repeated in the same array. Since operators
+  *	<@, &&, @> takes care only about element occurence itself and not about
+  *	occurence count, function takes additional care about uniqueness of
+  *	counting. Also we need to change the divisor from N to nonnull_cnt to get
+  *	the number we want.

On the same tangent, why does ts_typanalyze() not deduplicate the same way?
The @@ operator has the same property.

Answer: to_tsvector() will have already done so.

+ 	/*
+ 	 * We set bucket width equal to (num_mcelem + 10) / 0.007 as per the
+ 	 * comment above.
+ 	 */
+ 	bucket_width = num_mcelem * 1000 / 7;

The addend mentioned is not present in the code or discussed in "the comment
above". (I see the comment is copied verbatim from ts_typanalyze(), where the
addend *is* present, though again the preceding comment says nothing of it.)

The addend rationale in fact does appear in the ts_typanalyze() comment.

Thanks,
nm

#11Alexander Korotkov
aekorotkov@gmail.com
In reply to: Noah Misch (#10)
Re: Collect frequency statistics for arrays

Hi!

Patch where most part of issues are fixed is attached.

On Thu, Dec 29, 2011 at 8:35 PM, Noah Misch <noah@leadboat.com> wrote:

I find distressing the thought of having two copies of the lossy sampling
code, each implementing the algorithm with different variable names and

levels

of generality. We might someday extend this to hstore, and then we'd

have yet

another copy. Tom commented[1] that ts_typanalyze() and

array_typanalyze()

should remain distinct, and I agree. However, they could call a shared
counting module. Is that practical? Possible API:

typedef struct LossyCountCtl;
LossyCountCtl *LossyCountStart(float s,
float epsilon,
int2 typlen,
bool typbyval,
Oid eqfunc); /*

+ hash func, a few others */

void LossyCountAdd(LossyCountCtl *ctl, Datum elem);
TrackItem **LossyCountGetAll(LossyCountCtl *ctl);

[1]

http://archives.postgresql.org/message-id/12406.1298055475@sss.pgh.pa.us

I'm not sure about shared lossy counting module, because part of shared
code would be relatively small. Part of compute_array_stats function which
is taking care about array decompression, distinct occurence calculation,
disting element count histogram, packing statistics slots etc is much
larger than lossy counting algorithm itself. May be, there is some other
opinions in community?

I think this is an improvement, but some code out there may rely on the
ability to get stakind = 4 data from the most_common_vals column. We'll

need

to mention this in the release notes as an incompatibility.

I'm not sure I understand mechanism of release notes. Does it require
something in a patch itself?

+ /*
+  * Let be n independent events with probabilities p. This function

calculates

+  * probabilities of exact k of events occurence for k in [0;m].
+  * Imagine matrix M of (n + 1) x (m + 1) size. Element M[i,j] denotes
+  * probability that exact j of first i events occurs. Obviously

M[0,0] = 1.

+ * Each next event increase total number of occured events if it

occurs and

+ * leave last value of that number if it doesn't occur. So, by the

law of

+ * total probability: M[i,j] = M[i - 1, j] * (1 - p[i]) + M[i - 1, j

- 1] * p[i]

+  * for i > 0, j > 0. M[i,0] = M[i - 1, 0] * (1 - p[i]) for i > 0.
+  * Also there could be some events with low probabilities. Their

summary

+  * probability passed in the rest parameter.
+  */
+ static float *
+ calc_distr(float *p, int n, int m, float rest)
+ {
+     /* Take care about events with low probabilities. */
+     if (rest > 0.0f)
+     {
+             /*
+              * The probability of no occurence of events which forms

"rest"

+ * probability have a limit of exp(-rest) when number of

events fo to

+ * infinity. Another simplification is to replace that

events with one

+              * event with (1 - exp(-rest)) probability.
+              */
+             rest = 1.0f - exp(-rest);

What is the name of the underlying concept in probability theory?

The most closest concept to caculated distribution is multinomial
distribution. But it's not exactly same, because multinomial distribution
gives probability of particular count of each event occurece, not
probability of summary occurence. Actually, distribution is caclulated just
from assumption of events independence. The most closest concept of rest
probability is approximation by exponential distribution. It's quite rough
approximation, but I can't invent something better with low calculation
complexity.

+ /*
+  * Array selectivity estimation based on most common elements

statistics for

+ * "column <@ const" case. Assumption that element occurences are

independent

+ * means certain distribution of array lengths. Typically real

distribution

+ * of lengths is significantly different from it. For example, if

even we

+ * have set of arrays with 1 integer element in range [0;10] each,

element

+ * occurences are not independent. Because in the case of

independence we

Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
'{6,19,4}' cannot appear?

I refer column where only one element exists, i.e. only possible values are
'{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}', '{7}', '{8}', '{9}',
'{10}'. That is a corner case. But similar situation occurs when, for
example, we've distribution of distinct element count between 1 and 3. It
significantly differs from distribution from independent occurence.

+ * have probabilities of length of 0, 1, 2 etc. In the "column @>

const"

+ * and "column && const" cases we usually have "const" with low

summary

+ * frequency of elements (otherwise we have selectivity close to 0 or

1

+ * correspondingly). That's why effect of dependence related to

lengths

+ * distribution id negligible there. In the "column <@ const" case

summary

+ * frequency of elements is high (otherwise we have selectivity close

to 0).

What does the term "summary frequency" denote?

I meant summ of frequences of "const" array elements.

+     /*
+      * Rest is a average length of elements which aren't present in

mcelem.

+ */
+ rest = avg_length;

You define "rest" here as an array length ...

+
+     default_freq = Min(DEFAULT_CONT_SEL, minfreq / 2);
+
+     mcelem_index = 0;
+
+     /*
+      * mult is the multiplier that presents estimate of probability

that each

+      * mcelem which is not present in constant doesn't occur.
+      */
+     mult = 1.0f;
+
+     for (i = 0; i < nitems; i++)
+     {
+             bool            found = false;
+
+             /* Comparison with previous value in order to guarantee

uniquness */

+             if (i > 0)
+             {
+                     if (!element_compare(&array_data[i - 1],

&array_data[i]))

+                             continue;
+             }
+
+             /*
+              * Iterate over mcelem until find mcelem that is greater

or equal to

+ * element of constant. Simultaneously taking care about

rest and

+ * mult. If that mcelem is found then fill corresponding

elem_selec.

+              */
+             while (mcelem_index < nmcelem)
+             {
+                     int                     cmp =

element_compare(&mcelem[mcelem_index], &array_data[i]);

+
+                     if (cmp < 0)
+                     {
+                             mult *= (1.0f - numbers[mcelem_index]);
+                             rest -= numbers[mcelem_index];

... But here, you're subtracting a frequency from an array length?

Yes, because average distinct element count is summ of frequencies of
elements. Substracting mcelem frequencies from avg_length we have summ of
frequencies of non-mcelem elements.

------
With best regards,
Alexander Korotkov.

Attachments:

arrayanalyze-0.9.patch.gzapplication/x-gzip; name=arrayanalyze-0.9.patch.gzDownload
#12Noah Misch
noah@leadboat.com
In reply to: Alexander Korotkov (#11)
Re: Collect frequency statistics for arrays

On Sat, Jan 07, 2012 at 09:36:42PM +0400, Alexander Korotkov wrote:

Patch where most part of issues are fixed is attached.

Thanks. I've made several, largely cosmetic, edits. See attached version
0.10. Please use it as the basis for your next version, and feel free to
revert any changes you deem inappropriate. Where I made non-cosmetic edits, I
attempt to point that out below. I've left unfixed a few more-substantive
problems, also described below.

When you post another update, could you add it to the open CF? Given the
timing, I think we might as well consider any further activity to have
happened under the aegis of the 2012-01 CF. I'm marking the current entry
Returned with Feedback.

On Thu, Dec 29, 2011 at 8:35 PM, Noah Misch <noah@leadboat.com> wrote:

I find distressing the thought of having two copies of the lossy sampling
code, each implementing the algorithm with different variable names and

levels

of generality. We might someday extend this to hstore, and then we'd

have yet

another copy. Tom commented[1] that ts_typanalyze() and

array_typanalyze()

should remain distinct, and I agree. However, they could call a shared
counting module. Is that practical? Possible API:

typedef struct LossyCountCtl;
LossyCountCtl *LossyCountStart(float s,
float epsilon,
int2 typlen,
bool typbyval,
Oid eqfunc); /*

+ hash func, a few others */

void LossyCountAdd(LossyCountCtl *ctl, Datum elem);
TrackItem **LossyCountGetAll(LossyCountCtl *ctl);

[1]

http://archives.postgresql.org/message-id/12406.1298055475@sss.pgh.pa.us

I'm not sure about shared lossy counting module, because part of shared
code would be relatively small. Part of compute_array_stats function which
is taking care about array decompression, distinct occurence calculation,
disting element count histogram, packing statistics slots etc is much
larger than lossy counting algorithm itself. May be, there is some other
opinions in community?

True; it would probably increase total lines of code. The benefit, if any,
lies in separation of concerns; the business of implementing this algorithm is
quite different from the other roles of these typanalyze functions. I won't
insist that you try it, though.

I think this is an improvement, but some code out there may rely on the
ability to get stakind = 4 data from the most_common_vals column. We'll

need

to mention this in the release notes as an incompatibility.

I'm not sure I understand mechanism of release notes. Does it require
something in a patch itself?

No. I just wanted to call attention to the fact in the hope that someone
remembers as the release notes get drafted.

+             /*
+              * The probability of no occurence of events which forms

"rest"

+ * probability have a limit of exp(-rest) when number of

events fo to

+ * infinity. Another simplification is to replace that

events with one

+              * event with (1 - exp(-rest)) probability.
+              */
+             rest = 1.0f - exp(-rest);

What is the name of the underlying concept in probability theory?

The most closest concept to caculated distribution is multinomial
distribution. But it's not exactly same, because multinomial distribution
gives probability of particular count of each event occurece, not
probability of summary occurence. Actually, distribution is caclulated just
from assumption of events independence. The most closest concept of rest
probability is approximation by exponential distribution. It's quite rough
approximation, but I can't invent something better with low calculation
complexity.

Do you have a URL of a tutorial or paper that explains the method in more
detail? If, rather, this is a novel synthesis, could you write a proof to
include in the comments?

+ /*
+  * Array selectivity estimation based on most common elements

statistics for

+ * "column <@ const" case. Assumption that element occurences are

independent

+ * means certain distribution of array lengths. Typically real

distribution

+ * of lengths is significantly different from it. For example, if

even we

+ * have set of arrays with 1 integer element in range [0;10] each,

element

+ * occurences are not independent. Because in the case of

independence we

Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
'{6,19,4}' cannot appear?

I refer column where only one element exists, i.e. only possible values are
'{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}', '{7}', '{8}', '{9}',
'{10}'. That is a corner case. But similar situation occurs when, for
example, we've distribution of distinct element count between 1 and 3. It
significantly differs from distribution from independent occurence.

Oh, I think I see now. If each element 1..10 had frequency 0.1 independently,
column values would have exactly one distinct element just 39% of the time?

If probability theory has a prototypical problem resembling this, it would be
nice to include a URL to a thorough discussion thereof. I could not think of
the search terms to find one, though.

+ * have probabilities of length of 0, 1, 2 etc. In the "column @>

const"

+ * and "column && const" cases we usually have "const" with low

summary

+ * frequency of elements (otherwise we have selectivity close to 0 or

1

+ * correspondingly). That's why effect of dependence related to

lengths

+ * distribution id negligible there. In the "column <@ const" case

summary

+ * frequency of elements is high (otherwise we have selectivity close

to 0).

What does the term "summary frequency" denote?

I meant summ of frequences of "const" array elements.

Do you mean literally P_0 + P_1 ... + P_N? If so, I can follow the above
argument for "column && const" and "column <@ const", but not for "column @>
const". For "column @> const", selectivity cannot exceed the smallest
frequency among elements of "const". Several high-frequency elements together
will drive up the sum of frequencies without increasing the true selectivity.

+     /*
+      * Rest is a average length of elements which aren't present in

mcelem.

+ */
+ rest = avg_length;

You define "rest" here as an array length ...

+ rest -= numbers[mcelem_index];

... But here, you're subtracting a frequency from an array length?

Yes, because average distinct element count is summ of frequencies of
elements. Substracting mcelem frequencies from avg_length we have summ of
frequencies of non-mcelem elements.

I see now; thanks. I updated the comments so that this would have been
clearer to me.

*** /dev/null
--- b/src/backend/utils/adt/array_selfuncs.c
+ Selectivity
+ calc_scalararraysel(VariableStatData *vardata, Datum constval, bool orClause,
+ 					Oid operator)
+ {
+ 	Oid			elemtype;
+ 	Selectivity selec;
+ 	TypeCacheEntry *typentry;
+ 	Datum	   *hist;
+ 	int			nhist;
+ 	FunctionCallInfoData cmpfunc;
+ 
+ 	elemtype = get_base_element_type(vardata->vartype);
+ 
+ 
+ 	/* Get default comparison function */
+ 	typentry = lookup_type_cache(elemtype,
+ 		   TYPECACHE_CMP_PROC | TYPECACHE_CMP_PROC_FINFO | TYPECACHE_EQ_OPR);
+ 
+ 	/* Handle only "=" operator. Return default selectivity in other cases. */
+ 	if (operator != typentry->eq_opr)
+ 		return (Selectivity) 0.5;

Punting on other operators this way creates a plan quality regression for
operations like "const < ANY (column)". Please do it some way that falls
back on the somewhat-better existing scalararraysel() treatment for this.

+ 
+ 	/* Without comparison function return default selectivity estimation */
+ 	if (!OidIsValid(typentry->cmp_proc))
+ 	{
+ 		if (orClause)
+ 			return DEFAULT_OVERLAP_SEL;
+ 		else
+ 			return DEFAULT_CONTAIN_SEL;
+ 	}

Since "const = ANY (column)" is equivalent to "column @> array[const]" and
"const = ALL (column)" is equivalent to "column <@ array[const]",
DEFAULT_CONTAIN_SEL is always correct here. I've made that change.

+ /*
+  * Calculate first n distinct element counts probabilities by histogram. We
+  * assume that any interval between a and b histogram values gives
+  * 1 / ((b - a + 1) * (nhist - 1)) probability to values between a and b and
+  * half of that to a and b. Returns total probability that distinct element
+  * count is less of equal to n.
+  */
+ static float
+ calc_hist(Datum *hist, int nhist, float *hist_part, int n)

To test this function, I ran the following test case:

set default_statistics_target = 4;
create table t3 as select array(select * from generate_series(1, v)) as arr
from (values (2),(2),(2),(3),(5),(5),(5)) v(v), generate_series(1,100);
analyze t3; -- length_histogram_bounds = {2,2,5,5}
select * from t3 where arr <@ array[6,7,8,9,10,11];

Using gdb to observe calc_hist()'s result during the last command:

(gdb) p calc_hist(hist, nhist, hist_part, unique_nitems)
$23 = 0.666666687
(gdb) x/6f hist_part
0xcd4bc8: 0 0 0.333333343 0
0xcd4bd8: 0 0.333333343

I expected an equal, nonzero probability in hist_part[3] and hist_part[4] and
a total probability of 1.0.

+ {
+ 	int			k,
+ 				i = 0,
+ 				prev_interval = 0,
+ 				next_interval = 0;
+ 	float		frac,
+ 				total = 0.0f;
+ 
+ 	/*
+ 	 * frac is a probability contribution by each interval between histogram
+ 	 * values. We have nhist - 1 intervals. Contribution of one will be 1 /
+ 	 * (nhist - 1).
+ 	 */
+ 	frac = 1.0f / ((float) (nhist - 1));
+ 	for (k = 0; k <= n; k++)
+ 	{
+ 		int			count = 0;
+ 
+ 		/* Count occurences of k distinct element counts in histogram. */
+ 		while (i < nhist && DatumGetInt32(hist[i]) <= k)
+ 		{
+ 			if (DatumGetInt32(hist[i]) == k)
+ 				count++;
+ 			i++;
+ 		}
+ 
+ 		if (count > 0)
+ 		{
+ 			float		val;
+ 
+ 			/* Find length between current histogram value and the next one */
+ 			if (i < nhist)
+ 				next_interval = DatumGetInt32(hist[i + 1]) -

Doesn't this read past the array end when i == nhist - 1?

+ /*
+  * Let be n independent events with probabilities p. This function calculates
+  * probabilities of exact k of events occurence for k in [0;m].
+  * Imagine matrix M of (n + 1) x (m + 1) size. Element M[i,j] denotes
+  * probability that exact j of first i events occurs. Obviously M[0,0] = 1.
+  * Each next event increase total number of occured events if it occurs and
+  * leave last value of that number if it doesn't occur. So, by the law of
+  * total probability: M[i,j] = M[i - 1, j] * (1 - p[i]) + M[i - 1, j - 1] * p[i]
+  * for i > 0, j > 0. M[i,0] = M[i - 1, 0] * (1 - p[i]) for i > 0.
+  * Also there could be some events with low probabilities. Their summary
+  * probability passed in the rest parameter.
+  */
+ static float *
+ calc_distr(float *p, int n, int m, float rest)

I attempted to clarify this comment; please see if I preserved its accuracy.

+ 	/*
+ 	 * Using of distinct element counts histogram requires O(nitems * (nmcelem
+ 	 * + nitems)) operations. It's reasonable to limit the number of required
+ 	 * operation and give less accurate answer when this limit exceed.
+ 	 */
+ 	if (nhist > 0 && unique_nitems <=
+ 		300 * default_statistics_target / (nmcelem + unique_nitems))

I benchmarked the quadratic complexity here. With default settings, this
cutoff skips the algorithm beginning around a 170-element constant array,
which would nonetheless take single-digit milliseconds to plan. When I
temporarily removed the cutoff, I needed much larger scales to get poor plan
times. 5000 elements took 180ms to plan, and 10000 elements took 620 ms.
Bottom line, the cutoff you've chosen is plenty conservative.

+ static int
+ element_compare(const void *key1, const void *key2, void *arg)
+ {
+ 	const Datum *d1 = (const Datum *) key1;
+ 	const Datum *d2 = (const Datum *) key2;
+ 	FunctionCallInfo cmpfunc = (FunctionCallInfo) arg;
+ 
+ 	cmpfunc   ->arg[0] = *d1;
+ 	cmpfunc   ->arg[1] = *d2;
+ 	cmpfunc   ->argnull[0] = false;
+ 	cmpfunc   ->argnull[1] = false;
+ 	cmpfunc   ->isnull = false;

This indented poorly due to "cmpfunc" having a place in our typedefs list. I
changed the identifier.

*** /dev/null
--- b/src/backend/utils/adt/array_typanalyze.c
+  *	We set s to be the estimated frequency of the K'th element in a natural
+  *	language's frequency table, where K is the target number of entries in
+  *	the MCELEM array. We assume that the distribution of element frequencies
+  *	follows Zipf's law with an exponent of 1.
+  *
+  *	Assuming Zipfian distribution, the frequency of the K'th element is equal
+  *	to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of
+  *	elements in the language.	Putting W as one million, we get roughly
+  *	0.07/K. This gives s = 0.07/K.	We set epsilon = s/10, which gives bucket
+  *	width w = K/0.007 and maximum expected hashtable size of about 1000 * K.

Given the lack of applicability to arrays, I replaced these last two
paragraphs with some weasel words. My gut feeling is that we're priming the
algorithm to deliver answers far more precise than needed. However, I haven't
attempted a principled replacement.

+ 	/* This is 'w' from the LC algorithm */
+ 	int			bucket_width;
+ 	int			array_no,
+ 				element_no;

I think it's possible for element_no to overflow. Consider rows with 2000
distinct elements apiece at a statistics target of 10000 (3M sample rows).
So, I made it a uint64.

+ extra_data = (ArrayAnalyzeExtraData *) stats->extra_data;

This still isn't reentrant; you'd need to save the existing static extra_data
and restore it on function exit. However, it turns out that do_analyze_rel()
itself isn't reentrant on account of its similar management of "anl_context";
any nested ANALYZE crashes the backend. So, I don't think we need further
change here. It will be easy to make reentrant later if necessary, though I'd
probably fix do_analyze_rel() by just throwing an error on recursive ANALYZE.

+ 	stats->extra_data = extra_data->std_extra_data;
+ 	old_context = CurrentMemoryContext;
+ 	extra_data->std_compute_stats(stats, fetchfunc, samplerows, totalrows);
+ 	MemoryContextSwitchTo(old_context);

Is the callee known to change CurrentMemoryContext and not restore it?
Offhand, I'm not seeing how it could do so.

+ 	/*
+ 	 * hashtable for arrays distinct element count.
+ 	 */
+ 	MemSet(&count_hash_ctl, 0, sizeof(count_hash_ctl));
+ 	count_hash_ctl.keysize = sizeof(int);
+ 	count_hash_ctl.entrysize = sizeof(DistinctElementCountItem);
+ 	count_hash_ctl.hash = tag_hash;
+ 	count_hash_ctl.match = memcmp;
+ 	count_hash_ctl.hcxt = CurrentMemoryContext;
+ 	count_tab = hash_create("Array distinct element count table",
+ 							64,
+ 							&count_hash_ctl,
+ 					HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);

This HASH_COMPARE setting is redundant, so I've removed it.

+ 		/* Skip too large values. */
+ 		if (toast_raw_datum_size(value) > WIDTH_THRESHOLD)

Fixed this warning:

array_typanalyze.c: In function `compute_array_stats':
array_typanalyze.c:361: warning: implicit declaration of function `toast_raw_datum_size'

+ 			continue;
+ 		else
+ 			analyzed_rows++;
+ 
+ 		/*
+ 		 * Add up widths for average-width calculation.  Since it's a array,
+ 		 * we know it's varlena.  As in the regular compute_minimal_stats
+ 		 * function, we use the toasted width for this calculation.
+ 		 */
+ 		total_width += VARSIZE_ANY(DatumGetPointer(value));

Since this is now unused, I removed it.

+ 			/* Lookup current element in hashtable, adding it if new */
+ 			item = (TrackItem *) hash_search(elements_tab,
+ 											 (const void *) &hash_key,
+ 											 HASH_ENTER, &found);
+ 
+ 			if (found)
+ 			{

I added a pfree(hash_key) here. In one of my default_statistics_target=3000
tests on a table with few possible elements, this saved hundreds of megabytes
of memory.

+ 				int			i;
+ 
+ 				/*
+ 				 * The element is already on the tracking list. Check if it's
+ 				 * first occurence of this element in array.
+ 				 */
+ 				for (i = 0; i < occurence_index; i++)
+ 				{
+ 					if (occurences[i] == item)
+ 						break;
+ 				}

This wasn't what I had in mind when I suggested the different approach last
time. See how I changed it in this version, and let me know if you see any
essential disadvantages.

+ 		/* Update frequency of particular array distinct element count. */
+ 		count_item = (DistinctElementCountItem *) hash_search(count_tab,
+ 															&occurence_index,
+ 											  HASH_ENTER, &count_item_found);
+ 		if (count_item_found)
+ 			count_item->frequency++;
+ 		else
+ 		{
+ 			count_item->count = occurence_index;

The key gets initialized automatically, so I removed this line.

+ 			count_item->frequency = 1;
+ 		}
+ 		total_distinct_count += occurence_index;

total_distinct_count seemed to follow element_no exactly, so I removed it.

*** a/src/include/catalog/pg_statistic.h
--- b/src/include/catalog/pg_statistic.h
***************
*** 260,263 **** typedef FormData_pg_statistic *Form_pg_statistic;
--- 268,285 ----
*/
#define STATISTIC_KIND_MCELEM  4
+ /*
+  * A "length histogram" slot describes the distribution of lengths of data for
+  * datatypes where length is important for selectivity estimation. stavalues
+  * contains M (>=2) non-null values that divide the non-null column data values
+  * into M-1 bins of approximately equal population. The first stavalues item
+  * is the minimum length and the last is the maximum length. In dependence on
+  * datatype this slot can hold distribution of not exactly length, but of
+  * similar value. For instance, it hold distribution of distinct elements count
+  * for arrays, because multiple occurences of array elements are ignored by
+  * array comparison operators. 
+  *
+  */
+ #define STATISTIC_KIND_LENGTH_HISTOGRAM  5

I changed this text to say that we always store distinct element counts. We
can always update the comment later if we diversify its applications.

*** a/src/include/catalog/pg_type.h
--- b/src/include/catalog/pg_type.h

This now updates all array types except record[]. I'm don't know offhand how
to even make a non-empty value of type record[], let alone get it into a
context where ANALYZE would see it. However, is there a particular reason to
make that one different?

Thanks,
nm

Attachments:

arrayanalyze-0.10.patchtext/plain; charset=us-asciiDownload+2062-245
#13Alexander Korotkov
aekorotkov@gmail.com
In reply to: Noah Misch (#12)
Re: Collect frequency statistics for arrays

Hi!

Thanks for your fixes to the patch. Them looks correct to me. I did some
fixes in the patch. The proof of some concepts is still needed. I'm going
to provide it in a few days.

On Thu, Jan 12, 2012 at 3:06 PM, Noah Misch <noah@leadboat.com> wrote:

I'm not sure about shared lossy counting module, because part of shared
code would be relatively small. Part of compute_array_stats function

which

is taking care about array decompression, distinct occurence calculation,
disting element count histogram, packing statistics slots etc is much
larger than lossy counting algorithm itself. May be, there is some other
opinions in community?

True; it would probably increase total lines of code. The benefit, if any,
lies in separation of concerns; the business of implementing this
algorithm is
quite different from the other roles of these typanalyze functions. I
won't
insist that you try it, though.

I'd prefer to try it as separate patch.

+             /*
+              * The probability of no occurence of events which

forms

"rest"

+ * probability have a limit of exp(-rest) when number

of

events fo to

+ * infinity. Another simplification is to replace that

events with one

+              * event with (1 - exp(-rest)) probability.
+              */
+             rest = 1.0f - exp(-rest);

What is the name of the underlying concept in probability theory?

The most closest concept to caculated distribution is multinomial
distribution. But it's not exactly same, because multinomial distribution
gives probability of particular count of each event occurece, not
probability of summary occurence. Actually, distribution is caclulated

just

from assumption of events independence. The most closest concept of rest
probability is approximation by exponential distribution. It's quite

rough

approximation, but I can't invent something better with low calculation
complexity.

Do you have a URL of a tutorial or paper that explains the method in more
detail? If, rather, this is a novel synthesis, could you write a proof to
include in the comments?

Unfortunately I don't have relevant paper for it. I'll try to search
it. Otherwise I'll try to do some proof.

+ /*
+  * Array selectivity estimation based on most common elements

statistics for

+ * "column <@ const" case. Assumption that element occurences are

independent

+ * means certain distribution of array lengths. Typically real

distribution

+ * of lengths is significantly different from it. For example, if

even we

+ * have set of arrays with 1 integer element in range [0;10] each,

element

+ * occurences are not independent. Because in the case of

independence we

Do you refer to a column where '{1,12,46}' and '{13,7}' may appear, but
'{6,19,4}' cannot appear?

I refer column where only one element exists, i.e. only possible values

are

'{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}', '{7}', '{8}', '{9}',
'{10}'. That is a corner case. But similar situation occurs when, for
example, we've distribution of distinct element count between 1 and 3. It
significantly differs from distribution from independent occurence.

Oh, I think I see now. If each element 1..10 had frequency 0.1
independently,
column values would have exactly one distinct element just 39% of the time?

Yes, it's right.

If probability theory has a prototypical problem resembling this, it would
be
nice to include a URL to a thorough discussion thereof. I could not think
of
the search terms to find one, though.

Actually, usage of both distinct element count histogram and element
occurrence frequencies produce some probability distribution (which is more
complex than just independent element occurrence). If real distribution is
close this distribution, then estimate is accurate. I didn't met such
distributions is papers, actually I've just invented it in my tries to do
accurate "column <@ const" estimation at least in simple cases. I'll try to
search for similar things in papers, but I doubt I'll succeed. Otherwise
I'll try to do some more detailed proof.

*** /dev/null
--- b/src/backend/utils/adt/array_selfuncs.c
+ Selectivity
+ calc_scalararraysel(VariableStatData *vardata, Datum constval, bool

orClause,

+                                     Oid operator)
+ {
+     Oid                     elemtype;
+     Selectivity selec;
+     TypeCacheEntry *typentry;
+     Datum      *hist;
+     int                     nhist;
+     FunctionCallInfoData cmpfunc;
+
+     elemtype = get_base_element_type(vardata->vartype);
+
+
+     /* Get default comparison function */
+     typentry = lookup_type_cache(elemtype,
+                TYPECACHE_CMP_PROC | TYPECACHE_CMP_PROC_FINFO |

TYPECACHE_EQ_OPR);

+
+     /* Handle only "=" operator. Return default selectivity in other

cases. */

+ if (operator != typentry->eq_opr)
+ return (Selectivity) 0.5;

Punting on other operators this way creates a plan quality regression for
operations like "const < ANY (column)". Please do it some way that falls
back on the somewhat-better existing scalararraysel() treatment for this.

I've made calc_scalararraysel return -1 in this case or in the case of no
comparison function. scalararraysel continues it's work
when calc_scalararraysel returns negative value.

+ /*
+  * Calculate first n distinct element counts probabilities by

histogram. We

+  * assume that any interval between a and b histogram values gives
+  * 1 / ((b - a + 1) * (nhist - 1)) probability to values between a and

b and

+ * half of that to a and b. Returns total probability that distinct

element

+  * count is less of equal to n.
+  */
+ static float
+ calc_hist(Datum *hist, int nhist, float *hist_part, int n)

To test this function, I ran the following test case:

set default_statistics_target = 4;
create table t3 as select array(select * from generate_series(1, v)) as arr
from (values (2),(2),(2),(3),(5),(5),(5)) v(v), generate_series(1,100);
analyze t3; -- length_histogram_bounds = {2,2,5,5}
select * from t3 where arr <@ array[6,7,8,9,10,11];

Using gdb to observe calc_hist()'s result during the last command:

(gdb) p calc_hist(hist, nhist, hist_part, unique_nitems)
$23 = 0.666666687
(gdb) x/6f hist_part
0xcd4bc8: 0 0 0.333333343 0
0xcd4bd8: 0 0.333333343

I expected an equal, nonzero probability in hist_part[3] and hist_part[4]
and
a total probability of 1.0.

+ {
+     int                     k,
+                             i = 0,
+                             prev_interval = 0,
+                             next_interval = 0;
+     float           frac,
+                             total = 0.0f;
+
+     /*
+      * frac is a probability contribution by each interval between

histogram

+ * values. We have nhist - 1 intervals. Contribution of one will

be 1 /

+      * (nhist - 1).
+      */
+     frac = 1.0f / ((float) (nhist - 1));
+     for (k = 0; k <= n; k++)
+     {
+             int                     count = 0;
+
+             /* Count occurences of k distinct element counts in

histogram. */

+             while (i < nhist && DatumGetInt32(hist[i]) <= k)
+             {
+                     if (DatumGetInt32(hist[i]) == k)
+                             count++;
+                     i++;
+             }
+
+             if (count > 0)
+             {
+                     float           val;
+
+                     /* Find length between current histogram value and

the next one */

+                     if (i < nhist)
+                             next_interval = DatumGetInt32(hist[i + 1])

-

Doesn't this read past the array end when i == nhist - 1?

It was a bug. It also causes wrong histogram calculation you noted above.
Fixed.

+     stats->extra_data = extra_data->std_extra_data;
+     old_context = CurrentMemoryContext;
+     extra_data->std_compute_stats(stats, fetchfunc, samplerows,

totalrows);

+ MemoryContextSwitchTo(old_context);

Is the callee known to change CurrentMemoryContext and not restore it?
Offhand, I'm not seeing how it could do so.

Right. Saving of memory context is not needed. Removed.

*** a/src/include/catalog/pg_type.h

--- b/src/include/catalog/pg_type.h

This now updates all array types except record[]. I'm don't know offhand
how
to even make a non-empty value of type record[], let alone get it into a
context where ANALYZE would see it. However, is there a particular reason
to
make that one different?

Oh, I didn't update all array types in 2 tries :) Fixed.

------
With best regards,
Alexander Korotkov.

Attachments:

arrayanalyze-0.11.patch.gzapplication/x-gzip; name=arrayanalyze-0.11.patch.gzDownload
#14Noah Misch
noah@leadboat.com
In reply to: Alexander Korotkov (#13)
Re: Collect frequency statistics for arrays

On Tue, Jan 17, 2012 at 12:04:06PM +0400, Alexander Korotkov wrote:

Thanks for your fixes to the patch. Them looks correct to me. I did some
fixes in the patch. The proof of some concepts is still needed. I'm going
to provide it in a few days.

Your further fixes look good. Could you also answer my question about the
header comment of mcelem_array_contained_selec()?

/*
* Estimate selectivity of "column <@ const" based on most common element
* statistics. Independent element occurrence would imply a particular
* distribution of distinct element counts among matching rows. Real data
* usually falsifies that assumption. For example, in a set of 1-element
* integer arrays having elements in the range [0;10], element occurrences are
* not independent. If they were, a sufficiently-large set would include all
* distinct element counts 0 through 11. We correct for this using the
* histogram of distinct element counts.
*
* In the "column @> const" and "column && const" cases, we usually have
* "const" with low summary frequency of elements (otherwise we have
* selectivity close to 0 or 1 correspondingly). That's why the effect of
* dependence related to distinct element counts distribution is negligible
* there. In the "column <@ const" case, summary frequency of elements is
* high (otherwise we have selectivity close to 0). That's why we should do
* correction due to array distinct element counts distribution.
*/

By "summary frequency of elements", do you mean literally P_0 + P_1 ... + P_N?
If so, I can follow the above argument for "column && const" and "column <@
const", but not for "column @> const". For "column @> const", selectivity
cannot exceed the smallest frequency among const elements. A number of
high-frequency elements will drive up the sum of the frequencies without
changing the true selectivity much at all.

Thanks,
nm

#15Alexander Korotkov
aekorotkov@gmail.com
In reply to: Noah Misch (#14)
Re: Collect frequency statistics for arrays

Hi!

Updated patch is attached. I've updated comment
of mcelem_array_contained_selec with more detailed description of
probability distribution assumption. Also, I found that "rest" behavious
should be better described by Poisson distribution, relevant changes were
made.

On Tue, Jan 17, 2012 at 2:33 PM, Noah Misch <noah@leadboat.com> wrote:

By "summary frequency of elements", do you mean literally P_0 + P_1 ... +
P_N?
If so, I can follow the above argument for "column && const" and "column <@
const", but not for "column @> const". For "column @> const", selectivity
cannot exceed the smallest frequency among const elements. A number of
high-frequency elements will drive up the sum of the frequencies without
changing the true selectivity much at all.

Referencing to summary frequency is not really correct. It would be more
correct to reference to number of element in "const". When there are many
elements in "const", "column @> const" selectivity tends to be close to 0
and "column @> const" tends to be close to 1. Surely, it's true when
elements have some kind of middle values of frequencies (not very close to
0 and not very close to 1). I've replaced "summary frequency of elements"
by "number of elements".

------
With best regards,
Alexander Korotkov.

Attachments:

arrayanalyze-0.12.patch.gzapplication/x-gzip; name=arrayanalyze-0.12.patch.gzDownload
#16Noah Misch
noah@leadboat.com
In reply to: Alexander Korotkov (#15)
Re: Collect frequency statistics for arrays

On Mon, Jan 23, 2012 at 01:21:20AM +0400, Alexander Korotkov wrote:

Updated patch is attached. I've updated comment
of mcelem_array_contained_selec with more detailed description of
probability distribution assumption. Also, I found that "rest" behavious
should be better described by Poisson distribution, relevant changes were
made.

Thanks. That makes more of the math clear to me. I do not follow all of it,
but I feel that the comments now have enough information that I could go about
doing so.

+ 	/* Take care about events with low probabilities. */
+ 	if (rest > DEFAULT_CONTAIN_SEL)
+ 	{

Why the change from "rest > 0" to this in the latest version?

+ 		/* emit some statistics for debug purposes */
+ 		elog(DEBUG3, "array: target # mces = %d, bucket width = %d, "
+ 			 "# elements = %llu, hashtable size = %d, usable entries = %d",
+ 			 num_mcelem, bucket_width, element_no, i, track_len);

That should be UINT64_FMT. (I introduced that error in v0.10.)

I've attached a new version that includes the UINT64_FMT fix, some edits of
your newest comments, and a rerun of pgindent on the new files. I see no
other issues precluding commit, so I am marking the patch Ready for Committer.
If I made any of the comments worse, please post another update.

Thanks,
nm

Attachments:

arrayanalyze-0.13.patch.gzapplication/x-gunzipDownload
#17Alexander Korotkov
aekorotkov@gmail.com
In reply to: Noah Misch (#16)
Re: Collect frequency statistics for arrays

On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch <noah@leadboat.com> wrote:

+     /* Take care about events with low probabilities. */
+     if (rest > DEFAULT_CONTAIN_SEL)
+     {

Why the change from "rest > 0" to this in the latest version?

Ealier addition of "rest" distribution require O(m) time. Now there is a
more accurate and proved estimate, but it takes O(m^2) time.It doesn't make
general assymptotical time worse, but it significant. That's why I decided
to skip for low values of "rest" which don't change distribution
significantly.

+             /* emit some statistics for debug purposes */
+             elog(DEBUG3, "array: target # mces = %d, bucket width =

%d, "

+ "# elements = %llu, hashtable size = %d, usable

entries = %d",

+ num_mcelem, bucket_width, element_no, i,

track_len);

That should be UINT64_FMT. (I introduced that error in v0.10.)

I've attached a new version that includes the UINT64_FMT fix, some edits of
your newest comments, and a rerun of pgindent on the new files. I see no
other issues precluding commit, so I am marking the patch Ready for
Committer.

Great!

If I made any of the comments worse, please post another update.

Changes looks reasonable for me. Thanks!

------
With best regards,
Alexander Korotkov.

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#17)
Re: Collect frequency statistics for arrays

Alexander Korotkov <aekorotkov@gmail.com> writes:

On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch <noah@leadboat.com> wrote:

I've attached a new version that includes the UINT64_FMT fix, some edits of
your newest comments, and a rerun of pgindent on the new files. I see no
other issues precluding commit, so I am marking the patch Ready for
Committer.
If I made any of the comments worse, please post another update.

Changes looks reasonable for me. Thanks!

I am starting to look at this patch now. I'm wondering exactly why the
decision was made to continue storing btree-style statistics for arrays,
in addition to the new stuff. The pg_statistic rows for array columns
tend to be unreasonably wide already, and as-is this patch will make
them quite a lot wider. I think it requires more than a little bit of
evidence to continue storing stats that seem to have only small
probability of usefulness.

In particular, if we didn't store that stuff, we'd not need to widen the
number of columns in pg_statistic, which would noticeably reduce both
the footprint of the patch and the probability of breaking external
code.

regards, tom lane

#19Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#18)
Re: Collect frequency statistics for arrays

On Thu, Mar 1, 2012 at 12:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I am starting to look at this patch now. I'm wondering exactly why the
decision was made to continue storing btree-style statistics for arrays,
in addition to the new stuff. The pg_statistic rows for array columns
tend to be unreasonably wide already, and as-is this patch will make
them quite a lot wider. I think it requires more than a little bit of
evidence to continue storing stats that seem to have only small
probability of usefulness.

In particular, if we didn't store that stuff, we'd not need to widen the
number of columns in pg_statistic, which would noticeably reduce both
the footprint of the patch and the probability of breaking external
code.

Initially, I used existing slots for new statistics, but I've changed this
after the first review:
http://archives.postgresql.org/pgsql-hackers/2011-07/msg00780.php

Probably, btree statistics really does matter for some sort of arrays? For
example, arrays representing paths in the tree. We could request a subtree
in a range query on such arrays.

------
With best regards,
Alexander Korotkov.

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#19)
Re: Collect frequency statistics for arrays

Alexander Korotkov <aekorotkov@gmail.com> writes:

On Thu, Mar 1, 2012 at 12:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I am starting to look at this patch now. I'm wondering exactly why the
decision was made to continue storing btree-style statistics for arrays,

Probably, btree statistics really does matter for some sort of arrays? For
example, arrays representing paths in the tree. We could request a subtree
in a range query on such arrays.

That seems like a pretty narrow, uncommon use-case. Also, to get
accurate stats for such queries that way, you'd need really enormous
histograms. I doubt that the existing parameters for histogram size
will permit meaningful estimation of more than the first array entry
(since we don't make the histogram any larger than we do for a scalar
column).

The real point here is that the fact that we're storing btree-style
stats for arrays is an accident, backed into by having added btree
comparators for arrays plus analyze.c's habit of applying default
scalar-oriented analysis functions to any type without an explicit
typanalyze entry. I don't recall that we ever thought hard about
it or showed that those stats were worth anything.

regards, tom lane

#21Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#20)
#22Nathan Boley
npboley@gmail.com
In reply to: Tom Lane (#18)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Boley (#22)
#24Nathan Boley
npboley@gmail.com
In reply to: Tom Lane (#23)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Boley (#24)
#26Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#21)
#27Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#23)
#28Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#27)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#28)
#30Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#29)
#31Nathan Boley
npboley@gmail.com
In reply to: Tom Lane (#25)
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Boley (#31)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#30)
#34Nathan Boley
npboley@gmail.com
In reply to: Tom Lane (#33)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#26)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#35)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#36)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#26)
#39Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#38)
#40Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#38)
#41Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#40)
#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#39)
#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#42)
#44Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#43)
#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#44)
#46Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#45)
#47Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#46)
#48Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#47)
#49Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#45)