Merging statistics from children instead of re-sampling everything

Started by Tomas Vondra, almost 5 years ago, 17 messages
#1Tomas Vondra
tomas.vondra@enterprisedb.com
1 attachment(s)

Hi,

While reviewing the thread about issues with auto-analyze on partitioned
tables [1] I remembered that the way we build statistics on the parent
is somewhat expensive, because it samples rows from the children again.

It's true we sample much smaller amounts of rows from each partition
(proportional to its size), so it's not as expensive as simply running
ANALYZE on all partitions individually. But it's not free either, and in
some cases (e.g. with partitioning by time) it may touch old data that
is not touched by regular queries, so likely not cached etc. That's a
bit against the idea of using partitioning to "segregate" old data.

One reason why the ANALYZE costs are not a huge issue in practice is
that we're not actually triggering it - changes_since_analyze is not
updated for the parent, so autoanalyze does not realize it needs to do
anything, and we never rebuild the statistics :-( But as shown in [2],
that may lead to poor estimates (and bad plans) in cases when we rely on
the parent's stats (probably due to not expanding the list of children
early enough).

(Note: I wonder if we might simply not rely on the parent stats at all,
and always consult directly the children stats - but with many children
that's likely problematic/expensive, and it does have the same issues as
the merging.)

The other thread [1] attempts to fix that by incrementing the counters
for the parent too, but that will make the sampling cost much worse.
First, with multi-level partitioning we'd have to sample the children
repeatedly, essentially once for each level. Second, it's tricky to
control when exactly the counters are propagated to the parent, making
the parent analyzes more frequent. Not great.

Even if we do a great job in [1] and come up with smart heuristics,
there will always be cases where it does not work too well and we either
analyze too late or too often.

Note: Propagating changes_since_analyze is only part of the story, as it
does not do anything about stats after attach/detach of a partition.

This patch approaches the problem from the other end - instead of
tightly controlling when to analyze the parent, it makes the analyze
much cheaper. That means we don't need to worry too much about doing the
analyze too often, and we can update the stats more aggressively.

So, how does it work? Well, it simply fetches the statistics for all
children, and merges them together. That's fairly simple for most
statistics types (a sketch of the simpler merge rules follows the list):

1) stawidth, stanullfrac - Those are trivial to merge, as weighted
averages by the number of rows in each child.

2) stacorrelation - Not sure, I've used a weighted average for now.
Perhaps we could store the "internal" counters (sumX, sumX2), which would
allow us to calculate the regular estimate for the parent.

3) stadistinct - This is quite problematic. We only have the per-child
estimates, and it's not clear if there's any overlap. For now I've just
summed it up, because that's safer / similar to what we do for gather
merge paths etc. Maybe we could improve this by estimating the overlap
somehow (e.g. from MCV lists / histograms). But honestly, I doubt the
estimates based on a tiny sample of each child are any better. I suppose
we could introduce a column option, determining how to combine ndistinct
(similar to how we can override n_distinct itself).

4) MCV - It's trivial to build a new "parent" MCV list, although it may
be too large (in which case we cut it at statistics target, and copy the
remaining bits to the histogram)

5) histograms - A bit more complicated, because it requires dealing with
overlapping bins, so we may have to "cut" and combine them in some way.
If we assume that cutting a bin in K parts means each part has 1/K
tuples (no matter where exactly we cut it), then it's trivial and it
works just fine in practice. That's because with N children, each bin
actually represents 1.0/(target*N) of the tuples, so the errors are
quite limited.
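
To make the simpler rules concrete, here is a minimal standalone sketch
of (1)-(3). All names here are hypothetical - the patch implements the
equivalent logic on the actual pg_statistic tuples, and (4)-(5)
correspond to merge_mcv() / merge_histogram() in the attached patch:

typedef struct ChildStats
{
	double		reltuples;		/* number of rows in the child */
	double		stanullfrac;	/* fraction of NULL values */
	double		stawidth;		/* average value width */
	double		stacorrelation; /* physical vs. logical ordering */
	double		stadistinct;	/* ndistinct (negative = fraction) */
} ChildStats;

static void
merge_scalar_stats(ChildStats *child, int nchildren,
				   double *nullfrac, double *width,
				   double *correlation, double *ndistinct)
{
	double		totaltuples = 0;

	*nullfrac = *width = *correlation = *ndistinct = 0;

	for (int i = 0; i < nchildren; i++)
		totaltuples += child[i].reltuples;

	for (int i = 0; i < nchildren; i++)
	{
		double		w = child[i].reltuples / totaltuples;

		/* (1) nullfrac and width: weighted average by child size */
		*nullfrac += w * child[i].stanullfrac;
		*width += w * child[i].stawidth;

		/* (2) correlation: weighted average, as in the PoC */
		*correlation += w * child[i].stacorrelation;

		/*
		 * (3) ndistinct: sum the per-child estimates, converting the
		 * negative "fraction of rows" form to an absolute count first.
		 * This ignores overlap between children, so it's an upper bound.
		 */
		if (child[i].stadistinct < 0)
			*ndistinct += -child[i].stadistinct * child[i].reltuples;
		else
			*ndistinct += child[i].stadistinct;
	}

	/* switch back to the fractional form if it scales with the rows */
	if (*ndistinct > 0.1 * totaltuples)
		*ndistinct = -(*ndistinct / totaltuples);
}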

The attached patch is a PoC - it should work, but there's plenty of room
for improvement. It only deals with "regular" per-column statistics, not
with extended stats (but I don't see why it wouldn't work for them too).

It adds a new analyze option "MERGE" which does not sample the children
but instead just combines the statistics. So the example from [2] looks
like this:

======================================================================
create table p (i integer, j integer) partition by list (i);

create table p0 partition of p for values in (0);
create table p1 partition of p for values in (1);

insert into p select 0,generate_series(1,1000);
insert into p select 1,generate_series(1,1000);

analyze p0;
analyze p1;

create table q (i integer);
insert into q values (0);
analyze q;

test=# explain select * from q join p using (i) where j
between 1 and 500;
                             QUERY PLAN
---------------------------------------------------------------------
 Hash Join  (cost=1.02..49.82 rows=5 width=8)
   Hash Cond: (p.i = q.i)
   ->  Append  (cost=0.00..45.00 rows=1000 width=8)
         ->  Seq Scan on p0 p_1  (cost=0.00..20.00 rows=500 width=8)
               Filter: ((j >= 1) AND (j <= 500))
         ->  Seq Scan on p1 p_2  (cost=0.00..20.00 rows=500 width=8)
               Filter: ((j >= 1) AND (j <= 500))
   ->  Hash  (cost=1.01..1.01 rows=1 width=4)
         ->  Seq Scan on q  (cost=0.00..1.01 rows=1 width=4)
(9 rows)

test=# analyze (merge) p;
ANALYZE
test=# explain select * from q join p using (i) where j
between 1 and 500;
                             QUERY PLAN
---------------------------------------------------------------------
 Hash Join  (cost=1.02..54.77 rows=500 width=8)
   Hash Cond: (p.i = q.i)
   ->  Append  (cost=0.00..45.00 rows=1000 width=8)
         ->  Seq Scan on p0 p_1  (cost=0.00..20.00 rows=500 width=8)
               Filter: ((j >= 1) AND (j <= 500))
         ->  Seq Scan on p1 p_2  (cost=0.00..20.00 rows=500 width=8)
               Filter: ((j >= 1) AND (j <= 500))
   ->  Hash  (cost=1.01..1.01 rows=1 width=4)
         ->  Seq Scan on q  (cost=0.00..1.01 rows=1 width=4)
(9 rows)

======================================================================

FWIW I'm not sure we need the separate MERGE mode, but it's an easy way
to allow both the regular and "merge" approach, so that it's possible to
experiment and compare the behavior.

One issue is that this would require some coordination between the
parent/child analyzes. Essentially, any time a child is analyzed, the
parent should rebuild the stats (to merge from the new child stats).
This is similar to the issue of analyzing the parent too often because
we don't know when exactly the counters get updated, but it's much
cheaper to merge the stats, so it's much less problematic.

regards

[1]: https://commitfest.postgresql.org/32/2492/

[2]: /messages/by-id/CAM-w4HO9hUHvJDVwQ8=Fgm-znF9WNvQiWsfyBjCr-5FD7gWKGA@mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Combine-statistics-from-child-relations.patch (text/x-patch, charset=UTF-8)
From 423f639cdd6e5d3d1fef67588e230ff7663d4e0f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Sun, 28 Mar 2021 17:43:11 +0200
Subject: [PATCH] Combine statistics from child relations

ANALYZE (MERGE) t;
---
 src/backend/commands/analyze.c | 830 ++++++++++++++++++++++++++++++++-
 src/backend/commands/vacuum.c  |   6 +-
 src/include/commands/vacuum.h  |   1 +
 3 files changed, 834 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index f84616d3d2..dbffabae9f 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -66,6 +66,7 @@
 #include "utils/spccache.h"
 #include "utils/syscache.h"
 #include "utils/timestamp.h"
+#include "utils/typcache.h"
 
 
 /* Per-index data for ANALYZE */
@@ -90,6 +91,10 @@ static void do_analyze_rel(Relation onerel,
 						   VacuumParams *params, List *va_cols,
 						   AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
 						   bool inh, bool in_outer_xact, int elevel);
+static void do_analyze_merge_rel(Relation onerel,
+								 VacuumParams *params, List *va_cols,
+								 AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
+								 bool inh, bool in_outer_xact, int elevel);
 static void compute_index_stats(Relation onerel, double totalrows,
 								AnlIndexData *indexdata, int nindexes,
 								HeapTuple *rows, int numrows,
@@ -125,6 +130,7 @@ analyze_rel(Oid relid, RangeVar *relation,
 	int			elevel;
 	AcquireSampleRowsFunc acquirefunc = NULL;
 	BlockNumber relpages = 0;
+	bool		merge_stats;
 
 	/* Select logging level */
 	if (params->options & VACOPT_VERBOSE)
@@ -132,6 +138,9 @@ analyze_rel(Oid relid, RangeVar *relation,
 	else
 		elevel = DEBUG2;
 
+	/* requested to merge child statistics instead of sampling */
+	merge_stats = (params->options & VACOPT_MERGE_STATS);
+
 	/* Set up static variables */
 	vac_strategy = bstrategy;
 
@@ -263,10 +272,19 @@ analyze_rel(Oid relid, RangeVar *relation,
 
 	/*
 	 * If there are child tables, do recursive ANALYZE.
+	 *
+	 * If ANALYZE (MERGE) was requested, simply fetch and merge statistics
+	 * from the children, instead of sampling the data.
 	 */
 	if (onerel->rd_rel->relhassubclass)
-		do_analyze_rel(onerel, params, va_cols, acquirefunc, relpages,
-					   true, in_outer_xact, elevel);
+	{
+		if (merge_stats)
+			do_analyze_merge_rel(onerel, params, va_cols, acquirefunc, relpages,
+								 true, in_outer_xact, elevel);
+		else
+			do_analyze_rel(onerel, params, va_cols, acquirefunc, relpages,
+						   true, in_outer_xact, elevel);
+	}
 
 	/*
 	 * Close source relation now, but keep lock so that no one deletes it
@@ -802,6 +820,814 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	anl_context = NULL;
 }
 
+/* merging of statistics from children tables (MCV, histogram) */
+
+typedef struct MCVMergeItem {
+	Datum	value;
+	double	nrows;
+} MCVMergeItem;
+
+typedef struct HistogramMergeItem {
+	Datum	minval;
+	Datum	maxval;
+	double	nrows;
+} HistogramMergeItem;
+
+typedef struct merge_context_t {
+	FmgrInfo	opproc;
+	Oid			collation;
+} merge_context_t;
+
+/* compare MCV items by the values */
+static int
+compare_mcv_merge_item(const void *a, const void *b, void *arg)
+{
+	bool		ltcmp;
+	MCVMergeItem *ma = (MCVMergeItem *) a;
+	MCVMergeItem *mb = (MCVMergeItem *) b;
+	merge_context_t *cxt = (merge_context_t *) arg;
+
+	ltcmp = DatumGetBool(FunctionCall2Coll(&cxt->opproc, cxt->collation,
+										   ma->value, mb->value));
+
+	if (ltcmp)
+		return -1;
+
+	ltcmp = DatumGetBool(FunctionCall2Coll(&cxt->opproc, cxt->collation,
+										   mb->value, ma->value));
+
+	if (ltcmp)
+		return 1;
+
+	return 0;
+}
+
+/* sort the MCV items by tuples in descending order */
+static int
+compare_mcv_by_tuples(const void *a, const void *b)
+{
+	MCVMergeItem *ma = (MCVMergeItem *) a;
+	MCVMergeItem *mb = (MCVMergeItem *) b;
+
+	if (ma->nrows > mb->nrows)
+		return -1;
+	else if (ma->nrows < mb->nrows)
+		return 1;
+
+	return 0;
+}
+
+/*
+ * Combine duplicate items from multiple MCV lists, build a new list.
+ *
+ * XXX In principle the size of the MCV list should be determined the same
+ * way analyze_mcv_list() does it for individual tables, but we have only
+ * very unreliable ndistinct estimate. So we simply assume that whatever
+ * got into the per-table MCV list, is worth keeping in the MCV list for
+ * the parent too.
+ */
+static int
+merge_mcv(MCVMergeItem *mcv, int *mcv_items, merge_context_t *cxt)
+{
+	int			i;
+	int			n;
+
+	if (*mcv_items == 0)
+		return 0;
+
+	qsort_arg(mcv, *mcv_items, sizeof(MCVMergeItem),
+			  compare_mcv_merge_item, cxt);
+
+	n = 1;
+	for (i = 1; i < *mcv_items; i++)
+	{
+		if (compare_mcv_merge_item(&mcv[i-1], &mcv[i], cxt) == 0)
+			continue;
+
+		mcv[n] = mcv[i];
+		n++;
+	}
+
+	*mcv_items = n;
+
+	pg_qsort(mcv, n, sizeof(MCVMergeItem), compare_mcv_by_tuples);
+
+	if (*mcv_items > default_statistics_target)
+		n = default_statistics_target;
+
+	return n;
+}
+
+/* compare datum using the comparator / collation */
+static int
+compare_values(const void *a, const void *b, void *arg)
+{
+	bool		ltcmp;
+	merge_context_t *cxt = (merge_context_t *) arg;
+
+	Datum da = * (Datum *) a;
+	Datum db = * (Datum *) b;
+
+	ltcmp = DatumGetBool(FunctionCall2Coll(&cxt->opproc, cxt->collation,
+										   da, db));
+
+	if (ltcmp)
+		return -1;
+
+	ltcmp = DatumGetBool(FunctionCall2Coll(&cxt->opproc, cxt->collation,
+										   db, da));
+
+	if (ltcmp)
+		return 1;
+
+	return 0;
+}
+
+/* merge ranges from multiple histograms, build a new histogram */
+static Datum *
+merge_histogram(HistogramMergeItem *hist, int hist_items, int *nitems,
+				merge_context_t *cxt)
+{
+	int			i;
+	int			n;
+	double		nrows_total;
+	Datum	   *values;
+	int			nvalues;
+	int			new_hist_items;
+	HistogramMergeItem *new_hist;
+	double		step;
+	double		next;
+	double		curr;
+
+	if (hist_items == 0)
+	{
+		*nitems = 0;
+		return NULL;
+	}
+
+	/* total number of rows in the histogram */
+	nrows_total = 0;
+	for (i = 0; i < hist_items; i++)
+		nrows_total += hist[i].nrows;
+
+	/* construct non-overlapping intervals - we don't split */
+	nvalues = 0;
+	values = (Datum *) palloc(2 * hist_items * sizeof(Datum));
+	for (i = 0; i < hist_items; i++)
+	{
+		values[nvalues++] = hist[i].minval;
+		values[nvalues++] = hist[i].maxval;
+	}
+
+	qsort_arg(values, nvalues, sizeof(Datum), compare_values, cxt);
+
+	n = 1;
+	for (i = 1; i < nvalues; i++)
+	{
+		if (compare_values(&values[i-1], &values[i], cxt) == 0)
+			continue;
+
+		values[n++] = values[i];
+	}
+	nvalues = n;
+
+	/* transform the values into histogram */
+	new_hist_items  = (nvalues - 1);
+	new_hist = (HistogramMergeItem *) palloc0(sizeof(HistogramMergeItem) * new_hist_items);
+	for (i = 0; i < new_hist_items; i++)
+	{
+		new_hist[i].minval = values[i];
+		new_hist[i].maxval = values[i+1];
+	}
+
+	/* redistribute the rows from the old into the new histogram */
+	for (i = 0; i < hist_items; i++)
+	{
+		int	startidx = 0;
+		int	endidx = 0;
+		double	nrows;
+		int		j;
+
+		/* find the first matching bin in the new histogram - loop until we
+		 * find the first bin after it, then go one step back */
+		while (true)
+		{
+			if ((startidx == new_hist_items) ||
+				compare_values(&hist[i].minval, &new_hist[startidx].minval, cxt) < 0)
+			{
+				startidx--;
+				break;
+			}
+
+			startidx++;
+		}
+
+		Assert(startidx >= 0);
+		Assert(startidx < new_hist_items);
+
+		/* find the last matching bin in the new histogram - loop until we
+		 * find the first bin after it, then go one step back */
+		while (true)
+		{
+			if ((endidx == new_hist_items) ||
+				compare_values(&hist[i].maxval, &new_hist[endidx].minval, cxt) < 0)
+			{
+				endidx--;
+				break;
+			}
+
+			endidx++;
+		}
+
+		Assert(endidx >= startidx);
+		Assert(startidx < new_hist_items);
+
+		/*
+		 * Redistribute the counts between the matching ranges equally.
+		 *
+		 * XXX We could do better for data types with "distance", maybe?
+		 */
+		nrows = hist[i].nrows / (endidx - startidx + 1);
+
+		for (j = startidx; j <= endidx; j++)
+			new_hist[j].nrows += nrows;
+	}
+
+	nrows_total = 0;
+	for (i = 0; i < new_hist_items; i++)
+		nrows_total += new_hist[i].nrows;
+
+	step = (nrows_total / default_statistics_target);
+	next = step;
+	curr = 0;
+
+	values[0] = new_hist[0].minval;
+	n = 1;
+
+	for (i = 0; i < new_hist_items; i++)
+	{
+		curr += new_hist[i].nrows;
+
+		/* not reached the next threshold */
+		if (curr < next)
+			continue;
+
+		while (curr >= next)
+		{
+			values[n++] = new_hist[i].maxval;
+			next += step;
+		}
+	}
+
+	*nitems = n;
+
+	return values;
+}
+
+/*
+ *	do_analyze_merge_rel() -- analyze partitioned relation by merging stats
+ * from child tables
+ */
+static void
+do_analyze_merge_rel(Relation onerel, VacuumParams *params,
+					 List *va_cols, AcquireSampleRowsFunc acquirefunc,
+					 BlockNumber relpages, bool inh, bool in_outer_xact,
+					 int elevel)
+{
+	int			attr_cnt,
+				tcnt,
+				i;
+	VacAttrStats **vacattrstats;
+	PGRUsage	ru0;
+	TimestampTz starttime = 0;
+	MemoryContext caller_context;
+	Oid			save_userid;
+	int			save_sec_context;
+	int			save_nestlevel;
+	int64		AnalyzePageHit = VacuumPageHit;
+	int64		AnalyzePageMiss = VacuumPageMiss;
+	int64		AnalyzePageDirty = VacuumPageDirty;
+	PgStat_Counter startreadtime = 0;
+	PgStat_Counter startwritetime = 0;
+	List	   *children;
+	Relation   *rels;
+	int			nrels;
+	ListCell   *lc;
+
+	double	   *relblocks;
+	double	   *reltuples;
+	double		totalblocks;
+	double		totaltuples;
+
+	/* pointers to pg_statistic tuples */
+	HeapTuple  *statistic_tuples;
+
+	/* partitioned tables / inheritance trees only */
+	Assert(inh);
+
+	/*
+	 * Set up a working context so that we can easily free whatever junk gets
+	 * created.
+	 */
+	anl_context = AllocSetContextCreate(CurrentMemoryContext,
+										"Analyze",
+										ALLOCSET_DEFAULT_SIZES);
+	caller_context = MemoryContextSwitchTo(anl_context);
+
+	/*
+	 * Switch to the table owner's userid, so that any index functions are run
+	 * as that user.  Also lock down security-restricted operations and
+	 * arrange to make GUC variable changes local to this command.
+	 *
+	 * XXX Probably not needed, we're not going to run any functions.
+	 */
+	GetUserIdAndSecContext(&save_userid, &save_sec_context);
+	SetUserIdAndSecContext(onerel->rd_rel->relowner,
+						   save_sec_context | SECURITY_RESTRICTED_OPERATION);
+	save_nestlevel = NewGUCNestLevel();
+
+	/* measure elapsed time iff autovacuum logging requires it */
+	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
+	{
+		if (track_io_timing)
+		{
+			startreadtime = pgStatBlockReadTime;
+			startwritetime = pgStatBlockWriteTime;
+		}
+
+		pg_rusage_init(&ru0);
+		if (params->log_min_duration >= 0)
+			starttime = GetCurrentTimestamp();
+	}
+
+	/*
+	 * Determine which columns to analyze
+	 *
+	 * Note that system attributes are never analyzed, so we just reject them
+	 * at the lookup stage.  We also reject duplicate column mentions.  (We
+	 * could alternatively ignore duplicates, but analyzing a column twice
+	 * won't work; we'd end up making a conflicting update in pg_statistic.)
+	 */
+	if (va_cols != NIL)
+	{
+		Bitmapset  *unique_cols = NULL;
+		ListCell   *le;
+
+		vacattrstats = (VacAttrStats **) palloc(list_length(va_cols) *
+												sizeof(VacAttrStats *));
+		tcnt = 0;
+		foreach(le, va_cols)
+		{
+			char	   *col = strVal(lfirst(le));
+
+			i = attnameAttNum(onerel, col, false);
+			if (i == InvalidAttrNumber)
+				ereport(ERROR,
+						(errcode(ERRCODE_UNDEFINED_COLUMN),
+						 errmsg("column \"%s\" of relation \"%s\" does not exist",
+								col, RelationGetRelationName(onerel))));
+			if (bms_is_member(i, unique_cols))
+				ereport(ERROR,
+						(errcode(ERRCODE_DUPLICATE_COLUMN),
+						 errmsg("column \"%s\" of relation \"%s\" appears more than once",
+								col, RelationGetRelationName(onerel))));
+			unique_cols = bms_add_member(unique_cols, i);
+
+			vacattrstats[tcnt] = examine_attribute(onerel, i, NULL);
+			if (vacattrstats[tcnt] != NULL)
+				tcnt++;
+		}
+		attr_cnt = tcnt;
+	}
+	else
+	{
+		attr_cnt = onerel->rd_att->natts;
+		vacattrstats = (VacAttrStats **)
+			palloc(attr_cnt * sizeof(VacAttrStats *));
+		tcnt = 0;
+		for (i = 1; i <= attr_cnt; i++)
+		{
+			vacattrstats[tcnt] = examine_attribute(onerel, i, NULL);
+			if (vacattrstats[tcnt] != NULL)
+				tcnt++;
+		}
+		attr_cnt = tcnt;
+	}
+
+	/*
+	 * XXX We don't want to touch the parent's indexes at all.
+	 */
+
+	/* find all children of the relation (includes the rel itself) */
+	children = find_all_inheritors(RelationGetRelid(onerel), AccessShareLock, NULL);
+
+	/* filter interesting relations (leafs only) */
+	rels = (Relation *) palloc(list_length(children) * sizeof(Relation));
+	relblocks = (double *) palloc(list_length(children) * sizeof(double));
+	reltuples = (double *) palloc(list_length(children) * sizeof(double));
+	totalblocks = 0;
+	totaltuples = 0;
+
+	nrels = 0;
+
+	foreach(lc, children)
+	{
+		Oid			childoid = lfirst_oid(lc);
+		Relation	childrel;
+
+		/* We already got the needed lock */
+		childrel = table_open(childoid, NoLock);
+
+		/* Ignore if temp table of another backend */
+		if (RELATION_IS_OTHER_TEMP(childrel))
+		{
+			/* ... but release the lock on it */
+			Assert(childrel != onerel);
+			table_close(childrel, AccessShareLock);
+			continue;
+		}
+
+		/*
+		 * ignore anything but valid leaf relatiins with data, but release
+		 * the lock on it.  don't try to unlock the passed-in relation
+		 */
+		if (childrel->rd_rel->relkind != RELKIND_RELATION &&
+			childrel->rd_rel->relkind != RELKIND_MATVIEW &&
+			childrel->rd_rel->relkind != RELKIND_FOREIGN_TABLE)
+		{
+			Assert(childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+			if (childrel != onerel)
+				table_close(childrel, AccessShareLock);
+			else
+				table_close(childrel, NoLock);
+			continue;
+		}
+
+		/* OK, we'll process this child */
+		rels[nrels] = childrel;
+		relblocks[nrels] = (double) RelationGetNumberOfBlocks(childrel);
+		reltuples[nrels] = (double) childrel->rd_rel->reltuples;
+		totalblocks += relblocks[nrels];
+		totaltuples += reltuples[nrels];
+		nrels++;
+	}
+
+	statistic_tuples = (HeapTuple *) palloc(list_length(children) * sizeof(HeapTuple));
+
+	/* FIXME do the merge here */
+	for (i = 0; i < attr_cnt; i++)
+	{
+		int		j;
+		float4	r_correlation,
+				r_nullfrac;
+		int		slot_idx;
+		int64	r_stawidth = 0;
+		double	r_ndistinct = 0;
+
+		MCVMergeItem *mcv;
+		int		max_mcv_items = 64;
+		int		mcv_items = 0;
+		int		mcv_size;
+
+		HistogramMergeItem *hist;
+		int		max_hist_items = 64;
+		int		hist_items = 0;
+
+		Datum  *values;
+		int		nvalues;
+
+		TypeCacheEntry *typcache;
+		merge_context_t cxt;
+
+		VacAttrStats *stats = vacattrstats[i];
+
+		Oid		ltopr,
+				eqopr;
+
+		/* Look for default "<" and "=" operators for column's type */
+		get_sort_group_operators(stats->attrtypid,
+								 false, false, false,
+								 &ltopr, &eqopr, NULL,
+								 NULL);
+
+		typcache = lookup_type_cache(vacattrstats[i]->attr->atttypid,
+									 TYPECACHE_LT_OPR);
+
+		mcv = (MCVMergeItem *) palloc(sizeof(MCVMergeItem) * max_mcv_items);
+		hist = (HistogramMergeItem *) palloc(sizeof(HistogramMergeItem) * max_hist_items);
+
+		r_correlation = 0;
+		r_nullfrac = 0;
+
+		/* fetch stats for the attribute from all children */
+		for (j = 0; j < nrels; j++)
+		{
+			HeapTuple htup;
+			Form_pg_statistic stats;
+			double nullfrac;
+			double mcvfrac;
+
+			statistic_tuples[j] = NULL;
+
+			htup = SearchSysCache3(STATRELATTINH,
+								   ObjectIdGetDatum(RelationGetRelid(rels[j])),
+								   Int16GetDatum(vacattrstats[i]->attr->attnum),
+								   BoolGetDatum(false));
+
+			if (!HeapTupleIsValid(htup))
+			{
+				elog(WARNING, "stats for %d %d not found",
+					 RelationGetRelid(rels[j]), vacattrstats[i]->attr->attnum);
+				continue;
+			}
+
+			stats = (Form_pg_statistic) GETSTRUCT(htup);
+
+			nullfrac = stats->stanullfrac;
+			mcvfrac = 0;
+
+			/* increment the result null fraction */
+			r_nullfrac += (nullfrac * reltuples[j]);
+
+			/* increment the average width */
+			r_stawidth += (stats->stawidth * reltuples[j]);
+
+			/*
+			 * XXX We sum the ndistinct for child tables. This is probably
+			 * overkill, because the child tables likely share at least some
+			 * values (except for the partition key case). But it's similar
+			 * to what we do to estimate number of groups before calling
+			 * create_gather_merge_path, etc.
+			 */
+			if (stats->stadistinct < 0)
+				r_ndistinct += -(stats->stadistinct * reltuples[j]);
+			else
+				r_ndistinct += stats->stadistinct;
+
+			/* FIXME do ACL checks here */
+
+			/* extract stats about the attribute */
+			{
+				AttStatsSlot sslot;
+
+				/* expand the MCV list */
+				if (get_attstatsslot(&sslot, htup,
+									 STATISTIC_KIND_MCV, InvalidOid,
+									 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS))
+				{
+					int k;
+
+					for (k = 0; k < sslot.nvalues; k++)
+					{
+						if (max_mcv_items == mcv_items)
+						{
+							max_mcv_items *= 2;
+							mcv = repalloc(mcv, sizeof(MCVMergeItem) * max_mcv_items);
+						}
+
+						mcvfrac += sslot.numbers[k];
+
+						mcv[mcv_items].value = sslot.values[k];
+						mcv[mcv_items].nrows = (sslot.numbers[k] * reltuples[j]);
+						mcv_items++;
+					}
+				}
+
+				/* expand the histogram */
+				if (get_attstatsslot(&sslot, htup,
+									 STATISTIC_KIND_HISTOGRAM, InvalidOid,
+									 ATTSTATSSLOT_VALUES))
+				{
+					int k;
+					/* one histogram bin represents this fraction of rows */
+					double	frac = (1 - mcvfrac - nullfrac) / (sslot.nvalues - 1);
+
+					for (k = 0; k < sslot.nvalues - 1; k++)
+					{
+						if (max_hist_items == hist_items)
+						{
+							max_hist_items *= 2;
+							hist = repalloc(hist, sizeof(HistogramMergeItem) * max_hist_items);
+						}
+
+						hist[hist_items].minval = sslot.values[k];
+						hist[hist_items].maxval = sslot.values[k+1];
+						hist[hist_items].nrows = (frac * reltuples[j]);
+						hist_items++;
+					}
+				}
+
+				/* expand the correlation */
+				if (get_attstatsslot(&sslot, htup,
+									 STATISTIC_KIND_CORRELATION, InvalidOid,
+									 ATTSTATSSLOT_NUMBERS))
+					r_correlation += sslot.numbers[0] * (reltuples[j] / totaltuples);
+
+				/* FIXME handle the other stakind types (array elements etc.) */
+			}
+
+			statistic_tuples[j] = htup;
+		}
+
+		fmgr_info(get_opcode(typcache->lt_opr), &cxt.opproc);
+		cxt.collation = stats->attrcollid;
+
+		/* merge values in the MCV lists (if any) */
+		mcv_size = merge_mcv(mcv, &mcv_items, &cxt);
+
+		/*
+		 * If there are MCV items that did not make it to the combined MCV list,
+		 * add them to the histogram as ranges.
+		 */
+		for (j = mcv_size; j < mcv_items; j++)
+		{
+			if (max_hist_items == hist_items)
+			{
+				max_hist_items *= 2;
+				hist = repalloc(hist, sizeof(HistogramMergeItem) * max_hist_items);
+			}
+
+			hist[hist_items].minval = mcv[j].value;
+			hist[hist_items].maxval = mcv[j].value;
+			hist[hist_items].nrows = mcv[j].nrows;
+			hist_items++;
+		}
+
+		/* merge histograms and remaining parts of MCV */
+		values = merge_histogram(hist, hist_items, &nvalues, &cxt);
+
+		slot_idx = 0;
+
+		if (mcv_size > 0)
+		{
+			Datum  *mcv_values = (Datum *) palloc(sizeof(Datum) * mcv_size);
+			float4 *mcv_freqs  = (float4 *) palloc(sizeof(float4) * mcv_size);
+
+			for (j = 0; j < mcv_size; j++)
+			{
+				mcv_values[j] = mcv[j].value;
+				mcv_freqs[j] = (mcv[j].nrows / totaltuples);
+			}
+
+			stats->stakind[slot_idx] = STATISTIC_KIND_MCV;
+			stats->staop[slot_idx] = eqopr;
+			stats->stacoll[slot_idx] = stats->attrcollid;
+			stats->stanumbers[slot_idx] = mcv_freqs;
+			stats->numnumbers[slot_idx] = mcv_size;
+			stats->stavalues[slot_idx] = mcv_values;
+			stats->numvalues[slot_idx] = mcv_size;
+			slot_idx++;
+		}
+
+		if (nvalues > 0)
+		{
+			stats->stakind[slot_idx] = STATISTIC_KIND_HISTOGRAM;
+			stats->staop[slot_idx] = ltopr;
+			stats->stacoll[slot_idx] = stats->attrcollid;
+			stats->stavalues[slot_idx] = values;
+			stats->numvalues[slot_idx] = nvalues;
+			slot_idx++;
+		}
+
+		{
+			float4 *corrs = (float4 *) palloc(sizeof(float4));
+			corrs[0] = r_correlation;
+
+			stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION;
+			stats->staop[slot_idx] = ltopr;
+			stats->stacoll[slot_idx] = stats->attrcollid;
+			stats->stanumbers[slot_idx] = corrs;
+			stats->numnumbers[slot_idx] = 1;
+			slot_idx++;
+		}
+
+		stats->stanullfrac = r_nullfrac / totaltuples;
+		stats->stawidth = r_stawidth / totaltuples;
+
+		if (r_ndistinct > 0.1 * totaltuples)
+			stats->stadistinct = -(r_ndistinct / totaltuples);
+		else
+			stats->stadistinct = r_ndistinct;
+
+		stats->stats_valid = true;
+
+		/* update the statistics */
+		update_attstats(RelationGetRelid(onerel), inh,
+						1, &stats);
+
+		/* release the stat tuples (entries may be NULL if stats were missing) */
+		for (j = 0; j < nrels; j++)
+			if (statistic_tuples[j] != NULL) ReleaseSysCache(statistic_tuples[j]);
+	}
+
+	/* close the remaining rels */
+	for (i = 0; i < nrels; i++)
+		table_close(rels[i], NoLock);
+
+	pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE,
+								 PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE);
+
+	/* Log the action if appropriate */
+	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
+	{
+		TimestampTz endtime = GetCurrentTimestamp();
+
+		if (params->log_min_duration == 0 ||
+			TimestampDifferenceExceeds(starttime, endtime,
+									   params->log_min_duration))
+		{
+			long		delay_in_ms;
+			double		read_rate = 0;
+			double		write_rate = 0;
+			StringInfoData buf;
+
+			/*
+			 * Calculate the difference in the Page Hit/Miss/Dirty that
+			 * happened as part of the analyze by subtracting out the
+			 * pre-analyze values which we saved above.
+			 */
+			AnalyzePageHit = VacuumPageHit - AnalyzePageHit;
+			AnalyzePageMiss = VacuumPageMiss - AnalyzePageMiss;
+			AnalyzePageDirty = VacuumPageDirty - AnalyzePageDirty;
+
+			/*
+			 * We do not expect an analyze to take > 25 days and it simplifies
+			 * things a bit to use TimestampDifferenceMilliseconds.
+			 */
+			delay_in_ms = TimestampDifferenceMilliseconds(starttime, endtime);
+
+			/*
+			 * Note that we are reporting these read/write rates in the same
+			 * manner as VACUUM does, which means that while the 'average read
+			 * rate' here actually corresponds to page misses and resulting
+			 * reads which are also picked up by track_io_timing, if enabled,
+			 * the 'average write rate' is actually talking about the rate of
+			 * pages being dirtied, not being written out, so it's typical to
+			 * have a non-zero 'avg write rate' while I/O Timings only reports
+			 * reads.
+			 *
+			 * It's not clear that an ANALYZE will ever result in
+			 * FlushBuffer() being called, but we track and support reporting
+			 * on I/O write time in case that changes as it's practically free
+			 * to do so anyway.
+			 */
+
+			if (delay_in_ms > 0)
+			{
+				read_rate = (double) BLCKSZ * AnalyzePageMiss / (1024 * 1024) /
+					(delay_in_ms / 1000.0);
+				write_rate = (double) BLCKSZ * AnalyzePageDirty / (1024 * 1024) /
+					(delay_in_ms / 1000.0);
+			}
+
+			/*
+			 * We split this up so we don't emit empty I/O timing values when
+			 * track_io_timing isn't enabled.
+			 */
+
+			initStringInfo(&buf);
+			appendStringInfo(&buf, _("automatic analyze of table \"%s.%s.%s\"\n"),
+							 get_database_name(MyDatabaseId),
+							 get_namespace_name(RelationGetNamespace(onerel)),
+							 RelationGetRelationName(onerel));
+			appendStringInfo(&buf, _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
+							 (long long) AnalyzePageHit,
+							 (long long) AnalyzePageMiss,
+							 (long long) AnalyzePageDirty);
+			appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
+							 read_rate, write_rate);
+			if (track_io_timing)
+			{
+				appendStringInfoString(&buf, _("I/O Timings:"));
+				if (pgStatBlockReadTime - startreadtime > 0)
+					appendStringInfo(&buf, _(" read=%.3f"),
+									 (double) (pgStatBlockReadTime - startreadtime) / 1000);
+				if (pgStatBlockWriteTime - startwritetime > 0)
+					appendStringInfo(&buf, _(" write=%.3f"),
+									 (double) (pgStatBlockWriteTime - startwritetime) / 1000);
+				appendStringInfoChar(&buf, '\n');
+			}
+			appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));
+
+			ereport(LOG,
+					(errmsg_internal("%s", buf.data)));
+
+			pfree(buf.data);
+		}
+	}
+
+	/* Roll back any GUC changes executed by index functions */
+	AtEOXact_GUC(false, save_nestlevel);
+
+	/* Restore userid and security context */
+	SetUserIdAndSecContext(save_userid, save_sec_context);
+
+	/* Restore current context and release memory */
+	MemoryContextSwitchTo(caller_context);
+	MemoryContextDelete(anl_context);
+	anl_context = NULL;
+}
+
 /*
  * Compute statistics about indexes of a relation
  */
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c064352e23..e2cd2ee317 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -103,6 +103,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 	bool		analyze = false;
 	bool		freeze = false;
 	bool		full = false;
+	bool		merge_stats = false;
 	bool		disable_page_skipping = false;
 	bool		process_toast = true;
 	ListCell   *lc;
@@ -124,6 +125,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 			verbose = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "skip_locked") == 0)
 			skip_locked = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "merge") == 0)
+			merge_stats = defGetBoolean(opt);
 		else if (!vacstmt->is_vacuumcmd)
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -193,7 +196,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		(freeze ? VACOPT_FREEZE : 0) |
 		(full ? VACOPT_FULL : 0) |
 		(disable_page_skipping ? VACOPT_DISABLE_PAGE_SKIPPING : 0) |
-		(process_toast ? VACOPT_PROCESS_TOAST : 0);
+		(process_toast ? VACOPT_PROCESS_TOAST : 0) |
+		(merge_stats ? VACOPT_MERGE_STATS : 0);
 
 	/* sanity checks on options */
 	Assert(params.options & (VACOPT_VACUUM | VACOPT_ANALYZE));
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d029da5ac0..688f475162 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -183,6 +183,7 @@ typedef struct VacAttrStats
 #define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
 #define VACOPT_PROCESS_TOAST 0x40	/* process the TOAST table, if any */
 #define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip any pages */
+#define VACOPT_MERGE_STATS 0x0100	/* merge statistics from children */
 
 /*
  * A ternary value used by vacuum parameters.
-- 
2.30.2

#2Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#1)
Re: Merging statistics from children instead of re-sampling everything

Thanks for taking a fresh look at this.

As you've written it, this can apply to either/both partitioned or inheritance.
I imagine when "MERGE" goes away, this should apply only to partitioned tables.
(Actually, personally I would advocate to consider applying it to *both*, but I
don't think that's been the tendency over the last 4 years. I wrote here about
some arguably-gratuitous differences between inheritance and partitioning.
/messages/by-id/20180601221428.GU5164@telsasoft.com)

+     if (*mcv_items > default_statistics_target)
+             n = default_statistics_target;

It should use any non-default stats target of the parent's column

+ * ignore anything but valid leaf relatiins with data, but release

sp: relatiins.

+				elog(WARNING, "stats for %d %d not found",
+                                        RelationGetRelid(rels[j]), vacattrstats[i]->attr->attnum);

should be %u %d

This code duplication is undesirable:


+ /* Log the action if appropriate */
+ * Determine which columns to analyze

#3Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Justin Pryzby (#2)
Re: Merging statistics from children instead of re-sampling everything

On 3/29/21 8:36 PM, Justin Pryzby wrote:

Thanks for taking a fresh look at this.

As you've written it, this can apply to either/both partitioned or inheritance.
I imagine when "MERGE" goes away, this should apply only to partitioned tables.
(Actually, personally I would advocate to consider applying it to *both*, but I
don't think that's been the tendency over the last 4 years. I wrote here about
some arguably-gratuitous differences between inheritance and partitioning.
/messages/by-id/20180601221428.GU5164@telsasoft.com)

Haven't thought about that too much at this point, but I don't see any
reason not to apply it to both cases. I might be missing something,
but the fact that with declarative partitioning we analyze the children,
while with inheritance we don't, seems a bit strange. Not sure.

+     if (*mcv_items > default_statistics_target)
+             n = default_statistics_target;

It should use any non-default stats target of the parent's column

Yeah. That's a simplification - the non-PoC code would have to look at
the per-column statistics targets of the target relation / all the
children, and make some calculation based on that. I ignored that in the
PoC.

+ * ignore anything but valid leaf relatiins with data, but release

sp: relatiins.

+				elog(WARNING, "stats for %d %d not found",
+                                        RelationGetRelid(rels[j]), vacattrstats[i]->attr->attnum);

should be %u %d

This code duplication is undesirable:

+ /* Log the action if appropriate */
+ * Determine which columns to analyze

Yeah. Fine for a PoC, but it needs to be cleaned up in a future patch.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#3)
Re: Merging statistics from children instead of re-sampling everything

On 3/29/21 9:24 PM, Tomas Vondra wrote:

On 3/29/21 8:36 PM, Justin Pryzby wrote:

Thanks for taking a fresh look at this.

As you've written it, this can apply to either/both partitioned or inheritance.
I imagine when "MERGE" goes away, this should apply only to partitioned tables.
(Actually, personally I would advocate to consider applying it to *both*, but I
don't think that's been the tendency over the last 4 years. I wrote here about
some arguably-gratuitous differences between inheritance and partitioning.
/messages/by-id/20180601221428.GU5164@telsasoft.com)

Haven't thought about that too much at this point, but I don't see any
reason not to apply it to both cases. I might be missing something,
but the fact that with declarative partitioning we analyze the children,
while with inheritance we don't, seems a bit strange. Not sure.

BTW I'm not sure the "merge" will / should go away. What I'd expect is
that we'll keep it, and you can do either "ANALYZE" or "ANALYZE
(MERGE)". The former does regular sampling, while the latter does the
proposed merging of stats.

For the autovacuum, I think a new autovacuum_analyze_merge GUC and
reloption would work - chances are people will want to set the default
and then maybe enable merging only for some cases. Not sure.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#1)
Re: Merging statistics from children instead of re-sampling everything

Hi,

I'd like to point out two less obvious things, about how this relates to
Tom's proposal [1] and patch [2] from 2015. Tom approached the problem
from a different direction, essentially allowing a Var to be associated
with a list of statistics instead of just one.

So it's a somewhat orthogonal solution, and it has pros and cons. The
pro is that it can ignore statistics for eliminated partitions, thus
producing better estimates. The con is that it requires all the places
dealing with VariableStatData to assume there's a list, not just one,
making the code more complex and more CPU expensive (with sufficiently
many partitions).

However, it seems to me we could easily combine those two things - we
can merge the statistics (the way I proposed here), so that each Var has
still just one VariableStatData. That'd mean the various places don't
need to start dealing with a list, and it'd still allow ignoring stats
for eliminated partitions.

Of course, that assumes the merge is cheaper than processing the list of
statistics, but I find that plausible, especially if the list needs to be
processed multiple times (e.g. when considering different join orders,
filters and so on).

Haven't tried, but might be worth exploring in the future.

regards

[1]: /messages/by-id/7363.1426537103@sss.pgh.pa.us

[2]: /messages/by-id/22598.1425686096@sss.pgh.pa.us

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Andrey Lepikhov
a.lepikhov@postgrespro.ru
In reply to: Tomas Vondra (#1)
Re: Merging statistics from children instead of re-sampling everything

Sorry, I forgot to send CC into pgsql-hackers.
On 29/6/21 13:23, Tomas Vondra wrote:

Because sampling is fairly expensive, especially if you have to do it
for a large number of child relations. And you'd have to do that every
time *any* child triggers autovacuum, pretty much. Merging the stats is
way cheaper.

See the other thread linked from the first message.

Maybe I couldn't describe my idea clearly.
Partitioning is most commonly used for large tables.
I propose to store a sampling reservoir for each partition, replace it on
each statistics update, and merge the reservoirs to build statistics for
the parent table. The reservoirs can be spilled into a tuplestore on
disk, or stored in the parent table.
In the case of complex inheritance we can store sampling reservoirs only
for the leaves.
You can consider this idea just a rough sketch, but the statistics-merging
approach has an extensibility problem with other types of statistics.
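
A minimal sketch of how the cached reservoirs might then be combined
(hypothetical names, not from any patch): because each cached child
sample is already a uniform random sample, a uniform parent sample can
be built by taking from each child a slice proportional to the child's
share of the total rows:

typedef int SampleRow;			/* stand-in for a sampled tuple */

static int
merge_child_samples(SampleRow **samples, int *nsamples,
					double *reltuples, int nchildren,
					SampleRow *out, int targrows)
{
	double		totaltuples = 0;
	int			nout = 0;

	for (int i = 0; i < nchildren; i++)
		totaltuples += reltuples[i];

	for (int i = 0; i < nchildren; i++)
	{
		/* the child's proportional share of the parent sample */
		int			k = (int) (targrows * reltuples[i] / totaltuples);

		/* can't take more rows than the cached reservoir holds */
		if (k > nsamples[i])
			k = nsamples[i];

		for (int j = 0; j < k; j++)
			out[nout++] = samples[i][j];
	}

	return nout;
}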

On 6/29/21 9:01 AM, Andrey Lepikhov wrote:

On 30/3/21 03:51, Tomas Vondra wrote:

Of course, that assumes the merge is cheaper than processing the list of
statistics, but I find that plausible, especially if the list needs to be
processed multiple times (e.g. when considering different join orders,
filters and so on).

I think your approach has a chance. But I didn't understand: why do
you merge the statistics? I think we could merge only the samples of
each child and build the statistics as usual.
The error of such a sample-merging procedure would be quite limited.

--
regards,
Andrey Lepikhov
Postgres Professional

#7Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andrey Lepikhov (#6)
Re: Merging statistics from children instead of re-sampling everything

On 6/30/21 2:55 PM, Andrey Lepikhov wrote:

Sorry, I forgot to send CC into pgsql-hackers.
On 29/6/21 13:23, Tomas Vondra wrote:

Because sampling is fairly expensive, especially if you have to do it
for a large number of child relations. And you'd have to do that every
time *any* child triggers autovacuum, pretty much. Merging the stats
is way cheaper.

See the other thread linked from the first message.

Maybe I couldn't describe my idea clearly.
Partitioning is most commonly used for large tables.
I propose to store a sampling reservoir for each partition, replace it on
each statistics update, and merge the reservoirs to build statistics for
the parent table. The reservoirs can be spilled into a tuplestore on
disk, or stored in the parent table.
In the case of complex inheritance we can store sampling reservoirs only
for the leaves.
You can consider this idea just a rough sketch, but the statistics-merging
approach has an extensibility problem with other types of statistics.

Well, yeah - we might try that too, of course. This is simply exploring
the "merge statistics" idea from [1], which is why it does not even
attempt to do what you suggested. We may explore the approach with
keeping per-partition samples, of course.

You're right that maintaining per-partition samples and merging those might
solve (or at least reduce) some of the problems, e.g. eliminating most
of the I/O that'd be needed for sampling. And yeah, it's not entirely
clear how to merge some of the statistics types (like ndistinct). But
for a lot of the basic stats it works quite nicely, I think.

I'm sure there'll be some complexity due to handling large / toasted
values, etc. And we probably need to design this for large hierarchies
(IMHO it should work with 10k partitions, not just 100), in which case
it may still be quite a bit more expensive than merging the stats.

So maybe we should really support both, and combine them somehow?

regards

[1]: /messages/by-id/CAM-w4HO9hUHvJDVwQ8=Fgm-znF9WNvQiWsfyBjCr-5FD7gWKGA@mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#7)
Re: Merging statistics from children instead of re-sampling everything

On 6/30/21 17:15, Tomas Vondra wrote:

On 6/30/21 2:55 PM, Andrey Lepikhov wrote:

Sorry, I forgot to send CC into pgsql-hackers.
On 29/6/21 13:23, Tomas Vondra wrote:

Because sampling is fairly expensive, especially if you have to do it
for a large number of child relations. And you'd have to do that every
time *any* child triggers autovacuum, pretty much. Merging the stats
is way cheaper.

See the other thread linked from the first message.

Maybe I couldn't describe my idea clearly.
Partitioning is most commonly used for large tables.
I propose to store a sampling reservoir for each partition, replace it on
each statistics update, and merge the reservoirs to build statistics for
the parent table. The reservoirs can be spilled into a tuplestore on
disk, or stored in the parent table.
In the case of complex inheritance we can store sampling reservoirs only
for the leaves.
You can consider this idea just a rough sketch, but the statistics-merging
approach has an extensibility problem with other types of statistics.

Well, yeah - we might try that too, of course. This is simply exploring
the "merge statistics" idea from [1], which is why it does not even
attempt to do what you suggested. We may explore the approach with
keeping per-partition samples, of course.

You're right that maintaining per-partition samples and merging those might
solve (or at least reduce) some of the problems, e.g. eliminating most
of the I/O that'd be needed for sampling. And yeah, it's not entirely
clear how to merge some of the statistics types (like ndistinct). But
for a lot of the basic stats it works quite nicely, I think.

I'm sure there'll be some complexity due to handling large / toasted
values, etc. And we probably need to design this for large hierarchies
(IMHO it should work with 10k partitions, not just 100), in which case
it may still be quite a bit more expensive than merging the stats.

So maybe we should really support both, and combine them somehow?

I've been thinking about this PoC patch regularly since I submitted it a
year ago, and I still think merging the statistics is an interesting
idea. It may be a huge win in various scenarios, like:

1) Multi-level partitioning hierarchies, where analyze of each level has
to re-sample all the leaf relations, causing a lot of I/O.

2) Partitioning with foreign/remote partitions, where analyze has to
retrieve significant amounts of data from the remote node over network
(so a different kind of I/O).

These issues will only get worse as the number of partitions used by
systems in the wild grows - we continuously reduce the per-partition
overhead, so people are likely to leverage that by using more of them.

But I don't have a very good idea what to do about statistics that we
can't really merge. For some types of statistics it's rather tricky to
reasonably merge the results - ndistinct is a simple example, although
we could work around that by building and merging hyperloglog counters.

What's trickier are extended statistics - I can't quite imagine merging
of functional dependencies, mvdistinct, etc. So if there are extended
stats we'd have to do the sampling anyway. (Some of the extended
statistics can also be defined only on the parent relation, in which
case we have nothing to merge. But that seems like a rare case.)

In any case, I don't have a very good plan how to move this patch
forward, so unless people have some interesting ideas I'll mark it as
returned with feedback. It's been lurking in the CF for ~1 year ...

However, it'd be a mistake to also discard the approach proposed by
Andrey Lepikhov - storing samples for individual relations, and then
just using those while analyzing the parent relations. That is more
expensive, but it does not have the issues with merging some types of
statistics, and so on.

It also seems interesting for the dynamic sampling PoC patch [1],
which does sampling as part of the query planning. In that context the
cost of collecting the sample is clearly a major issue, and storing the
sample somewhere would help a lot. But the question is how/where to
store it. Joins are another issue, because we can't just build two
random samples. But let's discuss that in the other thread [1].

[1]: https://commitfest.postgresql.org/36/3211/

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Andrey Lepikhov
a.lepikhov@postgrespro.ru
In reply to: Tomas Vondra (#8)
Re: Merging statistics from children instead of re-sampling everything

On 21/1/2022 01:25, Tomas Vondra wrote:

But I don't have a very good idea what to do about statistics that we
can't really merge. For some types of statistics it's rather tricky to
reasonably merge the results - ndistinct is a simple example, although
we could work around that by building and merging hyperloglog counters.

I think, as a first step, we can reduce the number of pulled tuples.
We don't really need to pull all the tuples from a remote server - to
construct a reservoir, we can pull only a sample. The reservoir method
needs only a few parameters to return a sample, just as when reading
tuples locally. Also, to fetch these partial samples asynchronously, we
can get the size of each partition in a preliminary step of the analysis.
In my opinion, even this solution can reduce the severity of the problem
drastically.

--
regards,
Andrey Lepikhov
Postgres Professional

#10Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andrey Lepikhov (#9)
Re: Merging statistics from children instead of re-sampling everything

On 2/10/22 12:50, Andrey Lepikhov wrote:

On 21/1/2022 01:25, Tomas Vondra wrote:

But I don't have a very good idea what to do about statistics that we
can't really merge. For some types of statistics it's rather tricky to
reasonably merge the results - ndistinct is a simple example, although
we could work around that by building and merging hyperloglog counters.

I think, as a first step, we can reduce the number of pulled tuples.
We don't really need to pull all the tuples from a remote server - to
construct a reservoir, we can pull only a sample. The reservoir method
needs only a few parameters to return a sample, just as when reading
tuples locally. Also, to fetch these partial samples asynchronously, we
can get the size of each partition in a preliminary step of the analysis.
In my opinion, even this solution can reduce the severity of the problem
drastically.

Oh, wow! I hadn't realized we're fetching all the rows from foreign
(postgres_fdw) partitions. For local partitions we already do that,
because that uses the usual acquire function, with a reservoir
proportional to partition size. I had assumed we use tablesample to
fetch just a small fraction of rows from FDW partitions, and I agree
doing that would be a pretty huge benefit.

I actually tried hacking that together - there's a couple problems with
that (e.g. determining what fraction to sample using bernoulli/system),
but in principle it seems quite doable. Some minor changes to the FDW
API may be necessary, not sure.

Not sure about the async execution - that seems way more complicated,
and the sampling reduces the total cost, async just parallelizes it.

That being said, this thread was not really about foreign partitions,
but about re-analyzing inheritance trees in general. And sampling
foreign partitions doesn't really solve that - we'll still do the
sampling over and over.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Andrey V. Lepikhov
a.lepikhov@postgrespro.ru
In reply to: Tomas Vondra (#10)
Re: Merging statistics from children instead of re-sampling everything

On 2/11/22 03:37, Tomas Vondra wrote:

That being said, this thread was not really about foreign partitions,
but about re-analyzing inheritance trees in general. And sampling
foreign partitions doesn't really solve that - we'll still do the
sampling over and over.

IMO, to solve the problem we should do two things:
1. Avoid repeated partition scans in the case of an inheritance tree.
2. Avoid re-analyzing everything when the active changes affect only a
small subset of partitions.

For (1) I can imagine a solution like multiplexing: at the stage of
defining which relations to scan, group them and prepare the scan
parameters so as to build multiple samples in one shot.
It looks like we need separate logic for the analysis of partitioned
tables - we should form and cache samples on each partition before an
analysis. It requires a prototype to understand the complexity of such a
solution, and it can be done separately from (2).

Task (2) is more difficult to solve. Here we could store the samples
from each partition in the values[] field of pg_statistic, or in a
specific table which stores a 'most probable values' snapshot of each
table.
The most difficult problem here, as you mentioned, is the ndistinct
value. Would it be possible to store not an exactly calculated ndistinct,
but an 'expected value', based on analysis of the samples and histograms
on the partitions? Such a value could also solve the problem of
estimating the grouping of a SETOP result (joins of them, etc.), where we
have statistics only on the sources of the union.

--
regards,
Andrey Lepikhov
Postgres Professional

#12Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andrey V. Lepikhov (#11)
Re: Merging statistics from children instead of re-sampling everything

On 2/11/22 05:29, Andrey V. Lepikhov wrote:

On 2/11/22 03:37, Tomas Vondra wrote:

That being said, this thread was not really about foreign partitions,
but about re-analyzing inheritance trees in general. And sampling
foreign partitions doesn't really solve that - we'll still do the
sampling over and over.

IMO, to solve the problem we should do two things:
1. Avoid repeated partition scans in the case of an inheritance tree.
2. Avoid re-analyzing everything when the active changes affect only a
small subset of partitions.

For (1) I can imagine a solution like multiplexing: at the stage of
defining which relations to scan, group them and prepare the scan
parameters so as to build multiple samples in one shot.
It looks like we need separate logic for the analysis of partitioned
tables - we should form and cache samples on each partition before an
analysis. It requires a prototype to understand the complexity of such a
solution, and it can be done separately from (2).

I'm not sure I understand what you mean by multiplexing. The term
usually means "sending multiple signals at once" but I'm not sure how
that applies to this issue. Can you elaborate?

I assume you mean something like collecting a sample for a partition
once, and then keeping and reusing the sample for future ANALYZE runs,
until invalidated in some sense.

Yeah, I agree that'd be useful - and not just for partitions, actually.
I've been pondering something like that for regular tables, because the
sample might be used for estimation of clauses directly.

But it requires storing the sample somewhere, and I haven't found a good
and simple way to do that. We could serialize that into bytea, or we
could create a new fork, or something, but what should that do with
oversized attributes (how would TOAST work for a fork) and/or large
samples (which might not fit into 1GB bytea)?

Task (2) is more difficult to solve. Here we could store the samples
from each partition in the values[] field of pg_statistic, or in a
specific table which stores a 'most probable values' snapshot of each
table.

I think storing samples in pg_statistic is problematic, because values[]
is subject to the 1GB limit etc. Not an issue for a small MCV list of a
single attribute, but certainly an issue for larger samples. Even if the
data fit, the size of pg_statistic would explode.

The most difficult problem here, as you mentioned, is the ndistinct
value. Would it be possible to store not an exactly calculated ndistinct,
but an 'expected value', based on analysis of the samples and histograms
on the partitions? Such a value could also solve the problem of
estimating the grouping of a SETOP result (joins of them, etc.), where we
have statistics only on the sources of the union.

I think ndistinct is a problem only when merging the final estimates. But
if we're able to calculate and store some intermediate results, we can
easily use HLL and merge that.
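
For reference, merging HLL counters is just an elementwise maximum of
the registers, and it is lossless with respect to the union - a minimal
generic sketch (not the API of src/lib/hyperloglog.c):

typedef struct HLLState
{
	int			nregisters;		/* 2^b registers */
	unsigned char *regs;		/* max hash "rank" seen per register */
} HLLState;

/*
 * Merge src into dst; the result estimates the cardinality of the
 * union of the two input sets, as if a single counter had seen all
 * the values.  Both counters must use the same number of registers.
 */
static void
hll_merge(HLLState *dst, const HLLState *src)
{
	for (int i = 0; i < dst->nregisters; i++)
	{
		if (src->regs[i] > dst->regs[i])
			dst->regs[i] = src->regs[i];
	}
}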

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#7)
Re: Merging statistics from children instead of re-sampling everything

On Wed, Jun 30, 2021 at 11:15 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

You're right maintaining a per-partition samples and merging those might
solve (or at least reduce) some of the problems, e.g. eliminating most
of the I/O that'd be needed for sampling. And yeah, it's not entirely
clear how to merge some of the statistics types (like ndistinct). But
for a lot of the basic stats it works quite nicely, I think.

It feels like you might in some cases get very different answers.
Let's say you have 1000 partitions. In each of those partitions, there
is a particular value that appears in column X in 50% of the rows.
This value differs for every partition. So you can imagine for example
that in partition 1, X = 1 with probability 50%; in partition 2, X = 2
with probability 50%, etc. There is also a value, let's say 0, which
appears in 0.5% of the rows in every partition. It seems possible that
0 is not an MCV in any partition, or in only some of them, but it
might be more common overall than the #1 MCV of any single partition.
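
To put numbers on this (a worked example using the figures above, and
assuming 1000 equal-sized partitions):

    # each partition's #1 MCV: 50% of one partition's rows
    top_mcv_overall = 0.50 / 1000    # = 0.0005 of the whole table
    # value 0: 0.5% of every partition, hence 0.5% of the whole table
    value0_overall = 0.005

So 0 is 10x more common overall than any per-partition #1 MCV, yet it
may appear in no per-partition MCV list at all, and a merge of those
lists would miss it entirely.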

--
Robert Haas
EDB: http://www.enterprisedb.com

#14Andrey V. Lepikhov
a.lepikhov@postgrespro.ru
In reply to: Tomas Vondra (#12)
Re: Merging statistics from children instead of re-sampling everything

On 2/11/22 20:12, Tomas Vondra wrote:

On 2/11/22 05:29, Andrey V. Lepikhov wrote:

On 2/11/22 03:37, Tomas Vondra wrote:

That being said, this thread was not really about foreign partitions,
but about re-analyzing inheritance trees in general. And sampling
foreign partitions doesn't really solve that - we'll still do the
sampling over and over.

IMO, to solve the problem we should do two things:
1. Avoid repeated partition scans in the case of an inheritance tree.
2. Avoid re-analyzing everything in the case of active changes in a
small subset of partitions.

For (1) I can imagine a solution like multiplexing: at the stage of
defining which relations to scan, group them and prepare the scan
parameters so as to collect multiple samples in one shot.

I'm not sure I understand what you mean by multiplexing. The term
usually means "sending multiple signals at once" but I'm not sure how
that applies to this issue. Can you elaborate?

What I propose is to make a set of samples in one scan: one sample for
the plain table, another for its parent, and so on up the inheritance
tree. And cache these samples in memory. We can calculate all the
parameters of the reservoir method to do it.
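
A minimal Python sketch of that single-scan, multi-level idea (the
names are illustrative, not from any patch). Algorithm R works over
the concatenated stream of all children, so feeding one shared
reservoir from successive partition scans keeps the inclusion
probability equal for every row - i.e. each partition contributes
proportionally to its size:

    import random

    class Reservoir:
        def __init__(self, k):
            self.k, self.seen, self.sample = k, 0, []

        def offer(self, row):
            # Algorithm R: after n offered rows, each row is retained
            # with probability k/n
            self.seen += 1
            if len(self.sample) < self.k:
                self.sample.append(row)
            else:
                j = random.randrange(self.seen)
                if j < self.k:
                    self.sample[j] = row

    def scan_partition(rows, reservoirs):
        # one pass over a partition feeds the partition's own reservoir
        # plus one shared reservoir per ancestor in the inheritance tree
        for row in rows:
            for r in reservoirs:
                r.offer(row)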

sample might be used for estimation of clauses directly.

You mean, to use them in difficult cases, such as estimating grouping
over an APPEND?

But it requires storing the sample somewhere, and I haven't found a good
and simple way to do that. We could serialize that into bytea, or we
could create a new fork, or something, but what should that do with
oversized attributes (how would TOAST work for a fork) and/or large
samples (which might not fit into 1GB bytea)?

This feature looks like meta-info over a database. It could be stored
in a separate relation. It is not obvious that we need to use it for
every relation - for example, with large samples. I think it can be
controlled by a table parameter.

--
regards,
Andrey Lepikhov
Postgres Professional

#15Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andrey V. Lepikhov (#14)
Re: Merging statistics from children instead of re-sampling everything

On 2/14/22 11:22, Andrey V. Lepikhov wrote:

On 2/11/22 20:12, Tomas Vondra wrote:

On 2/11/22 05:29, Andrey V. Lepikhov wrote:

On 2/11/22 03:37, Tomas Vondra wrote:

That being said, this thread was not really about foreign partitions,
but about re-analyzing inheritance trees in general. And sampling
foreign partitions doesn't really solve that - we'll still do the
sampling over and over.

IMO, to solve the problem we should do two things:
1. Avoid repeated partition scans in the case of an inheritance tree.
2. Avoid re-analyzing everything in the case of active changes in a
small subset of partitions.

For (1) I can imagine a solution like multiplexing: at the stage of
defining which relations to scan, group them and prepare the scan
parameters so as to collect multiple samples in one shot.

I'm not sure I understand what you mean by multiplexing. The term
usually means "sending multiple signals at once" but I'm not sure how
that applies to this issue. Can you elaborate?

What I propose is to make a set of samples in one scan: one sample for
the plain table, another for its parent, and so on up the inheritance
tree. And cache these samples in memory. We can calculate all the
parameters of the reservoir method to do it.

I doubt keeping the samples just in memory is a good solution. Firstly,
there's the question of memory consumption. Imagine a large partitioned
table with 1-10k partitions. If we keep a "regular" sample (30k rows)
per partition, that's 30M-300M rows. If each row needs 100B, that's
3-30GB of data.

Sure, maybe we could keep smaller per-partition samples, large enough to
get the merged sample of 30k rows. But then you can also have higher
statistics target values, the rows can be larger, etc.

So a couple of GB per inheritance tree can easily happen. And this data
may not be used all that often, so keeping it in memory may be wasteful.

But maybe you have an idea how to optimize the sizes of per-partition
samples? In principle we need

30k * size(partition) / size(total)

for each partition, but the trouble is partitions may be detached, data
may be deleted from some partitions, etc.
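
In code the proportional allocation itself is trivial (a sketch; the
hard part - invalidating this on detach/delete - is exactly what it
does not address):

    def per_partition_sample_sizes(partition_rows, target=30000):
        # split the parent's 30k-row sample proportionally to size
        total = sum(partition_rows.values())
        return {part: max(1, round(target * nrows / total))
                for part, nrows in partition_rows.items()}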

Also, what would happen after a restart? If we lose the samples, we'll
have to resample everything anyway - and after a restart the system is
usually fairly busy, so that's not great timing.

So IMHO the samples need to be serialized, in some way.

sample might be used for estimation of clauses directly.

You mean, to use them in difficult cases, such as estimating grouping
over an APPEND?

That's one example, yes. But the sample might be used even to estimate
complex conditions on a single partition (there's plenty of stuff we
can't estimate from MCV/histogram).

But it requires storing the sample somewhere, and I haven't found a
good and simple way to do that. We could serialize that into bytea, or
we could create a new fork, or something, but what should that do with
oversized attributes (how would TOAST work for a fork) and/or large
samples (which might not fit into 1GB bytea)?

This feature looks like meta-info over a database. It could be stored
in a separate relation. It is not obvious that we need to use it for
every relation - for example, with large samples. I think it can be
controlled by a table parameter.

Well, a separate catalog is one of the options. But I don't see how that
deals with large samples, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Andrey V. Lepikhov
a.lepikhov@postgrespro.ru
In reply to: Tomas Vondra (#15)
Re: Merging statistics from children instead of re-sampling everything

On 2/14/22 20:16, Tomas Vondra wrote:

On 2/14/22 11:22, Andrey V. Lepikhov wrote:

On 2/11/22 20:12, Tomas Vondra wrote:

On 2/11/22 05:29, Andrey V. Lepikhov wrote:

On 2/11/22 03:37, Tomas Vondra wrote:

That being said, this thread was not really about foreign partitions,
but about re-analyzing inheritance trees in general. And sampling
foreign partitions doesn't really solve that - we'll still do the
sampling over and over.

IMO, to solve the problem we should do two things:
1. Avoid repeated partition scans in the case of an inheritance tree.
2. Avoid re-analyzing everything in the case of active changes in a
small subset of partitions.

For (1) I can imagine a solution like multiplexing: at the stage of
defining which relations to scan, group them and prepare the scan
parameters so as to collect multiple samples in one shot.

I'm not sure I understand what you mean by multiplexing. The term
usually means "sending multiple signals at once" but I'm not sure how
that applies to this issue. Can you elaborate?

What I propose is to make a set of samples in one scan: one sample for
the plain table, another for its parent, and so on up the inheritance
tree. And cache these samples in memory. We can calculate all the
parameters of the reservoir method to do it.

I doubt keeping the samples just in memory is a good solution. Firstly,
there's the question of memory consumption. Imagine a large partitioned
table with 1-10k partitions. If we keep a "regular" sample (30k rows)
per partition, that's 30M-300M rows. If each row needs 100B, that's
3-30GB of data.

I'm talking about caching a sample only for the time it is needed
within one ANALYZE operation. Imagine three levels of a partitioned
table. On each partition we create and keep three different samples
(we can do it in one scan). The sample for the plain table we can use
immediately and then destroy.
The sample for the partition on the second level of the hierarchy we
can save to disk for future use (maybe a repeated analyze); the
in-memory data used to form the reservoir has a limited size and can
be destroyed immediately. At the third level we can use the same
logic.
So, at any one moment we only use as many samples as there are levels
in the hierarchy. IMO, that isn't a large number.

the trouble is partitions may be detached, data may be deleted from
some partitions, etc.

Because statistics are not strongly tied to the data, we can use two
strategies: in the case of an explicit 'ANALYZE <table>' we can
recalculate all samples for all partitions, but in the autovacuum case
(or an implicit analysis) we can use not-so-old versions of the
samples, and samples of detached (but not destroyed) partitions, on
the optimistic assumption that this doesn't change the statistics
drastically.

So IMHO the samples need to be serialized, in some way.

Agreed

Well, a separate catalog is one of the options. But I don't see how that
deals with large samples, etc.

I think we can design a fallback to the previous approach in the case
of very large tuples, like the switch from HashJoin to NestedLoop when
we estimate that we don't have enough memory.

--
regards,
Andrey Lepikhov
Postgres Professional

#17Damir Belyalov
dam.bel07@gmail.com
In reply to: Tomas Vondra (#1)
Fwd: Merging statistics from children instead of re-sampling everything

3) stadistinct - This is quite problematic. We only have the per-child
estimates, and it's not clear if there's any overlap. For now I've just
summed it up, because that's safer / similar to what we do for gather
merge paths etc. Maybe we could improve this by estimating the overlap
somehow (e.g. from MCV lists / histograms). But honestly, I doubt the
estimates based on tiny sample of each child are any better. I suppose
we could introduce a column option, determining how to combine ndistinct
(similar to how we can override n_distinct itself).

4) MCV - It's trivial to build a new "parent" MCV list, although it may
be too large (in which case we cut it at statistics target, and copy the
remaining bits to the histogram)

I think there is one approach to solving the problem of calculating
the MCV and ndistinct statistics.
To do this, you need to calculate the density of the sample
distribution and store it, for example, in some statistics slot.
Then, when merging statistics, we sum up the densities of all
partitions as functions and get a new density.
From the new density you can find out which values are most common,
and estimate how many distinct values there are.

To calculate the partition densities, you can use "Kernel Density
Estimation" -
https://www.statsmodels.org/dev/examples/notebooks/generated/kernel_density.html

The approach may be very inaccurate and difficult to implement, but it
solves the problem.
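
A rough Python sketch of the merging step, using scipy's gaussian_kde
(purely illustrative, and only meaningful for continuous numeric
columns):

    import numpy as np
    from scipy.stats import gaussian_kde

    def merged_density(part_samples, part_rowcounts, grid):
        # weighted sum of per-partition kernel density estimates;
        # weights are row counts, so large partitions dominate
        total = sum(part_rowcounts)
        density = np.zeros_like(grid, dtype=float)
        for sample, nrows in zip(part_samples, part_rowcounts):
            density += (nrows / total) * gaussian_kde(sample)(grid)
        return density

Peaks of the merged density would point at candidate common values;
deriving ndistinct from the shape of the density is where the
inaccuracy mentioned above comes in.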

Regards,
Damir Belyalov
Postgres Professional