PoC Refactor AM analyse API

Started by Смирнов Денис about 5 years ago · 14 messages
1 attachment(s)

Hello all!

I suggest a refactoring of the analyze AM API, as it is too heap-specific at the moment. The problem was inspired by Greenplum's analyze improvement for the append-optimized row and column AMs with variable-size compressed blocks.
Currently we do analyze in two steps.

1. Sample fixed-size blocks with algorithm S from Knuth (the BlockSampler functions).
2. Collect tuples into a reservoir with algorithm Z from Vitter.

So this doesn't work, for example, for AMs that use variable-sized physical blocks. They need weighted random sampling (WRS) algorithms such as A-Chao, or logical blocks to which Knuth's algorithm S can be applied (and those have a problem with RelationGetNumberOfBlocks() estimating the physical number of blocks). Another problem concerns columns: they are not passed to the analyze begin-scan, so ANALYZE TABLE (col) cannot benefit from column storage.
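
For illustration, here is a minimal sketch of the A-Chao idea: every block's chance of staying in the reservoir is proportional to its weight (say, its compressed size or tuple count). The Block type and random_fract() helper are hypothetical placeholders, and the correction step for over-weighted items is omitted for brevity:

#include <stdlib.h>

typedef struct
{
	double		weight;			/* e.g. compressed block size */
	/* ... block payload ... */
} Block;

/* uniform random value in [0, 1); placeholder for a real PRNG */
static double
random_fract(void)
{
	return (double) rand() / ((double) RAND_MAX + 1);
}

static void
achao_sample(const Block *stream, int nstream, Block *reservoir, int k)
{
	double		W = 0.0;		/* running total weight seen so far */

	for (int i = 0; i < nstream; i++)
	{
		W += stream[i].weight;
		if (i < k)
			reservoir[i] = stream[i];	/* fill the reservoir first */
		else if (random_fract() < (double) k * stream[i].weight / W)
			reservoir[(int) (k * random_fract())] = stream[i];	/* evict one at random */
	}
}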

The suggestion is to replace table_scan_analyze_next_block() and table_scan_analyze_next_tuple() with a single function, table_acquire_sample_rows(). The AM implementation of table_acquire_sample_rows() can use the BlockSampler functions if it wants to, but if the AM is not block-oriented, it can do something else. This suggestion also passes VacAttrStats to table_acquire_sample_rows() for column-oriented AMs, and removes the PROGRESS_ANALYZE_BLOCKS_TOTAL and PROGRESS_ANALYZE_BLOCKS_DONE definitions, as not all AMs are block-oriented.
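
As a rough sketch of the flexibility this buys, a non-block-oriented AM could implement the callback with single-level Vitter sampling only. Here myam_scan_next_tuple() is a hypothetical AM-internal iterator; everything else mirrors the heap implementation in the attached patch:

static int
myam_acquire_sample_rows(TableScanDesc scan, int elevel,
						 int natts, VacAttrStats **stats,
						 BufferAccessStrategy bstrategy,
						 HeapTuple *rows, int targrows,
						 double *totalrows, double *totaldeadrows)
{
	int			numrows = 0;	/* # rows now in reservoir */
	double		samplerows = 0; /* total # rows visited */
	double		rowstoskip = -1;	/* -1 means not set yet */
	ReservoirStateData rstate;
	TupleTableSlot *slot = table_slot_create(scan->rs_rd, NULL);

	reservoir_init_selection_state(&rstate, targrows);

	/* No block sampling stage: just walk every tuple of the scan */
	while (myam_scan_next_tuple(scan, slot))	/* hypothetical iterator */
	{
		if (numrows < targrows)
			rows[numrows++] = ExecCopySlotHeapTuple(slot);
		else
		{
			if (rowstoskip < 0)
				rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
			if (rowstoskip <= 0)
			{
				int			k = (int) (targrows * sampler_random_fract(rstate.randstate));

				heap_freetuple(rows[k]);
				rows[k] = ExecCopySlotHeapTuple(slot);
			}
			rowstoskip -= 1;
		}
		samplerows += 1;
	}

	ExecDropSingleTupleTableSlot(slot);
	*totalrows = samplerows;	/* every live tuple was visited */
	*totaldeadrows = 0;			/* assume this AM exposes no dead tuples */
	return numrows;
}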

Best regards,
Denis Smirnov | Developer
sd@arenadata.io
Arenadata | Godovikova 9-17, Moscow 129085 Russia

Attachments:

am-analyze-1.patch (application/octet-stream; x-unix-mode=0644)
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 9863e32748..3cdb839489 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -159,6 +159,7 @@ static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
 						   FileFdwPlanState *fdw_private,
 						   Cost *startup_cost, Cost *total_cost);
 static int	file_acquire_sample_rows(Relation onerel, int elevel,
+									 int natts, VacAttrStats **stats,
 									 HeapTuple *rows, int targrows,
 									 double *totalrows, double *totaldeadrows);
 
@@ -1093,6 +1094,7 @@ estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
  */
 static int
 file_acquire_sample_rows(Relation onerel, int elevel,
+						 int natts, VacAttrStats **stats,
 						 HeapTuple *rows, int targrows,
 						 double *totalrows, double *totaldeadrows)
 {
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index b6c72e1d1e..471970601a 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -466,6 +466,7 @@ static void process_query_params(ExprContext *econtext,
 								 List *param_exprs,
 								 const char **param_values);
 static int	postgresAcquireSampleRowsFunc(Relation relation, int elevel,
+										  int natts, VacAttrStats **stats,
 										  HeapTuple *rows, int targrows,
 										  double *totalrows,
 										  double *totaldeadrows);
@@ -4483,6 +4484,7 @@ postgresAnalyzeForeignTable(Relation relation,
  */
 static int
 postgresAcquireSampleRowsFunc(Relation relation, int elevel,
+							  int natts, VacAttrStats **stats,
 							  HeapTuple *rows, int targrows,
 							  double *totalrows,
 							  double *totaldeadrows)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3eea215b85..b2f31626b8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "commands/progress.h"
+#include "commands/vacuum.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -44,6 +45,7 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "utils/sampling.h"
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
@@ -1154,6 +1156,151 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 	return false;
 }
 
+/*
+ * As of May 2004 we use a new two-stage method:  Stage one selects up
+ * to targrows random blocks (or all blocks, if there aren't so many).
+ * Stage two scans these blocks and uses the Vitter algorithm to create
+ * a random sample of targrows rows (or less, if there are less in the
+ * sample of blocks).  The two stages are executed simultaneously: each
+ * block is processed as soon as stage one returns its number and while
+ * the rows are read stage two controls which ones are to be inserted
+ * into the sample.
+ *
+ * Although every row has an equal chance of ending up in the final
+ * sample, this sampling method is not perfect: not every possible
+ * sample has an equal chance of being selected.  For large relations
+ * the number of different blocks represented by the sample tends to be
+ * too small.  We can live with that for now.  Improvements are welcome.
+ *
+ * An important property of this sampling method is that because we do
+ * look at a statistically unbiased set of blocks, we should get
+ * unbiased estimates of the average numbers of live and dead rows per
+ * block.  The previous sampling method put too much credence in the row
+ * density near the start of the table.
+ */
+static int
+heapam_acquire_sample_rows(TableScanDesc scan, int elevel,
+						   int natts, VacAttrStats **stats,
+						   BufferAccessStrategy bstrategy,
+						   HeapTuple *rows, int targrows,
+						   double *totalrows, double *totaldeadrows)
+{
+	int			numrows = 0;	/* # rows now in reservoir */
+	double		samplerows = 0; /* total # rows collected */
+	double		liverows = 0;	/* # live rows seen */
+	double		deadrows = 0;	/* # dead rows seen */
+	double		rowstoskip = -1;	/* -1 means not set yet */
+	BlockNumber totalblocks;
+	TransactionId OldestXmin;
+	BlockSamplerData bs;
+	ReservoirStateData rstate;
+	TupleTableSlot *slot;
+	BlockNumber nblocks;
+
+	totalblocks = RelationGetNumberOfBlocks(scan->rs_rd);
+
+	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
+	OldestXmin = GetOldestNonRemovableTransactionId(scan->rs_rd);
+
+	/* Prepare for sampling block numbers */
+	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, random());
+
+	/* Prepare for sampling rows */
+	reservoir_init_selection_state(&rstate, targrows);
+
+	slot = table_slot_create(scan->rs_rd, NULL);
+
+	/* Outer loop over blocks to sample */
+	while (BlockSampler_HasMore(&bs))
+	{
+		BlockNumber targblock = BlockSampler_Next(&bs);
+
+		vacuum_delay_point();
+
+		if (!heapam_scan_analyze_next_block(scan, targblock, bstrategy))
+			continue;
+
+		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+		{
+			/*
+			 * The first targrows sample rows are simply copied into the
+			 * reservoir. Then we start replacing tuples in the sample until
+			 * we reach the end of the relation.  This algorithm is from Jeff
+			 * Vitter's paper (see full citation in utils/misc/sampling.c). It
+			 * works by repeatedly computing the number of tuples to skip
+			 * before selecting a tuple, which replaces a randomly chosen
+			 * element of the reservoir (current set of tuples).  At all times
+			 * the reservoir is a true random sample of the tuples we've
+			 * passed over so far, so when we fall off the end of the relation
+			 * we're done.
+			 */
+			if (numrows < targrows)
+				rows[numrows++] = ExecCopySlotHeapTuple(slot);
+			else
+			{
+				/*
+				 * t in Vitter's paper is the number of records already
+				 * processed.  If we need to compute a new S value, we must
+				 * use the not-yet-incremented value of samplerows as t.
+				 */
+				if (rowstoskip < 0)
+					rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
+
+				if (rowstoskip <= 0)
+				{
+					/*
+					 * Found a suitable tuple, so save it, replacing one old
+					 * tuple at random
+					 */
+					int			k = (int) (targrows * sampler_random_fract(rstate.randstate));
+
+					Assert(k >= 0 && k < targrows);
+					heap_freetuple(rows[k]);
+					rows[k] = ExecCopySlotHeapTuple(slot);
+				}
+
+				rowstoskip -= 1;
+			}
+
+			samplerows += 1;
+		}
+	}
+
+	ExecDropSingleTupleTableSlot(slot);
+
+	/*
+	 * Estimate total numbers of live and dead rows in relation, extrapolating
+	 * on the assumption that the average tuple density in pages we didn't
+	 * scan is the same as in the pages we did scan.  Since what we scanned is
+	 * a random sample of the pages in the relation, this should be a good
+	 * assumption.
+	 */
+	if (bs.m > 0)
+	{
+		*totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
+		*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
+	}
+	else
+	{
+		*totalrows = 0.0;
+		*totaldeadrows = 0.0;
+	}
+
+	/*
+	 * Emit some interesting relation info
+	 */
+	ereport(elevel,
+			(errmsg("\"%s\": scanned %d of %u pages, "
+					"containing %.0f live rows and %.0f dead rows; "
+					"%d rows in sample, %.0f estimated total rows",
+					RelationGetRelationName(scan->rs_rd),
+					bs.m, totalblocks,
+					liverows, deadrows,
+					numrows, *totalrows)));
+
+	return numrows;
+}
+
 static double
 heapam_index_build_range_scan(Relation heapRelation,
 							  Relation indexRelation,
@@ -1657,13 +1804,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 			offnum = ItemPointerGetOffsetNumber(&heapTuple->t_self);
 
 			/*
-			 * If a HOT tuple points to a root that we don't know
-			 * about, obtain root items afresh.  If that still fails,
-			 * report it as corruption.
+			 * If a HOT tuple points to a root that we don't know about,
+			 * obtain root items afresh.  If that still fails, report it as
+			 * corruption.
 			 */
 			if (root_offsets[offnum - 1] == InvalidOffsetNumber)
 			{
-				Page	page = BufferGetPage(hscan->rs_cbuf);
+				Page		page = BufferGetPage(hscan->rs_cbuf);
 
 				LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
 				heap_get_root_tuples(page, root_offsets);
@@ -2569,8 +2716,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_copy_data = heapam_relation_copy_data,
 	.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
 	.relation_vacuum = heap_vacuum_rel,
-	.scan_analyze_next_block = heapam_scan_analyze_next_block,
-	.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
+	.acquire_sample_rows = heapam_acquire_sample_rows,
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
 
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 58de0743ba..0a981e20e4 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -87,8 +87,7 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->relation_copy_data != NULL);
 	Assert(routine->relation_copy_for_cluster != NULL);
 	Assert(routine->relation_vacuum != NULL);
-	Assert(routine->scan_analyze_next_block != NULL);
-	Assert(routine->scan_analyze_next_tuple != NULL);
+	Assert(routine->acquire_sample_rows != NULL);
 	Assert(routine->index_build_range_scan != NULL);
 	Assert(routine->index_validate_scan != NULL);
 
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..e6476d7047 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -37,7 +37,6 @@
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
 #include "commands/tablecmds.h"
-#include "commands/vacuum.h"
 #include "executor/executor.h"
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
@@ -61,7 +60,6 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/pg_rusage.h"
-#include "utils/sampling.h"
 #include "utils/sortsupport.h"
 #include "utils/syscache.h"
 #include "utils/timestamp.h"
@@ -96,10 +94,12 @@ static void compute_index_stats(Relation onerel, double totalrows,
 static VacAttrStats *examine_attribute(Relation onerel, int attnum,
 									   Node *index_expr);
 static int	acquire_sample_rows(Relation onerel, int elevel,
+								int natts, VacAttrStats **stats,
 								HeapTuple *rows, int targrows,
 								double *totalrows, double *totaldeadrows);
 static int	compare_rows(const void *a, const void *b);
 static int	acquire_inherited_sample_rows(Relation onerel, int elevel,
+										  int natts, VacAttrStats **stats,
 										  HeapTuple *rows, int targrows,
 										  double *totalrows, double *totaldeadrows);
 static void update_attstats(Oid relid, bool inh,
@@ -505,10 +505,12 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 								 PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS);
 	if (inh)
 		numrows = acquire_inherited_sample_rows(onerel, elevel,
+												attr_cnt, vacattrstats,
 												rows, targrows,
 												&totalrows, &totaldeadrows);
 	else
 		numrows = (*acquirefunc) (onerel, elevel,
+								  attr_cnt, vacattrstats,
 								  rows, targrows,
 								  &totalrows, &totaldeadrows);
 
@@ -999,127 +1001,26 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
  *
  * The returned list of tuples is in order by physical position in the table.
  * (We will rely on this later to derive correlation estimates.)
- *
- * As of May 2004 we use a new two-stage method:  Stage one selects up
- * to targrows random blocks (or all blocks, if there aren't so many).
- * Stage two scans these blocks and uses the Vitter algorithm to create
- * a random sample of targrows rows (or less, if there are less in the
- * sample of blocks).  The two stages are executed simultaneously: each
- * block is processed as soon as stage one returns its number and while
- * the rows are read stage two controls which ones are to be inserted
- * into the sample.
- *
- * Although every row has an equal chance of ending up in the final
- * sample, this sampling method is not perfect: not every possible
- * sample has an equal chance of being selected.  For large relations
- * the number of different blocks represented by the sample tends to be
- * too small.  We can live with that for now.  Improvements are welcome.
- *
- * An important property of this sampling method is that because we do
- * look at a statistically unbiased set of blocks, we should get
- * unbiased estimates of the average numbers of live and dead rows per
- * block.  The previous sampling method put too much credence in the row
- * density near the start of the table.
  */
 static int
 acquire_sample_rows(Relation onerel, int elevel,
+					int natts, VacAttrStats **stats,
 					HeapTuple *rows, int targrows,
 					double *totalrows, double *totaldeadrows)
 {
 	int			numrows = 0;	/* # rows now in reservoir */
-	double		samplerows = 0; /* total # rows collected */
-	double		liverows = 0;	/* # live rows seen */
-	double		deadrows = 0;	/* # dead rows seen */
-	double		rowstoskip = -1;	/* -1 means not set yet */
-	BlockNumber totalblocks;
-	TransactionId OldestXmin;
-	BlockSamplerData bs;
-	ReservoirStateData rstate;
-	TupleTableSlot *slot;
 	TableScanDesc scan;
-	BlockNumber nblocks;
-	BlockNumber blksdone = 0;
 
 	Assert(targrows > 0);
 
-	totalblocks = RelationGetNumberOfBlocks(onerel);
-
-	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
-
-	/* Prepare for sampling block numbers */
-	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, random());
-
-	/* Report sampling block numbers */
-	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
-								 nblocks);
-
-	/* Prepare for sampling rows */
-	reservoir_init_selection_state(&rstate, targrows);
-
 	scan = table_beginscan_analyze(onerel);
-	slot = table_slot_create(onerel, NULL);
 
-	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
-	{
-		BlockNumber targblock = BlockSampler_Next(&bs);
+	numrows = table_acquire_sample_rows(scan, elevel,
+										natts, stats,
+										vac_strategy, rows,
+										targrows, totalrows,
+										totaldeadrows);
 
-		vacuum_delay_point();
-
-		if (!table_scan_analyze_next_block(scan, targblock, vac_strategy))
-			continue;
-
-		while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
-		{
-			/*
-			 * The first targrows sample rows are simply copied into the
-			 * reservoir. Then we start replacing tuples in the sample until
-			 * we reach the end of the relation.  This algorithm is from Jeff
-			 * Vitter's paper (see full citation in utils/misc/sampling.c). It
-			 * works by repeatedly computing the number of tuples to skip
-			 * before selecting a tuple, which replaces a randomly chosen
-			 * element of the reservoir (current set of tuples).  At all times
-			 * the reservoir is a true random sample of the tuples we've
-			 * passed over so far, so when we fall off the end of the relation
-			 * we're done.
-			 */
-			if (numrows < targrows)
-				rows[numrows++] = ExecCopySlotHeapTuple(slot);
-			else
-			{
-				/*
-				 * t in Vitter's paper is the number of records already
-				 * processed.  If we need to compute a new S value, we must
-				 * use the not-yet-incremented value of samplerows as t.
-				 */
-				if (rowstoskip < 0)
-					rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
-
-				if (rowstoskip <= 0)
-				{
-					/*
-					 * Found a suitable tuple, so save it, replacing one old
-					 * tuple at random
-					 */
-					int			k = (int) (targrows * sampler_random_fract(rstate.randstate));
-
-					Assert(k >= 0 && k < targrows);
-					heap_freetuple(rows[k]);
-					rows[k] = ExecCopySlotHeapTuple(slot);
-				}
-
-				rowstoskip -= 1;
-			}
-
-			samplerows += 1;
-		}
-
-		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
-									 ++blksdone);
-	}
-
-	ExecDropSingleTupleTableSlot(slot);
 	table_endscan(scan);
 
 	/*
@@ -1133,36 +1034,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	if (numrows == targrows)
 		qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
 
-	/*
-	 * Estimate total numbers of live and dead rows in relation, extrapolating
-	 * on the assumption that the average tuple density in pages we didn't
-	 * scan is the same as in the pages we did scan.  Since what we scanned is
-	 * a random sample of the pages in the relation, this should be a good
-	 * assumption.
-	 */
-	if (bs.m > 0)
-	{
-		*totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
-		*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
-	}
-	else
-	{
-		*totalrows = 0.0;
-		*totaldeadrows = 0.0;
-	}
-
-	/*
-	 * Emit some interesting relation info
-	 */
-	ereport(elevel,
-			(errmsg("\"%s\": scanned %d of %u pages, "
-					"containing %.0f live rows and %.0f dead rows; "
-					"%d rows in sample, %.0f estimated total rows",
-					RelationGetRelationName(onerel),
-					bs.m, totalblocks,
-					liverows, deadrows,
-					numrows, *totalrows)));
-
 	return numrows;
 }
 
@@ -1201,6 +1072,7 @@ compare_rows(const void *a, const void *b)
  */
 static int
 acquire_inherited_sample_rows(Relation onerel, int elevel,
+							  int natts, VacAttrStats **stats,
 							  HeapTuple *rows, int targrows,
 							  double *totalrows, double *totaldeadrows)
 {
@@ -1374,6 +1246,7 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 
 				/* Fetch a random sample of the child's rows */
 				childrows = (*acquirefunc) (childrel, elevel,
+											natts, stats,
 											rows + numrows, childtargrows,
 											&trows, &tdrows);
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 387eb34a61..f9359e27ed 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -20,6 +20,7 @@
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
+#include "commands/vacuum.h"
 #include "utils/guc.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -502,39 +503,18 @@ typedef struct TableAmRoutine
 									BufferAccessStrategy bstrategy);
 
 	/*
-	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
-	 * with table_beginscan_analyze().  See also
-	 * table_scan_analyze_next_block().
-	 *
-	 * The callback may acquire resources like locks that are held until
-	 * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
-	 * to hold a lock until all tuples on a block have been analyzed by
-	 * scan_analyze_next_tuple.
-	 *
-	 * The callback can return false if the block is not suitable for
-	 * sampling, e.g. because it's a metapage that could never contain tuples.
-	 *
-	 * XXX: This obviously is primarily suited for block-based AMs. It's not
-	 * clear what a good interface for non block based AMs would be, so there
-	 * isn't one yet.
+	 * This callback needs to fill the reservoir with sample rows during an
+	 * analyze scan.
 	 */
-	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-											BlockNumber blockno,
-											BufferAccessStrategy bstrategy);
-
-	/*
-	 * See table_scan_analyze_next_tuple().
-	 *
-	 * Not every AM might have a meaningful concept of dead rows, in which
-	 * case it's OK to not increment *deadrows - but note that that may
-	 * influence autovacuum scheduling (see comment for relation_vacuum
-	 * callback).
-	 */
-	bool		(*scan_analyze_next_tuple) (TableScanDesc scan,
-											TransactionId OldestXmin,
-											double *liverows,
-											double *deadrows,
-											TupleTableSlot *slot);
+	int			(*acquire_sample_rows) (TableScanDesc scan,
+										int elevel,
+										int natts,
+										VacAttrStats **stats,
+										BufferAccessStrategy bstrategy,
+										HeapTuple *rows,
+										int targrows,
+										double *totalrows,
+										double *totaldeadrows);
 
 	/* see table_index_build_range_scan for reference about parameters */
 	double		(*index_build_range_scan) (Relation table_rel,
@@ -1485,40 +1465,18 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 	rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`. The scan needs to have been
- * started with table_beginscan_analyze().  Note that this routine might
- * acquire resources like locks that are held until
- * table_scan_analyze_next_tuple() returns false.
- *
- * Returns false if block is unsuitable for sampling, true otherwise.
- */
-static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							  BufferAccessStrategy bstrategy)
+static inline int
+table_acquire_sample_rows(TableScanDesc scan, int elevel,
+						  int natts, VacAttrStats **stats,
+						  BufferAccessStrategy bstrategy,
+						  HeapTuple *rows, int targrows,
+						  double *totalrows, double *totaldeadrows)
 {
-	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
-															bstrategy);
-}
-
-/*
- * Iterate over tuples in the block selected with
- * table_scan_analyze_next_block() (which needs to have returned true, and
- * this routine may not have returned false for the same block before). If a
- * tuple that's suitable for sampling is found, true is returned and a tuple
- * is stored in `slot`.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-static inline bool
-table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
-							  double *liverows, double *deadrows,
-							  TupleTableSlot *slot)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
-															liverows, deadrows,
-															slot);
+	return scan->rs_rd->rd_tableam->acquire_sample_rows(scan, elevel,
+														natts, stats,
+														bstrategy, rows,
+														targrows, totalrows,
+														totaldeadrows);
 }
 
 /*
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 36b073e677..397c8612d4 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -36,13 +36,11 @@
 
 /* Progress parameters for analyze */
 #define PROGRESS_ANALYZE_PHASE						0
-#define PROGRESS_ANALYZE_BLOCKS_TOTAL				1
-#define PROGRESS_ANALYZE_BLOCKS_DONE				2
-#define PROGRESS_ANALYZE_EXT_STATS_TOTAL			3
-#define PROGRESS_ANALYZE_EXT_STATS_COMPUTED			4
-#define PROGRESS_ANALYZE_CHILD_TABLES_TOTAL			5
-#define PROGRESS_ANALYZE_CHILD_TABLES_DONE			6
-#define PROGRESS_ANALYZE_CURRENT_CHILD_TABLE_RELID	7
+#define PROGRESS_ANALYZE_EXT_STATS_TOTAL			1
+#define PROGRESS_ANALYZE_EXT_STATS_COMPUTED			2
+#define PROGRESS_ANALYZE_CHILD_TABLES_TOTAL			3
+#define PROGRESS_ANALYZE_CHILD_TABLES_DONE			4
+#define PROGRESS_ANALYZE_CURRENT_CHILD_TABLE_RELID	5
 
 /* Phases of analyze (as advertised via PROGRESS_ANALYZE_PHASE) */
 #define PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS		1
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..f056d02b4a 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -13,6 +13,7 @@
 #define FDWAPI_H
 
 #include "access/parallel.h"
+#include "commands/vacuum.h"
 #include "nodes/execnodes.h"
 #include "nodes/pathnodes.h"
 
@@ -140,6 +141,7 @@ typedef void (*ExplainDirectModify_function) (ForeignScanState *node,
 											  struct ExplainState *es);
 
 typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
+									  int natts, VacAttrStats **stats,
 									  HeapTuple *rows, int targrows,
 									  double *totalrows,
 									  double *totaldeadrows);
#2 Denis Smirnov
sd@arenadata.io
In reply to: Смирнов Денис (#1)
1 attachment(s)
Re: PoC Refactor AM analyse API

It seems that my mail client set the wrong MIME type for the attached patch, and it was filtered by the web archive. So I attach the patch again for the web archive.

Attachments:

am-analyze-1.patch.txt (text/plain; x-unix-mode=0644)
#3 Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Смирнов Денис (#1)
Re: PoC Refactor AM analyse API

Hi Denis!

On 7 Dec 2020, at 18:23, Смирнов Денис <sd@arenadata.io> wrote:

I suggest a refactoring of the analyze AM API, as it is too heap-specific at the moment. The problem was inspired by Greenplum's analyze improvement for the append-optimized row and column AMs with variable-size compressed blocks.
Currently we do analyze in two steps.

1. Sample fixed-size blocks with algorithm S from Knuth (the BlockSampler functions).
2. Collect tuples into a reservoir with algorithm Z from Vitter.

So this doesn't work, for example, for AMs that use variable-sized physical blocks. They need weighted random sampling (WRS) algorithms such as A-Chao, or logical blocks to which Knuth's algorithm S can be applied (and those have a problem with RelationGetNumberOfBlocks() estimating the physical number of blocks). Another problem concerns columns: they are not passed to the analyze begin-scan, so ANALYZE TABLE (col) cannot benefit from column storage.

The suggestion is to replace table_scan_analyze_next_block() and table_scan_analyze_next_tuple() with a single function, table_acquire_sample_rows(). The AM implementation of table_acquire_sample_rows() can use the BlockSampler functions if it wants to, but if the AM is not block-oriented, it can do something else. This suggestion also passes VacAttrStats to table_acquire_sample_rows() for column-oriented AMs, and removes the PROGRESS_ANALYZE_BLOCKS_TOTAL and PROGRESS_ANALYZE_BLOCKS_DONE definitions, as not all AMs are block-oriented.

Just a few random notes about the idea.
Heap pages are not of a fixed size when measured in tuple count, and a comment in the code describes this:
* Although every row has an equal chance of ending up in the final
* sample, this sampling method is not perfect: not every possible
* sample has an equal chance of being selected. For large relations
* the number of different blocks represented by the sample tends to be
* too small. We can live with that for now. Improvements are welcome.

The current implementation provides a framework with shared code, but this framework is only suitable for block-of-tuples AMs, and it has statistical downsides when the tuple count per block varies too much.
Maybe we can just provide a richer API? To have both: avoid copying code and provide flexibility.

Best regards, Andrey Borodin.

#4 Denis Smirnov
sd@arenadata.io
In reply to: Andrey Borodin (#3)
Re: PoC Refactor AM analyse API

Andrey, thanks for your feedback!

I agree that AMs with fixed-size blocks can have much-alike code in acquire_sample_rows() (though it is not a rule). But there are several points about the current master sampling.

* It is not perfect - AM developers may want to improve it with other sampling algorithms.
* It is designed under a big influence of the heap AM - for example, RelationGetNumberOfBlocks() returns uint32, while other AMs can have a bigger number of blocks.
* heapam_acquire_sample_rows() is a small function - I don't think it is a big trouble for any AM developer to write something alike.
* Some AMs may have single-level sampling (only algorithm Z from Vitter, for example) - why not?

As a result we get a single and clear method to acquire rows for statistics. If we don't modify but rather extend the current API (for example, in the manner it is done for FDWs), the code becomes more complicated and difficult to understand.


#5 Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Denis Smirnov (#4)
Re: PoC Refactor AM analyse API

On 8 Dec 2020, at 16:44, Denis Smirnov <sd@arenadata.io> wrote:

Andrey, thanks for your feedback!

I agree that AMs with fixed-size blocks can have much-alike code in acquire_sample_rows() (though it is not a rule). But there are several points about the current master sampling.

* It is not perfect - AM developers may want to improve it with other sampling algorithms.
* It is designed under a big influence of the heap AM - for example, RelationGetNumberOfBlocks() returns uint32, while other AMs can have a bigger number of blocks.
* heapam_acquire_sample_rows() is a small function - I don't think it is a big trouble for any AM developer to write something alike.
* Some AMs may have single-level sampling (only algorithm Z from Vitter, for example) - why not?

As a result we get a single and clear method to acquire rows for statistics. If we don't modify but rather extend the current API (for example, in the manner it is done for FDWs), the code becomes more complicated and difficult to understand.

This makes sense. The purpose of the API is to provide a flexible abstraction, and the current table_scan_analyze_next_block()/table_scan_analyze_next_tuple() API assumes too much about the AM implementation.
But why do you pass int natts and VacAttrStats **stats to acquire_sample_rows()? Is it of any use? It seems to break abstraction too.

Best regards, Andrey Borodin.

#6Denis Smirnov
sd@arenadata.io
In reply to: Andrey Borodin (#5)
Re: PoC Refactor AM analyse API

But why do you pass int natts and VacAttrStats **stats to acquire_sample_rows()? Is it of any use? It seems to break the abstraction too.

Yes, it is really a kluge that should be discussed. The main problem is that we don't pass projection information to the analyze scan (analyze begin scan relies only on relation information during initialization). As a result, we can't benefit from column AMs during "ANALYZE t(col)" and consume data only from the target columns. These parameters were added to solve this problem.
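To make that concrete, here is a hypothetical sketch (not from the patch) of how a columnar AM could use these parameters; it assumes each VacAttrStats entry carries the attribute number in tupattnum, as vacuum.h defines today:

#include "nodes/bitmapset.h"
#include "commands/vacuum.h"

/* Collect the set of columns ANALYZE actually needs. */
static Bitmapset *
columns_to_analyze(int natts, VacAttrStats **stats)
{
    Bitmapset  *proj = NULL;

    for (int i = 0; i < natts; i++)
        proj = bms_add_member(proj, stats[i]->tupattnum);

    return proj;    /* hand this to the AM's scan initialization */
}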

Best regards,
Denis Smirnov | Developer
sd@arenadata.io
Arenadata | Godovikova 9-17, Moscow 129085 Russia

#7Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Denis Smirnov (#6)
Re: PoC Refactor AM analyse API

On 30/12/2020 11:12, Denis Smirnov wrote:

But why do you pass int natts and VacAttrStats **stats to
acquire_sample_rows()? Is it of any use? It seems to break
the abstraction too.

Yes, it is really a kluge that should be discussed. The main problem
is that we don't pass projection information to the analyze scan
(analyze begin scan relies only on relation information during
initialization). As a result, we can't benefit from column AMs
during "ANALYZE t(col)" and consume data only from the target columns.
These parameters were added to solve this problem.

The documentation needs to be updated accordingly, see
AcquireSampleRowsFunc in fdwhandler.sgml.

This part of the patch, adding the list of columns being analyzed, seems
a bit unfinished. I'd suggest to leave that out for now, and add it as
part of the "Table AM modifications to accept column projection lists"
patch that's also in this commitfest [1].

This suggestion also ... removes PROGRESS_ANALYZE_BLOCKS_TOTAL and
PROGRESS_ANALYZE_BLOCKS_DONE definitions as not all AMs can be
block-oriented.

We shouldn't just remove it; a progress indicator is nice. It's true
that not all AMs are block-oriented, but those that are can still use
those fields. Perhaps we can add other PROGRESS_ANALYZE_* states for
non-block-oriented AMs, but that can wait until there is a concrete use
for it.

static int
acquire_sample_rows(Relation onerel, int elevel,
                    HeapTuple *rows, int targrows,
                    double *totalrows, double *totaldeadrows)
{
    int         numrows = 0;    /* # rows now in reservoir */
    TableScanDesc scan;

    Assert(targrows > 0);

    scan = table_beginscan_analyze(onerel);

    numrows = table_acquire_sample_rows(scan, elevel,
                                        natts, stats,
                                        vac_strategy, rows,
                                        targrows, totalrows,
                                        totaldeadrows);

    table_endscan(scan);

    /*
     * If we didn't find as many tuples as we wanted then we're done. No sort
     * is needed, since they're already in order.
     *
     * Otherwise we need to sort the collected tuples by position
     * (itempointer). It's not worth worrying about corner cases where the
     * tuples are already sorted.
     */
    if (numrows == targrows)
        qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);

    return numrows;
}

Perhaps better to move the qsort() into heapam_acquire_sample_rows(),
and document that the acquire_sample_rows() AM function must return the
tuples in 'ctid' order. In a generic API, it seems like a shady
assumption that they must be in order if we didn't find as many rows as
we wanted. Or always call qsort(); if the tuples are already in order,
that should be pretty quick.

The table_beginscan_analyze() call seems a bit pointless now. Let's
remove it, and pass the Relation to table_acquire_sample_rows directly.

/*
* This callback needs to fill reservour with sample rows during analyze
* scan.
*/
int (*acquire_sample_rows) (TableScanDesc scan,

The "reservoir" is related to the block sampler, but that's just an
implementation detail of the function. Perhaps something like "Acquire a
sample of rows from the table, for ANALYZE". And explain the arguments
here, or in table_acquire_sample_rows().

Overall, I like where this patch is going.

[1]: https://commitfest.postgresql.org/31/2922/

- Heikki

#8Denis Smirnov
sd@arenadata.io
In reply to: Heikki Linnakangas (#7)
1 attachment(s)
Re: PoC Refactor AM analyse API

Thanks for your review, Heikki.

I have made the changes you have requested.

1. All modifications interconnected with column projection were reverted (they should be added in https://commitfest.postgresql.org/31/2922 if the current patch is merged one day).
2. I have returned the PROGRESS_ANALYZE_* states.
3. qsort() was moved into heapam_acquire_sample_rows(). A comment was also added stating that the acquire_sample_rows() AM function must return the tuples in physical table order.
4. table_beginscan_analyze() was removed as a redundant function.
5. The acquire_sample_rows() comment about the reservoir was changed.

Attachments:

v2-0001-Refactor-AM-analyze-acqiure-rows-method.patch.txttext/plain; name=v2-0001-Refactor-AM-analyze-acqiure-rows-method.patch.txt; x-unix-mode=0644Download
From a5104b15683524a375aab1beb9615f690de31882 Mon Sep 17 00:00:00 2001
From: Denis Smirnov <sd@arenadata.io>
Date: Sat, 5 Dec 2020 23:25:29 +1000
Subject: [PATCH v2] Refactor AM analyze (acqiure rows method)

Analyze AM functions were implemented for fixed-size block AMs (heap
influence). Other AMs with variable-size blocks are not able to use
the current two-stage block sampling (Knuth's algorithm S and Vitter's
algorithm Z). This refactoring gives AM developers more freedom to
implement analyze for non-fixed-size AMs.

Discussion: https://www.postgresql.org/message-id/flat/C7CFE16B-F192-4124-BEBB-7864285E0FF7%40arenadata.io
---
 src/backend/access/heap/heapam_handler.c | 199 ++++++++++++++++++++++-
 src/backend/access/table/tableamapi.c    |   3 +-
 src/backend/commands/analyze.c           | 189 +--------------------
 src/include/access/tableam.h             |  99 +++--------
 4 files changed, 227 insertions(+), 263 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4a70e20a14..6bf0c887be 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,9 +33,11 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "commands/progress.h"
+#include "commands/vacuum.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "port.h"
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
@@ -44,6 +46,7 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "utils/sampling.h"
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
@@ -53,6 +56,8 @@ static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
 								   HeapTuple tuple,
 								   OffsetNumber tupoffset);
 
+static int	compare_rows(const void *a, const void *b);
+
 static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
 
 static const TableAmRoutine heapam_methods;
@@ -1154,6 +1159,173 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 	return false;
 }
 
+/*
+ * As of May 2004 we use a new two-stage method:  Stage one selects up
+ * to targrows random blocks (or all blocks, if there aren't so many).
+ * Stage two scans these blocks and uses the Vitter algorithm to create
+ * a random sample of targrows rows (or less, if there are less in the
+ * sample of blocks).  The two stages are executed simultaneously: each
+ * block is processed as soon as stage one returns its number and while
+ * the rows are read stage two controls which ones are to be inserted
+ * into the sample.
+ *
+ * Although every row has an equal chance of ending up in the final
+ * sample, this sampling method is not perfect: not every possible
+ * sample has an equal chance of being selected.  For large relations
+ * the number of different blocks represented by the sample tends to be
+ * too small.  We can live with that for now.  Improvements are welcome.
+ *
+ * An important property of this sampling method is that because we do
+ * look at a statistically unbiased set of blocks, we should get
+ * unbiased estimates of the average numbers of live and dead rows per
+ * block.  The previous sampling method put too much credence in the row
+ * density near the start of the table.
+ */
+static int
+heapam_acquire_sample_rows(Relation rel, int elevel,
+						   BufferAccessStrategy bstrategy,
+						   HeapTuple *rows, int targrows,
+						   double *totalrows, double *totaldeadrows)
+{
+	int			numrows = 0;	/* # rows now in reservoir */
+	double		samplerows = 0; /* total # rows collected */
+	double		liverows = 0;	/* # live rows seen */
+	double		deadrows = 0;	/* # dead rows seen */
+	double		rowstoskip = -1;	/* -1 means not set yet */
+	BlockNumber totalblocks;
+	TransactionId OldestXmin;
+	BlockSamplerData bs;
+	ReservoirStateData rstate;
+	TupleTableSlot *slot;
+	TableScanDesc scan;
+	BlockNumber nblocks;
+	BlockNumber blksdone = 0;
+
+	scan = heap_beginscan(rel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
+
+	totalblocks = RelationGetNumberOfBlocks(scan->rs_rd);
+
+	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
+	OldestXmin = GetOldestNonRemovableTransactionId(scan->rs_rd);
+
+	/* Prepare for sampling block numbers */
+	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, random());
+
+	/* Report sampling block numbers */
+	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
+								 nblocks);
+
+	/* Prepare for sampling rows */
+	reservoir_init_selection_state(&rstate, targrows);
+
+	slot = table_slot_create(scan->rs_rd, NULL);
+
+	/* Outer loop over blocks to sample */
+	while (BlockSampler_HasMore(&bs))
+	{
+		BlockNumber targblock = BlockSampler_Next(&bs);
+
+		vacuum_delay_point();
+
+		if (!heapam_scan_analyze_next_block(scan, targblock, bstrategy))
+			continue;
+
+		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+		{
+			/*
+			 * The first targrows sample rows are simply copied into the
+			 * reservoir. Then we start replacing tuples in the sample until
+			 * we reach the end of the relation.  This algorithm is from Jeff
+			 * Vitter's paper (see full citation in utils/misc/sampling.c). It
+			 * works by repeatedly computing the number of tuples to skip
+			 * before selecting a tuple, which replaces a randomly chosen
+			 * element of the reservoir (current set of tuples).  At all times
+			 * the reservoir is a true random sample of the tuples we've
+			 * passed over so far, so when we fall off the end of the relation
+			 * we're done.
+			 */
+			if (numrows < targrows)
+				rows[numrows++] = ExecCopySlotHeapTuple(slot);
+			else
+			{
+				/*
+				 * t in Vitter's paper is the number of records already
+				 * processed.  If we need to compute a new S value, we must
+				 * use the not-yet-incremented value of samplerows as t.
+				 */
+				if (rowstoskip < 0)
+					rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
+
+				if (rowstoskip <= 0)
+				{
+					/*
+					 * Found a suitable tuple, so save it, replacing one old
+					 * tuple at random
+					 */
+					int			k = (int) (targrows * sampler_random_fract(rstate.randstate));
+
+					Assert(k >= 0 && k < targrows);
+					heap_freetuple(rows[k]);
+					rows[k] = ExecCopySlotHeapTuple(slot);
+				}
+
+				rowstoskip -= 1;
+			}
+
+			samplerows += 1;
+		}
+
+		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
+									 ++blksdone);
+	}
+
+	ExecDropSingleTupleTableSlot(slot);
+	heap_endscan(scan);
+
+	/*
+	 * If we didn't find as many tuples as we wanted then we're done. No sort
+	 * is needed, since they're already in order.
+	 *
+	 * Otherwise we need to sort the collected tuples by position
+	 * (itempointer). It's not worth worrying about corner cases where the
+	 * tuples are already sorted.
+	 */
+	if (numrows == targrows)
+		qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
+
+	/*
+	 * Estimate total numbers of live and dead rows in relation, extrapolating
+	 * on the assumption that the average tuple density in pages we didn't
+	 * scan is the same as in the pages we did scan.  Since what we scanned is
+	 * a random sample of the pages in the relation, this should be a good
+	 * assumption.
+	 */
+	if (bs.m > 0)
+	{
+		*totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
+		*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
+	}
+	else
+	{
+		*totalrows = 0.0;
+		*totaldeadrows = 0.0;
+	}
+
+	/*
+	 * Emit some interesting relation info
+	 */
+	ereport(elevel,
+			(errmsg("\"%s\": scanned %d of %u pages, "
+					"containing %.0f live rows and %.0f dead rows; "
+					"%d rows in sample, %.0f estimated total rows",
+					RelationGetRelationName(scan->rs_rd),
+					bs.m, totalblocks,
+					liverows, deadrows,
+					numrows, *totalrows)));
+
+	return numrows;
+}
+
 static double
 heapam_index_build_range_scan(Relation heapRelation,
 							  Relation indexRelation,
@@ -2526,6 +2698,30 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
 	}
 }
 
+/*
+ * qsort comparator for sorting rows[] array
+ */
+static int
+compare_rows(const void *a, const void *b)
+{
+	HeapTuple	ha = *(const HeapTuple *) a;
+	HeapTuple	hb = *(const HeapTuple *) b;
+	BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
+	OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
+	BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
+	OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
+
+	if (ba < bb)
+		return -1;
+	if (ba > bb)
+		return 1;
+	if (oa < ob)
+		return -1;
+	if (oa > ob)
+		return 1;
+	return 0;
+}
+
 
 /* ------------------------------------------------------------------------
  * Definition of the heap table access method.
@@ -2570,8 +2766,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_copy_data = heapam_relation_copy_data,
 	.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
 	.relation_vacuum = heap_vacuum_rel,
-	.scan_analyze_next_block = heapam_scan_analyze_next_block,
-	.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
+	.acquire_sample_rows = heapam_acquire_sample_rows,
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
 
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 325ecdc122..91f3b04dc4 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -87,8 +87,7 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->relation_copy_data != NULL);
 	Assert(routine->relation_copy_for_cluster != NULL);
 	Assert(routine->relation_vacuum != NULL);
-	Assert(routine->scan_analyze_next_block != NULL);
-	Assert(routine->scan_analyze_next_tuple != NULL);
+	Assert(routine->acquire_sample_rows != NULL);
 	Assert(routine->index_build_range_scan != NULL);
 	Assert(routine->index_validate_scan != NULL);
 
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7295cf0215..82b40e22dc 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -37,7 +37,6 @@
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
 #include "commands/tablecmds.h"
-#include "commands/vacuum.h"
 #include "executor/executor.h"
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
@@ -61,7 +60,6 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/pg_rusage.h"
-#include "utils/sampling.h"
 #include "utils/sortsupport.h"
 #include "utils/syscache.h"
 #include "utils/timestamp.h"
@@ -98,7 +96,6 @@ static VacAttrStats *examine_attribute(Relation onerel, int attnum,
 static int	acquire_sample_rows(Relation onerel, int elevel,
 								HeapTuple *rows, int targrows,
 								double *totalrows, double *totaldeadrows);
-static int	compare_rows(const void *a, const void *b);
 static int	acquire_inherited_sample_rows(Relation onerel, int elevel,
 										  HeapTuple *rows, int targrows,
 										  double *totalrows, double *totaldeadrows);
@@ -997,29 +994,9 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
  * We also estimate the total numbers of live and dead rows in the table,
  * and return them into *totalrows and *totaldeadrows, respectively.
  *
- * The returned list of tuples is in order by physical position in the table.
- * (We will rely on this later to derive correlation estimates.)
- *
- * As of May 2004 we use a new two-stage method:  Stage one selects up
- * to targrows random blocks (or all blocks, if there aren't so many).
- * Stage two scans these blocks and uses the Vitter algorithm to create
- * a random sample of targrows rows (or less, if there are less in the
- * sample of blocks).  The two stages are executed simultaneously: each
- * block is processed as soon as stage one returns its number and while
- * the rows are read stage two controls which ones are to be inserted
- * into the sample.
- *
- * Although every row has an equal chance of ending up in the final
- * sample, this sampling method is not perfect: not every possible
- * sample has an equal chance of being selected.  For large relations
- * the number of different blocks represented by the sample tends to be
- * too small.  We can live with that for now.  Improvements are welcome.
- *
- * An important property of this sampling method is that because we do
- * look at a statistically unbiased set of blocks, we should get
- * unbiased estimates of the average numbers of live and dead rows per
- * block.  The previous sampling method put too much credence in the row
- * density near the start of the table.
+ * AM acquire sample function MUST return tuples in the order by physical
+ * position in the table. (We will rely on this later to derive correlation
+ * estimates.)
  */
 static int
 acquire_sample_rows(Relation onerel, int elevel,
@@ -1027,169 +1004,17 @@ acquire_sample_rows(Relation onerel, int elevel,
 					double *totalrows, double *totaldeadrows)
 {
 	int			numrows = 0;	/* # rows now in reservoir */
-	double		samplerows = 0; /* total # rows collected */
-	double		liverows = 0;	/* # live rows seen */
-	double		deadrows = 0;	/* # dead rows seen */
-	double		rowstoskip = -1;	/* -1 means not set yet */
-	BlockNumber totalblocks;
-	TransactionId OldestXmin;
-	BlockSamplerData bs;
-	ReservoirStateData rstate;
-	TupleTableSlot *slot;
-	TableScanDesc scan;
-	BlockNumber nblocks;
-	BlockNumber blksdone = 0;
 
 	Assert(targrows > 0);
 
-	totalblocks = RelationGetNumberOfBlocks(onerel);
-
-	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
-
-	/* Prepare for sampling block numbers */
-	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, random());
-
-	/* Report sampling block numbers */
-	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
-								 nblocks);
-
-	/* Prepare for sampling rows */
-	reservoir_init_selection_state(&rstate, targrows);
-
-	scan = table_beginscan_analyze(onerel);
-	slot = table_slot_create(onerel, NULL);
-
-	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
-	{
-		BlockNumber targblock = BlockSampler_Next(&bs);
-
-		vacuum_delay_point();
-
-		if (!table_scan_analyze_next_block(scan, targblock, vac_strategy))
-			continue;
-
-		while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
-		{
-			/*
-			 * The first targrows sample rows are simply copied into the
-			 * reservoir. Then we start replacing tuples in the sample until
-			 * we reach the end of the relation.  This algorithm is from Jeff
-			 * Vitter's paper (see full citation in utils/misc/sampling.c). It
-			 * works by repeatedly computing the number of tuples to skip
-			 * before selecting a tuple, which replaces a randomly chosen
-			 * element of the reservoir (current set of tuples).  At all times
-			 * the reservoir is a true random sample of the tuples we've
-			 * passed over so far, so when we fall off the end of the relation
-			 * we're done.
-			 */
-			if (numrows < targrows)
-				rows[numrows++] = ExecCopySlotHeapTuple(slot);
-			else
-			{
-				/*
-				 * t in Vitter's paper is the number of records already
-				 * processed.  If we need to compute a new S value, we must
-				 * use the not-yet-incremented value of samplerows as t.
-				 */
-				if (rowstoskip < 0)
-					rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
-
-				if (rowstoskip <= 0)
-				{
-					/*
-					 * Found a suitable tuple, so save it, replacing one old
-					 * tuple at random
-					 */
-					int			k = (int) (targrows * sampler_random_fract(rstate.randstate));
-
-					Assert(k >= 0 && k < targrows);
-					heap_freetuple(rows[k]);
-					rows[k] = ExecCopySlotHeapTuple(slot);
-				}
-
-				rowstoskip -= 1;
-			}
-
-			samplerows += 1;
-		}
-
-		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
-									 ++blksdone);
-	}
-
-	ExecDropSingleTupleTableSlot(slot);
-	table_endscan(scan);
-
-	/*
-	 * If we didn't find as many tuples as we wanted then we're done. No sort
-	 * is needed, since they're already in order.
-	 *
-	 * Otherwise we need to sort the collected tuples by position
-	 * (itempointer). It's not worth worrying about corner cases where the
-	 * tuples are already sorted.
-	 */
-	if (numrows == targrows)
-		qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
-
-	/*
-	 * Estimate total numbers of live and dead rows in relation, extrapolating
-	 * on the assumption that the average tuple density in pages we didn't
-	 * scan is the same as in the pages we did scan.  Since what we scanned is
-	 * a random sample of the pages in the relation, this should be a good
-	 * assumption.
-	 */
-	if (bs.m > 0)
-	{
-		*totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
-		*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
-	}
-	else
-	{
-		*totalrows = 0.0;
-		*totaldeadrows = 0.0;
-	}
-
-	/*
-	 * Emit some interesting relation info
-	 */
-	ereport(elevel,
-			(errmsg("\"%s\": scanned %d of %u pages, "
-					"containing %.0f live rows and %.0f dead rows; "
-					"%d rows in sample, %.0f estimated total rows",
-					RelationGetRelationName(onerel),
-					bs.m, totalblocks,
-					liverows, deadrows,
-					numrows, *totalrows)));
+	numrows = table_acquire_sample_rows(onerel, elevel,
+										vac_strategy, rows,
+										targrows, totalrows,
+										totaldeadrows);
 
 	return numrows;
 }
 
-/*
- * qsort comparator for sorting rows[] array
- */
-static int
-compare_rows(const void *a, const void *b)
-{
-	HeapTuple	ha = *(const HeapTuple *) a;
-	HeapTuple	hb = *(const HeapTuple *) b;
-	BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
-	OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
-	BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
-	OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
-
-	if (ba < bb)
-		return -1;
-	if (ba > bb)
-		return 1;
-	if (oa < ob)
-		return -1;
-	if (oa > ob)
-		return 1;
-	return 0;
-}
-
 
 /*
  * acquire_inherited_sample_rows -- acquire sample rows from inheritance tree
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 33bffb6815..9d38341098 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -601,39 +601,21 @@ typedef struct TableAmRoutine
 									BufferAccessStrategy bstrategy);
 
 	/*
-	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
-	 * with table_beginscan_analyze().  See also
-	 * table_scan_analyze_next_block().
+	 * This callback needs to acquire a sample of rows from the table (for
+	 * ANALYZE). Note, that returned tuples in a sample MUST be in the order
+	 * by physical position in the table, as we rely on this fact in
+	 * acquire_sample_rows().
 	 *
-	 * The callback may acquire resources like locks that are held until
-	 * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
-	 * to hold a lock until all tuples on a block have been analyzed by
-	 * scan_analyze_next_tuple.
-	 *
-	 * The callback can return false if the block is not suitable for
-	 * sampling, e.g. because it's a metapage that could never contain tuples.
-	 *
-	 * XXX: This obviously is primarily suited for block-based AMs. It's not
-	 * clear what a good interface for non block based AMs would be, so there
-	 * isn't one yet.
+	 * Buffer ring access strategy is derived from analyze_rel() and should be
+	 * used by AMs, using shared buffers for ANALYZE.
 	 */
-	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-											BlockNumber blockno,
-											BufferAccessStrategy bstrategy);
-
-	/*
-	 * See table_scan_analyze_next_tuple().
-	 *
-	 * Not every AM might have a meaningful concept of dead rows, in which
-	 * case it's OK to not increment *deadrows - but note that that may
-	 * influence autovacuum scheduling (see comment for relation_vacuum
-	 * callback).
-	 */
-	bool		(*scan_analyze_next_tuple) (TableScanDesc scan,
-											TransactionId OldestXmin,
-											double *liverows,
-											double *deadrows,
-											TupleTableSlot *slot);
+	int			(*acquire_sample_rows) (Relation onerel,
+										int elevel,
+										BufferAccessStrategy bstrategy,
+										HeapTuple *rows,
+										int targrows,
+										double *totalrows,
+										double *totaldeadrows);
 
 	/* see table_index_build_range_scan for reference about parameters */
 	double		(*index_build_range_scan) (Relation table_rel,
@@ -942,19 +924,6 @@ table_beginscan_tid(Relation rel, Snapshot snapshot)
 	return rel->rd_tableam->scan_begin(rel, snapshot, 0, NULL, NULL, flags);
 }
 
-/*
- * table_beginscan_analyze is an alternative entry point for setting up a
- * TableScanDesc for an ANALYZE scan.  As with bitmap scans, it's worth using
- * the same data structure although the behavior is rather different.
- */
-static inline TableScanDesc
-table_beginscan_analyze(Relation rel)
-{
-	uint32		flags = SO_TYPE_ANALYZE;
-
-	return rel->rd_tableam->scan_begin(rel, NULL, 0, NULL, NULL, flags);
-}
-
 /*
  * End relation scan.
  */
@@ -1591,40 +1560,16 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 	rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`. The scan needs to have been
- * started with table_beginscan_analyze().  Note that this routine might
- * acquire resources like locks that are held until
- * table_scan_analyze_next_tuple() returns false.
- *
- * Returns false if block is unsuitable for sampling, true otherwise.
- */
-static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							  BufferAccessStrategy bstrategy)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
-															bstrategy);
-}
-
-/*
- * Iterate over tuples in the block selected with
- * table_scan_analyze_next_block() (which needs to have returned true, and
- * this routine may not have returned false for the same block before). If a
- * tuple that's suitable for sampling is found, true is returned and a tuple
- * is stored in `slot`.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-static inline bool
-table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
-							  double *liverows, double *deadrows,
-							  TupleTableSlot *slot)
+static inline int
+table_acquire_sample_rows(Relation rel, int elevel,
+						  BufferAccessStrategy bstrategy,
+						  HeapTuple *rows, int targrows,
+						  double *totalrows, double *totaldeadrows)
 {
-	return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
-															liverows, deadrows,
-															slot);
+	return rel->rd_tableam->acquire_sample_rows(rel, elevel,
+													 bstrategy, rows,
+													 targrows, totalrows,
+													 totaldeadrows);
 }
 
 /*
-- 
2.24.3 (Apple Git-128)

#9Zhihong Yu
zyu@yugabyte.com
In reply to: Denis Smirnov (#8)
Re: PoC Refactor AM analyse API

Hi,

+ *totalrows = floor((liverows / bs.m) * totalblocks + 0.5);

Is the above equivalent to:

+ *totalrows = ceil((liverows / bs.m) * totalblocks);

For compare_rows(), it seems the computation of oa and ob can be delayed
until ba == bb (after the first two if statements).
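Concretely, I was thinking of something like this (an untested sketch of the patch's compare_rows()):

static int
compare_rows(const void *a, const void *b)
{
    HeapTuple   ha = *(const HeapTuple *) a;
    HeapTuple   hb = *(const HeapTuple *) b;
    BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
    BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);

    if (ba < bb)
        return -1;
    if (ba > bb)
        return 1;

    /* Blocks are equal, so only now extract the offsets. */
    {
        OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
        OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);

        if (oa < ob)
            return -1;
        if (oa > ob)
            return 1;
    }

    return 0;
}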

Cheers

On Thu, Feb 18, 2021 at 6:06 PM Denis Smirnov <sd@arenadata.io> wrote:

Show quoted text


#10Denis Smirnov
sd@arenadata.io
In reply to: Zhihong Yu (#9)
1 attachment(s)
Re: PoC Refactor AM analyse API

Hello, Zhihong.

Thanks for your comments.

1. I am afraid that the equivalence of "floor(val + 0.5)" to "ceil(val)" is incorrect: "floor(2.1 + 0.5)" returns 2 while "ceil(2.1)" returns 3. We can't use the replacement you have suggested (a trivial check below illustrates the difference).

2. >> For compare_rows(), it seems the computation of oa and ob can be delayed until ba == bb (after the first two if statements).
I have checked some examples of the ASM code generated by different compilers with -O2/-O3 flags on https://gcc.godbolt.org and didn't see any big benefit in the resulting CPU instructions. You can check the attached example below yourself.
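
For the first point, a trivial standalone check (compile with -lm):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      val = 2.1;

    /* round-half-up, as used for the totalrows estimate */
    printf("floor(%g + 0.5) = %.0f\n", val, floor(val + 0.5));  /* prints 2 */
    /* ceil always rounds up */
    printf("ceil(%g) = %.0f\n", val, ceil(val));                /* prints 3 */

    return 0;
}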

Attachments:

simplified_compare_rows.capplication/octet-stream; name=simplified_compare_rows.c; x-unix-mode=0644Download
#11Zhihong Yu
zyu@yugabyte.com
In reply to: Denis Smirnov (#10)
Re: PoC Refactor AM analyse API

Denis:
Thanks for considering my suggestion.

For #1, I didn't take your example into account. Thanks for pointing that
out.

Cheers

On Thu, Feb 18, 2021 at 11:59 PM Denis Smirnov <sd@arenadata.io> wrote:

Show quoted text


#12Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Denis Smirnov (#8)
Re: PoC Refactor AM analyse API

On Fri, Feb 19, 2021 at 12:06:12PM +1000, Denis Smirnov wrote:

Show quoted text

Hi Denis,

This doesn't apply anymore because of c6fc50c; can you resubmit a new
patch?

Please note that the patch must be submitted with a .patch extension
instead of .txt; that way the CI at http://commitfest.cputube.org/ will
be able to run automatic tests on it.

Regards,

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL

#13Michael Paquier
michael@paquier.xyz
In reply to: Jaime Casanova (#12)
Re: PoC Refactor AM analyse API

On Wed, Sep 08, 2021 at 10:06:25AM -0500, Jaime Casanova wrote:

This doesn't apply anymore because of c6fc50c; can you resubmit a new
patch?

Activity has stalled here, so I have marked the entry as RwF in the CF
app.
--
Michael

#14Denis Smirnov
sd@arenadata.io
In reply to: Michael Paquier (#13)
Re: PoC Refactor AM analyse API

I think this patch should be totally redesigned and removed from the upcoming CF. The problem is that vanilla PG has a single storage manager implementation that works with fixed-size blocks, and the current patch didn't take this aspect into account. We should first decide whether PG needs the ability to implement custom storage managers with variable-size blocks and custom block buffers (or without any buffers at all, for OLAP), and only after that should we move on to variable-size block analyze.

Best regards,
Denis Smirnov | Developer
sd@arenadata.io
Arenadata | Bldg. 3, Block 1 Skladochnaya St. Moscow, 127018