Extracting only the columns needed for a query
While hacking on zedstore, we needed to get a list of the columns to
be projected--basically all of the columns needed to satisfy the
query. The two use cases we have for this are
1) to pass this column list down to the AM layer for the AM to leverage it
2) for use during planning to improve costing
In other threads, such as [1], there has been discussion about the
possible benefits for all table types of having access to this set of
columns.
Focusing on how to get this used cols list (as opposed to how to pass
it down to the AM), we have tried a few approaches to constructing it
and wanted to get some ideas on how best to do it.
We are trying to determine in which phase to get the columns -- after
parsing, after planning, or during execution right before calling the
AM.
Approach A: right before calling AM
Leverage expression_tree_walker() right before calling beginscan()
and collecting the columns into a needed columns context. This
approach is what is currently in the zedstore patch mentioned in
this thread [2].
The benefit of this approach is that it walks the tree right
before the used column set will be used--which makes it easy to
skip this walk for queries or AMs that don't benefit from this
used columns list.
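As a rough illustration of Approach A, here is a minimal, self-contained sketch of such a walker. The node layout and names below are hypothetical stand-ins for PostgreSQL's real Node/Var types and for expression_tree_walker(); the real patch walks actual planner expression trees:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for PostgreSQL's Node and Var types */
typedef enum { T_Var, T_OpExpr } NodeTag;

typedef struct Node
{
    NodeTag      tag;
    int          varattno;   /* valid when tag == T_Var */
    struct Node *args[2];    /* valid when tag == T_OpExpr */
} Node;

/*
 * Recursively collect referenced attribute numbers into a bitmask,
 * analogous to walking the tree with expression_tree_walker() and a
 * "needed columns" context right before beginscan().
 */
static void
collect_used_attnos(const Node *node, uint64_t *mask)
{
    if (node == NULL)
        return;

    if (node->tag == T_Var)
    {
        if (node->varattno >= 0)
            *mask |= UINT64_C(1) << node->varattno;
        return;
    }

    /* recurse into child expressions */
    collect_used_attnos(node->args[0], mask);
    collect_used_attnos(node->args[1], mask);
}
```

A scan node would run this over its targetlist and quals and hand the resulting set to the AM.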
Approach B: after parsing and/or after planning
Add a new member 'used_cols' to PlannedStmt which contains the
attributes for each relation present in the query. Construct
'used_cols' at the end of planning using the PathTargets in the
RelOptInfos in the PlannerInfo->simple_rel_array and the
RangeTblEntries in PlannerInfo->simple_rte_array.
The nice thing about this is that it does not require a full walk
of the plan tree. Approach A could be more expensive if the tree
is quite large. I'm not sure, however, if just getting the
PathTargets from the RelOptInfos is sufficient for obtaining the
whole set of columns used in the query.
Approach B, however, does not work for utility statements which do
not go through planning.
One potential solution to this that we tried was getting the
columns from the query tree after parse analyze and then in
exec_simple_query() adding the column list to the PlannedStmt.
This turned out to be as messy as, or messier than, Approach A because
each kind of utility statement has its own data structure that is
composed of elements taken from the Query tree but does not
directly include the original PlannedStmt created for the query
(the PlannedStmt doesn't contain anything except the query tree
for utility statements since they do not go through planning). So,
for each type of utility statement, we would have to separately
copy over the column list from the PlannedStmt in its own way.
It is worth noting that Approach A also requires special handling
for each type of utility statement.
We are wondering about specific benefits of Approach B--that is, is
there some use case (maybe plan re-use) for having the column set
accessible in the PlannedStmt itself?
One potential benefit of Approach B could be for scans of partition
tables. Collecting the used column list could be done once for the
query instead of once for each partition.
Neither approach, however, addresses our second use case, as we
would not have the column list during planning for non-utility
statements. To satisfy this, we would likely have to extract the
columns from the query tree after parse analysis for non-utility
statements as well.
An approach which extracted this list before planning and saved it
somewhere would help avoid having to do the same walk during planning
and then again during execution. Though, using the list constructed
after parsing may not be ideal when some columns were able to be
eliminated during planning.
Melanie & Ashwin
[1]: /messages/by-id/20190409010440.bqdikgpslh42pqit@alap3.anarazel.de
[2]: /messages/by-id/CALfoeiuuLe12PuQ=zvM_L7B5qegBqGHYENHUGbLOsjAnG=mi4A@mail.gmail.com
On Sat, 15 Jun 2019 at 13:46, Melanie Plageman
<melanieplageman@gmail.com> wrote:
While hacking on zedstore, we needed to get a list of the columns to
be projected--basically all of the columns needed to satisfy the
query. The two use cases we have for this are
1) to pass this column list down to the AM layer for the AM to leverage it
2) for use during planning to improve costing
In other threads, such as [1], there has been discussion about the
possible benefits for all table types of having access to this set of
columns.

Focusing on how to get this used cols list (as opposed to how to pass
it down to the AM), we have tried a few approaches to constructing it
and wanted to get some ideas on how best to do it.

We are trying to determine in which phase to get the columns -- after
parsing, after planning, or during execution right before calling the
AM.
For planning, isn't this information already available via either
targetlists or from the RelOptInfo->attr_needed array combined with
min_attr and max_attr?
If it's going to be too much overhead to extract vars from the
targetlist during executor startup then maybe something can be done
during planning to set a new Bitmapset field of attrs in
RangeTblEntry. Likely that can be built easily by looking at
attr_needed in RelOptInfo. Parse is too early to set this as which
attributes are needed can change during planning. For example, look at
remove_rel_from_query() in analyzejoins.c. If you don't need access to
this during planning then maybe setrefs.c is a good place to set
something. Although, it would be nice not to add this overhead when
the executor does not need this information. I'm unsure how the
planner could know that though, other than by having another tableam
callback function to tell it.
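To illustrate the suggestion of building a Bitmapset from attr_needed, here is a hedged, self-contained sketch. Note this is deliberately simplified: the real RelOptInfo->attr_needed holds a Relids set per attribute and is indexed from min_attr; this version just treats any non-zero entry as "needed" and uses a plain bitmask in place of a Bitmapset:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified sketch (not the real planner API): attr_needed[attno - min_attr]
 * is non-zero when some part of the query still needs that attribute.
 * Negative attnos (system columns) are skipped here for brevity.
 */
static uint64_t
attrs_from_attr_needed(const int *attr_needed, int min_attr, int max_attr)
{
    uint64_t mask = 0;

    for (int attno = min_attr; attno <= max_attr; attno++)
    {
        if (attno >= 0 && attr_needed[attno - min_attr] != 0)
            mask |= UINT64_C(1) << attno;
    }
    return mask;
}
```

Since remove_rel_from_query() and friends can shrink the needed set during join removal, running something like this late (e.g. in setrefs.c) would capture the post-planning answer.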
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Melanie Plageman <melanieplageman@gmail.com> writes:
While hacking on zedstore, we needed to get a list of the columns to
be projected--basically all of the columns needed to satisfy the
query. The two use cases we have for this are
1) to pass this column list down to the AM layer for the AM to leverage it
2) for use during planning to improve costing
In other threads, such as [1], there has been discussion about the
possible benefits for all table types of having access to this set of
columns.
The thing that most approaches to this have fallen down on is triggers ---
that is, a trigger function might access columns mentioned nowhere in the
SQL text. (See 8b6da83d1 for a recent example :-() If you have a plan
for dealing with that, then ...
Approach B: after parsing and/or after planning
If we wanted to do something about this, making the planner record
the set of used columns seems like the thing to do. We could avoid
the expense of doing it when it's not needed by setting up an AM/FDW/
etc property or callback to request it.
Another reason for having the planner do this is that presumably, in
an AM that's excited about this, the set of fetched columns should
play into the cost estimates for the scan. I've not been paying
enough attention to the tableam work to know if we've got hooks for
the AM to affect scan costing ... but if we don't, that seems like
a hole that needs plugged.
I'm not sure, however, if just getting the
PathTargets from the RelOptInfos is sufficient for obtaining the
whole set of columns used in the query.
The PathTarget only records the columns that need to be *emitted*
by the relation scan plan node. You'd also have to look at the
baserestrictinfo conditions to see what columns are inspected by
filter conditions. But that seems fine to me anyway since the AM
might conceivably have different requirements for filter variables than
emitted variables. (As a concrete example, filter conditions that are
handled by non-lossy index checks might not give rise to any requirement
for the table AM to do anything.) So that line of thought confirms that
we want to do this at the end of planning when we know the shape of the
plan tree; the parser can't do it usefully.
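A hypothetical sketch of that union, including the convention (used later in the patch set) that a whole-row Var sets bit 0 and forces every column. The names and the plain-bitmask representation are illustrative, not the real Bitmapset API:

```c
#include <assert.h>
#include <stdint.h>

#define WHOLE_ROW_BIT (UINT64_C(1) << 0)   /* attno 0 = whole-row reference */

/*
 * Union the columns a scan must emit (from the PathTarget) with the columns
 * its filter quals inspect (from baserestrictinfo).  If either set contains
 * the whole-row bit, the scan needs every column of the relation.
 */
static uint64_t
scan_cols_union(uint64_t target_cols, uint64_t qual_cols, int natts)
{
    uint64_t combined = target_cols | qual_cols;

    if (combined & WHOLE_ROW_BIT)
    {
        uint64_t all = WHOLE_ROW_BIT;

        for (int attno = 1; attno <= natts; attno++)
            all |= UINT64_C(1) << attno;
        return all;
    }
    return combined;
}
```

Keeping the two inputs separate up to this point would also let an AM treat filter columns differently from emitted columns, per the point above.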
Approach B, however, does not work for utility statements which do
not go through planning.
I'm not sure why you're excited about that case? Utility statements
tend to be pretty much all-or-nothing as far as data access goes.
regards, tom lane
The thing that most approaches to this have fallen down on is triggers ---
that is, a trigger function might access columns mentioned nowhere in the
SQL text. (See 8b6da83d1 for a recent example :-() If you have a plan
for dealing with that, then ...
Well, if we had a trigger language that compiled to <something> at creation
time, and that trigger didn't do any dynamic/eval code, we could store
which attributes and rels were touched inside the trigger.
I'm not sure if that trigger language would be sql, plpgsql with a
"compile" pragma, or maybe we exhume PSM, but it could have some side
benefits:
1. This same issue haunts any attempts at refactoring triggers and
referential integrity, so narrowing the scope of what a trigger touches
will help there too
2. additional validity checks
3. (this is an even bigger stretch) possibly a chance to combine multiple
triggers into one statement, or combine multiple row-based triggers into a
statement level trigger
Of course, this all falls apart with one dynamic SQL or one SELECT *, but
it would be incentive for the users to refactor code to not do things that
impede trigger optimization.
On Sat, Jun 15, 2019 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Approach B: after parsing and/or after planning
If we wanted to do something about this, making the planner record
the set of used columns seems like the thing to do. We could avoid
the expense of doing it when it's not needed by setting up an AM/FDW/
etc property or callback to request it.
Sounds good. In the Zedstore patch, we have added an AM property to convey
that the AM leverages column projection, and we currently skip the
physical-tlist optimization based on it. So, yes, it can similarly be
leveraged for other planning needs.
Another reason for having the planner do this is that presumably, in
an AM that's excited about this, the set of fetched columns should
play into the cost estimates for the scan. I've not been paying
enough attention to the tableam work to know if we've got hooks for
the AM to affect scan costing ... but if we don't, that seems like
a hole that needs plugged.
The AM callback relation_estimate_size already exists and is leveraged by
the planner; via this callback it fetches tuples, pages, etc. So, our
thought is to extend this API, if possible, to pass down the needed columns
and help perform better costing for the query. Though, if we wish to
leverage this function, we would need to know the list of columns before
planning, and hence might need to use the query tree.
Approach B, however, does not work for utility statements which do
not go through planning.

I'm not sure why you're excited about that case? Utility statements
tend to be pretty much all-or-nothing as far as data access goes.

Statements like COPY, CREATE INDEX, CREATE CONSTRAINTS, etc. can benefit
from scanning only a subset of columns. For example, in Zedstore, for
CREATE INDEX we currently extract the needed columns by walking
indexInfo->ii_Predicate and indexInfo->ii_Expressions. For COPY, we
currently use cstate->attnumlist to know which columns need to be
scanned.
We implemented Approach B in the attached patch set (patch 0001) and
then implemented Approach A (patch 0002) to sanity check the pruned
list of columns to scan we were getting at plan-time.
We emit a notice in SeqNext() if the two column sets differ.
Currently, for all of the queries in the regression suite, no
differences are found.
We did notice that none of the current regression tests for triggers
are showing a difference between columns that can be extracted at plan
time and those that can be extracted at execution time--though we
haven't dug into this at all.
In our implementation of Approach B, we extract the columns to scan in
make_one_rel() after set_base_rel_sizes() and before
set_base_rel_pathlists() so that the scan cols can be used in costing.
We do it after calling set_base_rel_sizes() because it eventually
calls set_append_rel_size() which adds PathTarget exprs for the
partitions with Vars having the correct varattno and varno.
We added the scanCols to RangeTblEntries because it seemed like the
easiest way to make sure the information was available at scan time
(as suggested by David).
A quirk in the patch for Approach B is that, in inheritance_planner(),
we copy over the scanCols which we populated in each subroot's rtable
to the final rtable.
The rtables for all of the subroots seem to be basically unused after
finishing with inheritance_planner(). That is, many other members of
the modified child PlannerInfos are copied over to the root
PlannerInfo, but the rtables seem to be an exception.
If we wanted to get at them later, we would have to go rooting
around in the pathlists of the RelOptInfos, which seemed very complex.
One note: we did not add any logic to make the extraction of scan
columns conditional on the AM. We have code to do it conditionally in
the zedstore patch, but we made it generic here.
While we were implementing this, we found that in the columns
extracted at plan-time, there were excess columns for
UPDATE/DELETE...RETURNING on partition tables.
Vars for these columns are being added to the targetlist in
preprocess_targetlist(). Eventually targetlist items are added to
RelOptInfo->reltarget exprs.
However, when result_relation is 0, all of the vars from the
returningList will be added to the targetlist, which seems
incorrect. We included a patch (0003) to only add the vars if
result_relation is not 0.
While adding scanCols to RangeTblEntry, we noticed that a few other
members of RangeTblEntry are consistently initialized to NULL whenever
a new RangeTblEntry is being made.
This is confusing because makeNode() for a RangeTblEntry does
palloc0() anyway, so, why is this necessary?
If it is necessary, why not create some kind of generic initialization
function to do this?
Thanks,
Ashwin & Melanie
Attachments:
Attachment: v1-0002-Execution-time-scan-col-extraction-and-comparison.patch (text/x-patch)
From 3b8515aaea164174d18ef2450b19478de8e64298 Mon Sep 17 00:00:00 2001
From: csteam <mplageman@pivotal.io>
Date: Tue, 16 Jul 2019 16:19:49 -0700
Subject: [PATCH v1 2/3] Execution-time scan col extraction and comparison with
plan-time cols
Extract the columns needed to scan directly before scanning from the
ScanState and compare this set of columns with those extracted at
plan-time.
In regress, there are two main queries where there is a difference in
the columns extracted at plan time vs execution time. They are both due
to the fact that UPDATE/DELETE on partition tables adds the contents of
the returningList to the PathTargets in the RelOptInfos. This manifests
as a difference in the column sets.
---
src/backend/executor/execScan.c | 63 ++++++++++++++++++++++++++++++
src/backend/executor/nodeSeqscan.c | 24 ++++++++++++
src/include/executor/executor.h | 4 ++
3 files changed, 91 insertions(+)
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index c0e4a5376c..89146693a3 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -20,6 +20,7 @@
#include "executor/executor.h"
#include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
#include "utils/memutils.h"
@@ -300,3 +301,65 @@ ExecScanReScan(ScanState *node)
}
}
}
+
+typedef struct neededColumnContext
+{
+ Bitmapset **mask;
+ int n;
+} neededColumnContext;
+
+static bool
+neededColumnContextWalker(Node *node, neededColumnContext *c)
+{
+ if (node == NULL)
+ return false;
+
+ if (IsA(node, Var))
+ {
+ Var *var = (Var *)node;
+
+ if (var->varattno >= 0)
+ {
+ Assert(var->varattno <= c->n);
+ *(c->mask) = bms_add_member(*(c->mask), var->varattno);
+ }
+
+ return false;
+ }
+ return expression_tree_walker(node, neededColumnContextWalker, (void * )c);
+}
+
+/*
+ * n specifies the number of allowed entries in mask: we use
+ * it for bounds-checking in the walker above.
+ */
+void
+PopulateNeededColumnsForNode(Node *expr, int n, Bitmapset **scanCols)
+{
+ neededColumnContext c;
+
+ c.mask = scanCols;
+ c.n = n;
+
+ neededColumnContextWalker(expr, &c);
+}
+
+Bitmapset *
+PopulateNeededColumnsForScan(ScanState *scanstate, int ncol)
+{
+ Bitmapset *result = NULL;
+ Plan *plan = scanstate->ps.plan;
+
+ PopulateNeededColumnsForNode((Node *) plan->targetlist, ncol, &result);
+ PopulateNeededColumnsForNode((Node *) plan->qual, ncol, &result);
+
+ if (IsA(plan, IndexScan))
+ {
+ PopulateNeededColumnsForNode((Node *) ((IndexScan *) plan)->indexqualorig, ncol, &result);
+ PopulateNeededColumnsForNode((Node *) ((IndexScan *) plan)->indexorderbyorig, ncol, &result);
+ }
+ else if (IsA(plan, BitmapHeapScan))
+ PopulateNeededColumnsForNode((Node *) ((BitmapHeapScan *) plan)->bitmapqualorig, ncol, &result);
+
+ return result;
+}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 436b43f8ca..c4b6a35554 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -64,6 +64,30 @@ SeqNext(SeqScanState *node)
if (scandesc == NULL)
{
+ /*
+ * Print used cols extracted during planning as well as
+ * used cols extracted now from the ScanState
+ */
+ int plan_col_num;
+ Bitmapset *execution_cols = NULL;
+ Scan *planNode = (Scan *)node->ss.ps.plan;
+ int ncols = node->ss.ss_currentRelation->rd_att->natts;
+ int rti = planNode->scanrelid;
+ RangeTblEntry *rangeTblEntry = list_nth(estate->es_plannedstmt->rtable, rti - 1);
+
+ Bitmapset *plan_cols = rangeTblEntry->scanCols;
+#ifdef USE_ASSERT_CHECKING
+ plan_col_num = -1;
+ while ((plan_col_num = bms_next_member(plan_cols, plan_col_num)) >= 0)
+ Assert(plan_col_num <= ncols);
+#endif
+ execution_cols = PopulateNeededColumnsForScan(&node->ss, ncols);
+
+ if (bms_is_empty(bms_difference(plan_cols, execution_cols)) == false)
+ elog(NOTICE, "table: %s.\n exec-time cols: %s\n plan-time cols: %s",
+ RelationGetRelationName(node->ss.ss_currentRelation),
+ bmsToString(execution_cols),
+ bmsToString(plan_cols));
+
/*
* We reach here if the scan is not parallel, or if we're serially
* executing a scan that was planned to be parallel.
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 1fb28b4596..310fa2c1bb 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -596,5 +596,9 @@ extern void CheckCmdReplicaIdentity(Relation rel, CmdType cmd);
extern void CheckSubscriptionRelkind(char relkind, const char *nspname,
const char *relname);
+extern void
+PopulateNeededColumnsForNode(Node *expr, int n, Bitmapset **scanCols);
+extern Bitmapset *
+PopulateNeededColumnsForScan(ScanState *scanstate, int ncol);
#endif /* EXECUTOR_H */
--
2.22.0
Attachment: v1-0001-Plan-time-extraction-of-scan-cols.patch (text/x-patch)
From 09c9bc8262df05ad178ac6b213ea215e73e3a881 Mon Sep 17 00:00:00 2001
From: csteam <mplageman@pivotal.io>
Date: Fri, 7 Jun 2019 17:26:29 -0700
Subject: [PATCH v1 1/3] Plan-time extraction of scan cols
Extract columns from query and save in the RangeTblEntry which
corresponds to the relation that will eventually be scanned.
Do this directly before costing so that this pruned list of columns
could potentially be used in costing calculations
Note that this patch does not use the scanCols.
---
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/equalfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/allpaths.c | 60 ++++++++++++++++++++++++++-
src/backend/optimizer/plan/planner.c | 12 ++++++
src/backend/parser/parse_relation.c | 9 ++++
src/backend/rewrite/rewriteHandler.c | 2 +
src/include/nodes/parsenodes.h | 7 ++++
9 files changed, 93 insertions(+), 1 deletion(-)
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 6414aded0e..2b97ba687a 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2388,6 +2388,7 @@ _copyRangeTblEntry(const RangeTblEntry *from)
COPY_BITMAPSET_FIELD(insertedCols);
COPY_BITMAPSET_FIELD(updatedCols);
COPY_BITMAPSET_FIELD(extraUpdatedCols);
+ COPY_BITMAPSET_FIELD(scanCols);
COPY_NODE_FIELD(securityQuals);
return newnode;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 4f2ebe5118..a57788f60a 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2668,6 +2668,7 @@ _equalRangeTblEntry(const RangeTblEntry *a, const RangeTblEntry *b)
COMPARE_BITMAPSET_FIELD(insertedCols);
COMPARE_BITMAPSET_FIELD(updatedCols);
COMPARE_BITMAPSET_FIELD(extraUpdatedCols);
+ COMPARE_BITMAPSET_FIELD(scanCols);
COMPARE_NODE_FIELD(securityQuals);
return true;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 8e31fae47f..e015388612 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -3097,6 +3097,7 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node)
WRITE_BITMAPSET_FIELD(insertedCols);
WRITE_BITMAPSET_FIELD(updatedCols);
WRITE_BITMAPSET_FIELD(extraUpdatedCols);
+ WRITE_BITMAPSET_FIELD(scanCols);
WRITE_NODE_FIELD(securityQuals);
}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 6c2626ee62..3ac9d67e7a 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1431,6 +1431,7 @@ _readRangeTblEntry(void)
READ_BITMAPSET_FIELD(insertedCols);
READ_BITMAPSET_FIELD(updatedCols);
READ_BITMAPSET_FIELD(extraUpdatedCols);
+ READ_BITMAPSET_FIELD(scanCols);
READ_NODE_FIELD(securityQuals);
READ_DONE();
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index e9ee32b7f4..4c437e2da1 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -142,7 +142,7 @@ static void recurse_push_qual(Node *setOp, Query *topquery,
RangeTblEntry *rte, Index rti, Node *qual);
static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel);
-
+static void extract_used_columns(PlannerInfo *root);
/*
* make_one_rel
* Finds all possible access paths for executing a query, returning a
@@ -184,6 +184,8 @@ make_one_rel(PlannerInfo *root, List *joinlist)
*/
set_base_rel_sizes(root);
+ extract_used_columns(root);
+
/*
* We should now have size estimates for every actual table involved in
* the query, and we also know which if any have been deleted from the
@@ -234,6 +236,62 @@ make_one_rel(PlannerInfo *root, List *joinlist)
return rel;
}
+static void
+extract_used_columns(PlannerInfo *root)
+{
+ for (int i = 1; i < root->simple_rel_array_size; i++)
+ {
+ ListCell *lc;
+ RangeTblEntry *rte = root->simple_rte_array[i];
+ RelOptInfo *rel = root->simple_rel_array[i];
+
+ if (rte == NULL)
+ continue;
+
+ if (rel == NULL)
+ continue;
+
+ rte->scanCols = NULL;
+
+ foreach(lc, rel->reltarget->exprs)
+ {
+ Node *node;
+ List *vars;
+ ListCell *lc1;
+ node = lfirst(lc);
+ /*
+ * TODO: suggest a default for vars_only to make maintenance less burdensome
+ */
+ vars = pull_var_clause(node,
+ PVC_RECURSE_AGGREGATES |
+ PVC_RECURSE_WINDOWFUNCS |
+ PVC_RECURSE_PLACEHOLDERS);
+ foreach(lc1, vars)
+ {
+ Var *var = lfirst(lc1);
+ if (var->varno == i && var->varattno >= 0)
+ rte->scanCols = bms_add_member(rte->scanCols, var->varattno);
+ }
+ }
+
+ foreach(lc, rel->baserestrictinfo)
+ {
+ RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+ List *vars = pull_var_clause((Node *)rinfo->clause,
+ PVC_RECURSE_AGGREGATES |
+ PVC_RECURSE_WINDOWFUNCS |
+ PVC_RECURSE_PLACEHOLDERS);
+ ListCell *lc1;
+ foreach(lc1, vars)
+ {
+ Var *var = lfirst(lc1);
+ if (var->varno == i && var->varattno >= 0)
+ rte->scanCols = bms_add_member(rte->scanCols, var->varattno);
+ }
+ }
+ }
+}
+
/*
* set_base_rel_consider_startup
* Set the consider_[param_]startup flags for each base-relation entry.
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index ca3b7f29e1..4f9e319d38 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1477,6 +1477,9 @@ inheritance_planner(PlannerInfo *root)
RelOptInfo *sub_final_rel;
Path *subpath;
+ ListCell *listCell;
+ int rti;
+
/*
* expand_inherited_rtentry() always processes a parent before any of
* that parent's children, so the parent query for this relation
@@ -1680,6 +1683,15 @@ inheritance_planner(PlannerInfo *root)
/* Build list of modified subroots, too */
subroots = lappend(subroots, subroot);
+ rti = 0;
+ foreach(listCell, subroot->parse->rtable)
+ {
+ RangeTblEntry *subroot_rte = lfirst(listCell);
+ RangeTblEntry *finalroot_rte = list_nth(final_rtable, rti);
+ if (finalroot_rte != subroot_rte)
+ finalroot_rte->scanCols = bms_union(finalroot_rte->scanCols, subroot_rte->scanCols);
+ rti++;
+ }
/* Build list of target-relation RT indexes */
resultRelations = lappend_int(resultRelations, appinfo->child_relid);
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 4dd81507a7..a6c51083b6 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -1271,6 +1271,7 @@ addRangeTableEntry(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -1343,6 +1344,7 @@ addRangeTableEntryForRelation(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -1423,6 +1425,7 @@ addRangeTableEntryForSubquery(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -1687,6 +1690,7 @@ addRangeTableEntryForFunction(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -1751,6 +1755,7 @@ addRangeTableEntryForTableFunc(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -1830,6 +1835,7 @@ addRangeTableEntryForValues(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -1901,6 +1907,7 @@ addRangeTableEntryForJoin(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -2004,6 +2011,7 @@ addRangeTableEntryForCTE(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -2110,6 +2118,7 @@ addRangeTableEntryForENR(ParseState *pstate,
rte->requiredPerms = 0;
rte->checkAsUser = InvalidOid;
rte->selectedCols = NULL;
+ rte->scanCols = NULL;
/*
* Add completed RTE to pstate's range table list, but not to join list
diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index 5b047d1662..d17dd49446 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -1640,6 +1640,7 @@ ApplyRetrieveRule(Query *parsetree,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
/*
* For the most part, Vars referencing the view should remain as
@@ -1748,6 +1749,7 @@ ApplyRetrieveRule(Query *parsetree,
rte->insertedCols = NULL;
rte->updatedCols = NULL;
rte->extraUpdatedCols = NULL;
+ rte->scanCols = NULL;
return parsetree;
}
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 94ded3c135..84f2d2250c 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1099,7 +1099,14 @@ typedef struct RangeTblEntry
Bitmapset *insertedCols; /* columns needing INSERT permission */
Bitmapset *updatedCols; /* columns needing UPDATE permission */
Bitmapset *extraUpdatedCols; /* generated columns being updated */
+ /*
+ * Columns to be scanned.
+ * If the 0th element is set due to varattno == 0
+ * that means all columns must be scanned, so handle this at scan-time
+ */
+ Bitmapset *scanCols;
List *securityQuals; /* security barrier quals to apply, if any */
+
} RangeTblEntry;
/*
--
2.22.0
Attachment: v1-0003-Avoid-adding-returningList-for-invalid-result_rel.patch (text/x-patch)
From b565386600cbd84087dc43b0ea56dc038b2562fe Mon Sep 17 00:00:00 2001
From: csteam <mplageman@pivotal.io>
Date: Tue, 16 Jul 2019 18:12:29 -0700
Subject: [PATCH v1 3/3] Avoid adding returningList for invalid result_relation
number
When result_relation is 0, this loop will add all Vars from the
returningList to the targetlist, which seems incorrect.
---
src/backend/optimizer/prep/preptlist.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 792ae393d9..24da3bfb6c 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -192,7 +192,7 @@ preprocess_targetlist(PlannerInfo *root)
* belong to the result rel don't need to be added, because they will be
* made to refer to the actual heap tuple.
*/
- if (parse->returningList && list_length(parse->rtable) > 1)
+ if (result_relation && parse->returningList && list_length(parse->rtable) > 1)
{
List *vars;
ListCell *l;
--
2.22.0
On Sat, Jun 15, 2019 at 10:01 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Melanie Plageman <melanieplageman@gmail.com> writes:
While hacking on zedstore, we needed to get a list of the columns to
be projected--basically all of the columns needed to satisfy the
query. The two use cases we have for this are
1) to pass this column list down to the AM layer for the AM to leverage it
2) for use during planning to improve costing
In other threads, such as [1], there has been discussion about the
possible benefits for all table types of having access to this set of
columns.

The thing that most approaches to this have fallen down on is triggers ---
that is, a trigger function might access columns mentioned nowhere in the
SQL text. (See 8b6da83d1 for a recent example :-() If you have a plan
for dealing with that, then ...
For triggers, there's not much we can do since we don't know what
columns the trigger function requires. All of the columns will have to
be scanned for GetTupleForTrigger().
The initial scan can still use the scanCols, though.
Currently, when we use the scanCols, triggers work because
GetTupleForTrigger() will call table_tuple_lock(). table_tuple_lock()
expects the return slot to be populated with the latest fetched
tuple--which will have all of the columns.
So, you don't get any kind of optimization, but, really, with the
current TRIGGER/FUNCTION syntax, it doesn't seem like we could do
better than that.
Ashwin & Melanie
On Tue, Jul 16, 2019 at 06:49:10PM -0700, Melanie Plageman wrote:
We implemented Approach B in the attached patch set (patch 0001) and
then implemented Approach A (patch 0002) to sanity check the pruned
list of columns to scan we were getting at plan-time.
We emit a notice in SeqNext() if the two column sets differ.
Currently, for all of the queries in the regression suite, no
differences are found.
We did notice that none of the current regression tests for triggers
are showing a difference between columns that can be extracted at plan
time and those that can be extracted at execution time--though we
haven't dug into this at all.
Thanks for the patch! If I understand correctly from this thread,
approach B is more preferable, so I've concentrated on the patch 0001
and have a few commentaries/questions:
* The idea is to collect columns that are being used for selects/updates
(where it makes sense for columnar storage to avoid extra work), do I see it
right? If it's the case, then scanCols could be a bit misleading, since it
gives an impression that it's only about reads.
* After a quick experiment, it seems that extract_used_columns is invoked for
updates, but collects too many columns, e.g:
create table test (id int primary key, a text, b text, c text);
update test set a = 'something' where id = 1;
collects into scanCols all columns (varattno from 1 to 4) + again the first
column from baserestrictinfo. Is it correct?
* Not sure if it is supposed to be covered by this functionality, but if we do
insert ... on conflict (uniq_col) do update set other_col = 'something'
and actually have to perform an update, extract_used_columns is not called.
* Probably it also makes sense to check IS_DUMMY_REL in extract_used_columns?
On Sat, Jun 15, 2019 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Another reason for having the planner do this is that presumably, in
an AM that's excited about this, the set of fetched columns should
play into the cost estimates for the scan. I've not been paying
enough attention to the tableam work to know if we've got hooks for
the AM to affect scan costing ... but if we don't, that seems like
a hole that needs plugged.

AM callback relation_estimate_size exists currently which planner leverages.
Via this callback it fetches tuples, pages, etc. So, our thought is to extend
this API, if possible, to pass down the needed columns and help perform better
costing for the query. Though, if we wish to leverage this function, we need to
know the list of columns before planning and hence might need to use the query
tree.
I believe it would be beneficial to add this potential API extension patch into
the thread (as an example of an interface defining how scanCols could be used)
and review them together.
On Tue, Dec 17, 2019 at 2:57 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Thanks for the patch! If I understand correctly from this thread,
approach B is more preferable, so I've concentrated on the patch 0001
and have a few commentaries/questions:
Thanks so much for the review!
* The idea is to collect columns that are being used for selects/updates
(where it makes sense for columnar storage to avoid extra work), do I see it
right? If it's the case, then scanCols could be a bit misleading, since it
gives an impression that it's only about reads.
The "scanCols" columns are only what will need to be scanned in order
to execute a query, so, even if a column is being "used", it may not
be set in "scanCols" if it is not required to scan it.
For example, consider a column which is "used" but does not need to be scanned
-- e.g. in UPDATE x SET col = 2; "col" will not be in "scanCols" because
it is known that it will be 2.
That makes me think that maybe the function name,
extract_used_columns() is bad, though. Maybe extract_scan_columns()?
I tried this in the attached, updated patch.
* After a quick experiment, it seems that extract_used_columns is invoked for
updates, but collects too many columns, e.g:

create table test (id int primary key, a text, b text, c text);
update test set a = 'something' where id = 1;

collects into scanCols all columns (varattno from 1 to 4) + again the
first column from baserestrictinfo. Is it correct?
For UPDATE, we need all of the columns in the table because of the
table_lock() API's current expectation that the slot has all of the
columns populated. If we want UPDATE to only need to insert the column
values which have changed, table_tuple_lock() will have to change.
Collecting columns from the baserestrictinfo is important for when
that column isn't present in another part of the query, but it won't
double count it in the bitmap (when it is already present).
* Not sure if it is supposed to be covered by this functionality, but if we do
insert ... on conflict (uniq_col) do update set other_col = 'something'
and actually have to perform an update, extract_used_columns is not called.
For UPSERT, you are correct that it will not extract scan columns.
It wasn't by design; it is because the UPDATE is planned as part of
the INSERT.
For an INSERT, in query_planner(), because the jointree has only one
relation and that relation is an RTE_RESULT, the planner does not continue
on to make_one_rel() and thus doesn't extract scan columns.
This means that for INSERT ON CONFLICT DO UPDATE, "scanCols" is not
populated, however, since UPDATE needs to scan all of the columns
anyway, I don't think populating "scanCols" would have any impact.
This does mean that the bitmap would be different for a regular
UPDATE vs an UPSERT; however, I don't think that doing the extra work
to populate it makes sense if it won't be used. What do you think?
* Probably it also makes sense to check IS_DUMMY_REL in extract_used_columns?
I am wondering, when IS_DUMMY_REL is true for a relation, do we
reference the associated RTE later? It seems like if it is a dummy
rel, we wouldn't scan it. It still makes sense to add it to
extract_used_columns(), I think, to avoid any wasted loops through the
rel's expressions. Thanks for the idea!
--
Melanie Plageman
Attachments:
v2-0001-Plan-time-extraction-of-scan-cols.patch
From dfdc22b56bc28c520ad4af9461533f3d7a5421be Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 2 Jan 2020 16:59:29 -0800
Subject: [PATCH v2] Plan-time extraction of scan cols
Extract columns from query and save in the RangeTblEntry which
corresponds to the relation that will eventually be scanned.
Do this directly before costing so that this pruned list of columns
could potentially be used in costing calculations
Note that this patch does not use the scanCols.
---
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/equalfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/allpaths.c | 63 ++++++++++++++++++++++++++-
src/backend/optimizer/plan/planner.c | 12 +++++
src/backend/parser/parse_relation.c | 9 ++++
src/backend/rewrite/rewriteHandler.c | 2 +
src/include/nodes/parsenodes.h | 7 +++
9 files changed, 96 insertions(+), 1 deletion(-)
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 8034d5a51c..84e4058aab 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2397,6 +2397,7 @@ _copyRangeTblEntry(const RangeTblEntry *from)
COPY_BITMAPSET_FIELD(insertedCols);
COPY_BITMAPSET_FIELD(updatedCols);
COPY_BITMAPSET_FIELD(extraUpdatedCols);
+ COPY_BITMAPSET_FIELD(scanCols);
COPY_NODE_FIELD(securityQuals);
return newnode;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 9c8070c640..d9749d9cc5 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2681,6 +2681,7 @@ _equalRangeTblEntry(const RangeTblEntry *a, const RangeTblEntry *b)
COMPARE_BITMAPSET_FIELD(insertedCols);
COMPARE_BITMAPSET_FIELD(updatedCols);
COMPARE_BITMAPSET_FIELD(extraUpdatedCols);
+ COMPARE_BITMAPSET_FIELD(scanCols);
COMPARE_NODE_FIELD(securityQuals);
return true;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index a53d47371b..69d52ed75f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -3119,6 +3119,7 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node)
WRITE_BITMAPSET_FIELD(insertedCols);
WRITE_BITMAPSET_FIELD(updatedCols);
WRITE_BITMAPSET_FIELD(extraUpdatedCols);
+ WRITE_BITMAPSET_FIELD(scanCols);
WRITE_NODE_FIELD(securityQuals);
}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 81e7b94b9b..a26b99ec58 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1458,6 +1458,7 @@ _readRangeTblEntry(void)
READ_BITMAPSET_FIELD(insertedCols);
READ_BITMAPSET_FIELD(updatedCols);
READ_BITMAPSET_FIELD(extraUpdatedCols);
+ READ_BITMAPSET_FIELD(scanCols);
READ_NODE_FIELD(securityQuals);
READ_DONE();
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8286d9cf34..10b64121e1 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -142,7 +142,7 @@ static void recurse_push_qual(Node *setOp, Query *topquery,
RangeTblEntry *rte, Index rti, Node *qual);
static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel);
-
+static void extract_scan_columns(PlannerInfo *root);
/*
* make_one_rel
* Finds all possible access paths for executing a query, returning a
@@ -184,6 +184,8 @@ make_one_rel(PlannerInfo *root, List *joinlist)
*/
set_base_rel_sizes(root);
+ extract_scan_columns(root);
+
/*
* We should now have size estimates for every actual table involved in
* the query, and we also know which if any have been deleted from the
@@ -234,6 +236,65 @@ make_one_rel(PlannerInfo *root, List *joinlist)
return rel;
}
+static void
+extract_scan_columns(PlannerInfo *root)
+{
+ for (int i = 1; i < root->simple_rel_array_size; i++)
+ {
+ ListCell *lc;
+ RangeTblEntry *rte = root->simple_rte_array[i];
+ RelOptInfo *rel = root->simple_rel_array[i];
+
+ if (rte == NULL)
+ continue;
+
+ if (rel == NULL)
+ continue;
+
+ if (IS_DUMMY_REL(rel))
+ continue;
+
+ rte->scanCols = NULL;
+
+ foreach(lc, rel->reltarget->exprs)
+ {
+ Node *node;
+ List *vars;
+ ListCell *lc1;
+ node = lfirst(lc);
+ /*
+ * TODO: suggest a default for vars_only to make maintenance less burdensome
+ */
+ vars = pull_var_clause(node,
+ PVC_RECURSE_AGGREGATES |
+ PVC_RECURSE_WINDOWFUNCS |
+ PVC_RECURSE_PLACEHOLDERS);
+ foreach(lc1, vars)
+ {
+ Var *var = lfirst(lc1);
+ if (var->varno == i && var->varattno >= 0)
+ rte->scanCols = bms_add_member(rte->scanCols, var->varattno);
+ }
+ }
+
+ foreach(lc, rel->baserestrictinfo)
+ {
+ RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+ List *vars = pull_var_clause((Node *)rinfo->clause,
+ PVC_RECURSE_AGGREGATES |
+ PVC_RECURSE_WINDOWFUNCS |
+ PVC_RECURSE_PLACEHOLDERS);
+ ListCell *lc1;
+ foreach(lc1, vars)
+ {
+ Var *var = lfirst(lc1);
+ if (var->varno == i && var->varattno >= 0)
+ rte->scanCols = bms_add_member(rte->scanCols, var->varattno);
+ }
+ }
+ }
+}
+
/*
* set_base_rel_consider_startup
* Set the consider_[param_]startup flags for each base-relation entry.
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index d6f2153593..40fa7a11d3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1486,6 +1486,9 @@ inheritance_planner(PlannerInfo *root)
RelOptInfo *sub_final_rel;
Path *subpath;
+ ListCell *listCell;
+ int rti;
+
/*
* expand_inherited_rtentry() always processes a parent before any of
* that parent's children, so the parent query for this relation
@@ -1690,6 +1693,15 @@ inheritance_planner(PlannerInfo *root)
/* Build list of modified subroots, too */
subroots = lappend(subroots, subroot);
+ rti = 0;
+ foreach(listCell, subroot->parse->rtable)
+ {
+ RangeTblEntry *subroot_rte = lfirst(listCell);
+ RangeTblEntry *finalroot_rte = list_nth(final_rtable, rti);
+ if (finalroot_rte != subroot_rte)
+ finalroot_rte->scanCols = bms_union(finalroot_rte->scanCols, subroot_rte->scanCols);
+ rti++;
+ }
/* Build list of target-relation RT indexes */
resultRelations = lappend_int(resultRelations, appinfo->child_relid);
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index ceed0ceb48..64bb9d8d28 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -1460,6 +1460,7 @@ addRangeTableEntry(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -1548,6 +1549,7 @@ addRangeTableEntryForRelation(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -1645,6 +1647,7 @@ addRangeTableEntryForSubquery(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -1921,6 +1924,7 @@ addRangeTableEntryForFunction(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -1992,6 +1996,7 @@ addRangeTableEntryForTableFunc(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -2079,6 +2084,7 @@ addRangeTableEntryForValues(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -2162,6 +2168,7 @@ addRangeTableEntryForJoin(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -2281,6 +2288,7 @@ addRangeTableEntryForCTE(ParseState *pstate,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
rte->extraUpdatedCols = NULL;
/*
@@ -2395,6 +2403,7 @@ addRangeTableEntryForENR(ParseState *pstate,
rte->requiredPerms = 0;
rte->checkAsUser = InvalidOid;
rte->selectedCols = NULL;
+ rte->scanCols = NULL;
/*
* Add completed RTE to pstate's range table list, so that we know its
diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index e9fefec8b3..f0197fa0bb 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -1641,6 +1641,7 @@ ApplyRetrieveRule(Query *parsetree,
rte->selectedCols = NULL;
rte->insertedCols = NULL;
rte->updatedCols = NULL;
+ rte->scanCols = NULL;
/*
* For the most part, Vars referencing the view should remain as
@@ -1749,6 +1750,7 @@ ApplyRetrieveRule(Query *parsetree,
rte->insertedCols = NULL;
rte->updatedCols = NULL;
rte->extraUpdatedCols = NULL;
+ rte->scanCols = NULL;
return parsetree;
}
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index f67bd9fad5..fca4b3a47c 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1099,7 +1099,14 @@ typedef struct RangeTblEntry
Bitmapset *insertedCols; /* columns needing INSERT permission */
Bitmapset *updatedCols; /* columns needing UPDATE permission */
Bitmapset *extraUpdatedCols; /* generated columns being updated */
+ /*
+ * Columns to be scanned.
+ * If the 0th element is set due to varattno == 0
+ * that means all columns must be scanned, so handle this at scan-time
+ */
+ Bitmapset *scanCols;
List *securityQuals; /* security barrier quals to apply, if any */
+
} RangeTblEntry;
/*
--
2.20.1 (Apple Git-117)
I just wanted to address a question we got about adding scanCols
instead of using RelOptInfo->attr_needed.
David Rowley had asked:
For planning, isn't this information already available via either
targetlists or from the RelOptInfo->attr_needed array combined with
min_attr and max_attr?
attr_needed does not have all of the attributes needed set. Attributes
are only added in add_vars_to_targetlist() and this is only called for
certain query classes.
Also, Jeff Davis had asked off-list why we didn't start using
RelOptInfo->attr_needed for our purpose (marking which columns will
need to be scanned for the use of table access methods) instead of
adding a new member to RangeTblEntry.
The primary reason is that RelOptInfos are used during planning and
not execution. We need access to this information somewhere in the
PlannedStmt.
Even if we used attr_needed, we would, at some point, need to transfer
the columns needed to be scanned to a data structure available during
execution.
However, the next question was why not use attr_needed during costing
(which is the other time the table access method can use scanCols).
David Kimura and I dug into how attr_needed is used currently in
Postgres to understand if its meaning and usage are consistent with
using it instead of scanCols during costing.
We could set attr_needed in the same way we are now setting scanCols
and then use it during costing; however, besides the fact that we
would then have to create a member like scanCols anyway during
execution, it seems like adding an additional meaning to attr_needed
during planning is confusing and dangerous.
attr_needed is documented as being used to determine the highest
joinrel in which an attribute is needed (when it was introduced in
835bb975d8d).
Since then, it has been extended to be used for removing joins and
relations from queries (b78f6264eba33) and to determine if whole-row
vars are used in a baserel (which isn't supported as a participant in
a partition-wise join) (7cfdc77023ad507317).
It actually seems like attr_needed might be too general a name for
this field.
It isn't set for every attribute in a query -- only in specific cases
for attributes in certain parts of the query. If a developer checks
attr_needed for the columns in his/her query, many times those
columns will not be present.
It might actually be better to rename attr_needed to clarify its
usage.
scanCols, on the other hand, has a clear meaning and usage. For table
access methods, scanCols are the columns that need to be scanned.
There might even be cases where this ends up being a different set
than attr_needed for a base rel.
Melanie & David
On Thu, Jan 02, 2020 at 05:21:55PM -0800, Melanie Plageman wrote:
That makes me think that maybe the function name,
extract_used_columns() is bad, though. Maybe extract_scan_columns()?
I tried this in the attached, updated patch.
Thanks, I'll take a look at the new version. Following your explanation,
extract_scan_columns sounds better, but at the same time "scan" is a pretty
broad term, and one could misunderstand what kind of scan the name refers to.
For UPDATE, we need all of the columns in the table because of the
table_lock() API's current expectation that the slot has all of the
columns populated. If we want UPDATE to only need to insert the column
values which have changed, table_tuple_lock() will have to change.
Can you please elaborate on this part? I'm probably missing something,
since I don't see immediately where those expectations are expressed.
AM callback relation_estimate_size exists currently which planner leverages.
Via this callback it fetches tuples, pages, etc.. So, our thought is to extend
this API if possible to pass down needed column and help perform better costing
for the query. Though we think if wish to leverage this function, need to know
list of columns before planning hence might need to use query tree.

I believe it would be beneficial to add this potential API extension patch into
the thread (as an example of an interface defining how scanCols could be used)
and review them together.
I still find this question from my previous email interesting to explore.
I'm bumping this to the next commitfest because I haven't had a chance
to address the questions posed by Dmitry. I'm sure I'll get to it in
the next few weeks.
I believe it would be beneficial to add this potential API extension
patch into the thread (as an example of an interface defining how
scanCols could be used) and review them together.
As for including some code that uses the scanCols, after discussing
off-list with a few folks, there are three options I would like to
pursue for doing this.
One option I will pursue is using the scanCols to inform the columns
needed to be spilled for memory-bounded hashagg (mentioned by Jeff
here [1]).
The second is potentially using the scanCols in the context of FDW.
Corey, would you happen to have any specific examples handy of when
this might be useful?
The third is exercising it with a test only but providing an example
of how a table AM API user like Zedstore uses the columns during
planning.
[1]: /messages/by-id/e5566f7def33a9e9fdff337cca32d07155d7b635.camel@j-davis.com
--
Melanie Plageman
On Wed, Jan 15, 2020 at 7:54 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
For UPDATE, we need all of the columns in the table because of the
table_lock() API's current expectation that the slot has all of the
columns populated. If we want UPDATE to only need to insert the column
values which have changed, table_tuple_lock() will have to change.

Can you please elaborate on this part? I'm probably missing something,
since I don't see immediately where those expectations are expressed.
table_tuple_lock() has a TupleTableSlot as an output argument. The comment
for the function states "*slot: contains the target tuple". Usage of
table_tuple_lock() in places like ExecUpdate() and GetTupleForTrigger() uses
the returned tuple for EvalPlanQual. Also, it's unknown within
table_tuple_lock() what the expectation is and what would be performed on
the returned slot/tuple. Hence, it becomes tricky for any AM currently to
guess, and so it needs to return the full tuple, as the API doesn't state
which columns the caller is interested in, or whether the caller just
wishes to take the lock and needs nothing back in the slot.
On Sat, Jun 15, 2019 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Another reason for having the planner do this is that presumably, in
an AM that's excited about this, the set of fetched columns should
play into the cost estimates for the scan. I've not been paying
enough attention to the tableam work to know if we've got hooks for
the AM to affect scan costing ... but if we don't, that seems like
a hole that needs plugged.

AM callback relation_estimate_size exists currently which planner leverages.
Via this callback it fetches tuples, pages, etc.. So, our thought is to extend
this API if possible to pass down needed column and help perform better costing
for the query. Though we think if wish to leverage this function, need to know
list of columns before planning hence might need to use query tree.
I believe it would be beneficial to add this potential API extension
patch into the thread (as an example of an interface defining how
scanCols could be used) and review them together.

Thanks for your suggestion, we paste one potential API extension change
below for zedstore to use scanCols.
The change contains 3 patches to clarify our idea.
0001-ANALYZE.patch is a generic patch for the ANALYZE API extension; we
developed it to make the analysis of zedstore tables more accurate. It is
more flexible now, e.g., the TableAm can provide a logical block number as
the random sample seed; the TableAm can analyze only specified columns; the
TableAm can provide extra info besides the data tuple.
0002-Planner.patch is the real patch to show how we use rte->scanCols for
cost estimation. The main idea is adding a new metric 'stadiskfrac' to the
catalog pg_statistic; 'stadiskfrac' is the physical size ratio of a column,
calculated when ANALYZE is performed, and 0001-ANALYZE.patch helps to
provide the extra disk size info.
So when set_plain_rel_size() is called by the planner, it uses
rte->scanCols and 'stadiskfrac' to adjust rel->pages; please see
set_plain_rel_page_estimates().
0003-ZedStore.patch is an example of how zedstore uses the extended ANALYZE
API; I paste it here anyway, in case someone is interested in it.
Thanks,
Pengzhou
Attachments:
0001-ANALYZE-tableam-API-change.patch
From 77d257ab002a5e1b6a2f65e359cbfd7978e3cff5 Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Wed, 20 Nov 2019 06:42:37 -0500
Subject: [PATCH 1/3] ANALYZE tableam API change
Extended three ANALYZE-related tableam APIs so AMs can take more control
of ANALYZE progress:
- scan_analyze_beginscan() : so AMs can has more flexible sampling strategy
- scan_analyze_sample_tuple() : so ANALYZE can get extra info as needed
- scan_analyze_endscan() :
Also use struct AnalyzeSampleContext to provide more convenience, with it
tableam analyze routines can provide extra info except the real data,
for example: physical size or compression ratio.
---
contrib/file_fdw/file_fdw.c | 35 +++---
contrib/postgres_fdw/postgres_fdw.c | 56 +++++----
src/backend/access/heap/heapam_handler.c | 109 ++++++++++++++--
src/backend/access/table/tableam.c | 209 +++++++++++++++++++++++++++++++
src/backend/commands/analyze.c | 181 ++++++++------------------
src/include/access/tableam.h | 138 +++++++++++++++++---
src/include/foreign/fdwapi.h | 7 +-
7 files changed, 530 insertions(+), 205 deletions(-)
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 549821c..2344f01 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -19,6 +19,7 @@
#include "access/reloptions.h"
#include "access/sysattr.h"
#include "access/table.h"
+#include "access/tableam.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_foreign_table.h"
#include "commands/copy.h"
@@ -157,10 +158,8 @@ static void estimate_size(PlannerInfo *root, RelOptInfo *baserel,
static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
FileFdwPlanState *fdw_private,
Cost *startup_cost, Cost *total_cost);
-static int file_acquire_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows);
-
+static void file_acquire_sample_rows(Relation onerel, int elevel,
+ AnalyzeSampleContext *context);
/*
* Foreign-data wrapper handler function: return a struct with pointers
@@ -1091,14 +1090,16 @@ estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
* may be meaningless, but it's OK because we don't use the estimates
* currently (the planner only pays attention to correlation for indexscans).
*/
-static int
+static void
file_acquire_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows)
+ AnalyzeSampleContext *context)
{
int numrows = 0;
+ int targrows = 0;
+ double totalrows = 0;
double rowstoskip = -1; /* -1 means not set yet */
ReservoirStateData rstate;
+ HeapTuple tuple;
TupleDesc tupDesc;
Datum *values;
bool *nulls;
@@ -1111,6 +1112,8 @@ file_acquire_sample_rows(Relation onerel, int elevel,
MemoryContext oldcontext = CurrentMemoryContext;
MemoryContext tupcontext;
+ targrows = context->targrows;
+
Assert(onerel);
Assert(targrows > 0);
@@ -1144,8 +1147,6 @@ file_acquire_sample_rows(Relation onerel, int elevel,
errcallback.previous = error_context_stack;
error_context_stack = &errcallback;
- *totalrows = 0;
- *totaldeadrows = 0;
for (;;)
{
/* Check for user-requested abort or sleep */
@@ -1170,7 +1171,8 @@ file_acquire_sample_rows(Relation onerel, int elevel,
*/
if (numrows < targrows)
{
- rows[numrows++] = heap_form_tuple(tupDesc, values, nulls);
+ tuple = heap_form_tuple(tupDesc, values, nulls);
+ AnalyzeRecordSampleRow(context, NULL, tuple, ANALYZE_SAMPLE_DATA, numrows++, false /* replace */, false);
}
else
{
@@ -1180,7 +1182,7 @@ file_acquire_sample_rows(Relation onerel, int elevel,
* not-yet-incremented value of totalrows as t.
*/
if (rowstoskip < 0)
- rowstoskip = reservoir_get_next_S(&rstate, *totalrows, targrows);
+ rowstoskip = reservoir_get_next_S(&rstate, totalrows, targrows);
if (rowstoskip <= 0)
{
@@ -1191,14 +1193,14 @@ file_acquire_sample_rows(Relation onerel, int elevel,
int k = (int) (targrows * sampler_random_fract(rstate.randstate));
Assert(k >= 0 && k < targrows);
- heap_freetuple(rows[k]);
- rows[k] = heap_form_tuple(tupDesc, values, nulls);
+ tuple = heap_form_tuple(tupDesc, values, nulls);
+ AnalyzeRecordSampleRow(context, NULL, tuple, ANALYZE_SAMPLE_DATA, k, true /* replace */, false);
}
rowstoskip -= 1;
}
- *totalrows += 1;
+ totalrows += 1;
}
/* Remove error callback. */
@@ -1219,7 +1221,8 @@ file_acquire_sample_rows(Relation onerel, int elevel,
(errmsg("\"%s\": file contains %.0f rows; "
"%d rows in sample",
RelationGetRelationName(onerel),
- *totalrows, numrows)));
+ totalrows, numrows)));
- return numrows;
+ context->totalrows += totalrows;
+ context->totalsampledrows += numrows;
}
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index bdc21b3..f0789cc 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -17,6 +17,7 @@
#include "access/htup_details.h"
#include "access/sysattr.h"
#include "access/table.h"
+#include "access/tableam.h"
#include "catalog/pg_class.h"
#include "commands/defrem.h"
#include "commands/explain.h"
@@ -237,7 +238,6 @@ typedef struct PgFdwAnalyzeState
List *retrieved_attrs; /* attr numbers retrieved by query */
/* collected sample rows */
- HeapTuple *rows; /* array of size targrows */
int targrows; /* target # of sample rows */
int numrows; /* # of sample rows collected */
@@ -463,12 +463,11 @@ static void process_query_params(ExprContext *econtext,
FmgrInfo *param_flinfo,
List *param_exprs,
const char **param_values);
-static int postgresAcquireSampleRowsFunc(Relation relation, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows,
- double *totaldeadrows);
+static void postgresAcquireSampleRowsFunc(Relation relation, int elevel,
+ AnalyzeSampleContext *context);
static void analyze_row_processor(PGresult *res, int row,
- PgFdwAnalyzeState *astate);
+ PgFdwAnalyzeState *astate,
+ AnalyzeSampleContext *context);
static HeapTuple make_tuple_from_result_row(PGresult *res,
int row,
Relation rel,
@@ -4488,11 +4487,9 @@ postgresAnalyzeForeignTable(Relation relation,
* may be meaningless, but it's OK because we don't use the estimates
* currently (the planner only pays attention to correlation for indexscans).
*/
-static int
+static void
postgresAcquireSampleRowsFunc(Relation relation, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows,
- double *totaldeadrows)
+ AnalyzeSampleContext *context)
{
PgFdwAnalyzeState astate;
ForeignTable *table;
@@ -4506,13 +4503,11 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
/* Initialize workspace state */
astate.rel = relation;
astate.attinmeta = TupleDescGetAttInMetadata(RelationGetDescr(relation));
-
- astate.rows = rows;
- astate.targrows = targrows;
+ astate.targrows = context->targrows;
astate.numrows = 0;
astate.samplerows = 0;
astate.rowstoskip = -1; /* -1 means not set yet */
- reservoir_init_selection_state(&astate.rstate, targrows);
+ reservoir_init_selection_state(&astate.rstate, astate.targrows);
/* Remember ANALYZE context, and create a per-tuple temp context */
astate.anl_cxt = CurrentMemoryContext;
@@ -4604,7 +4599,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
/* Process whatever we got. */
numrows = PQntuples(res);
for (i = 0; i < numrows; i++)
- analyze_row_processor(res, i, &astate);
+ analyze_row_processor(res, i, &astate, context);
PQclear(res);
res = NULL;
@@ -4628,10 +4623,13 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
ReleaseConnection(conn);
/* We assume that we have no dead tuple. */
- *totaldeadrows = 0.0;
+ context->totaldeadrows = 0.0;
/* We've retrieved all living tuples from foreign server. */
- *totalrows = astate.samplerows;
+ context->totalrows += astate.samplerows;
+
+ /* Increase the number of sample rows stored in the context */
+ context->totalsampledrows += astate.numrows;
/*
* Emit some interesting relation info
@@ -4640,8 +4638,6 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
(errmsg("\"%s\": table contains %.0f rows, %d rows in sample",
RelationGetRelationName(relation),
astate.samplerows, astate.numrows)));
-
- return astate.numrows;
}
/*
@@ -4650,10 +4646,11 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
* - Subsequently, replace already-sampled tuples randomly.
*/
static void
-analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate)
+analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate, AnalyzeSampleContext *context)
{
int targrows = astate->targrows;
int pos; /* array index to store tuple in */
+ bool replace;
MemoryContext oldcontext;
/* Always increment sample row counter. */
@@ -4667,6 +4664,7 @@ analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate)
{
/* First targrows rows are always included into the sample */
pos = astate->numrows++;
+ replace = false;
}
else
{
@@ -4683,7 +4681,7 @@ analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate)
/* Choose a random reservoir element to replace. */
pos = (int) (targrows * sampler_random_fract(astate->rstate.randstate));
Assert(pos >= 0 && pos < targrows);
- heap_freetuple(astate->rows[pos]);
+ replace = true;
}
else
{
@@ -4696,18 +4694,22 @@ analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate)
if (pos >= 0)
{
+ HeapTuple tuple;
/*
* Create sample tuple from current result row, and store it in the
* position determined above. The tuple has to be created in anl_cxt.
*/
oldcontext = MemoryContextSwitchTo(astate->anl_cxt);
- astate->rows[pos] = make_tuple_from_result_row(res, row,
- astate->rel,
- astate->attinmeta,
- astate->retrieved_attrs,
- NULL,
- astate->temp_cxt);
+ tuple = make_tuple_from_result_row(res, row,
+ astate->rel,
+ astate->attinmeta,
+ astate->retrieved_attrs,
+ NULL,
+ astate->temp_cxt);
+
+ /* The tuple is already created in anl_cxt, so we can record it directly */
+ AnalyzeRecordSampleRow(context, NULL, tuple, ANALYZE_SAMPLE_DATA, pos, replace, false);
MemoryContextSwitchTo(oldcontext);
}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 253849e..c57c670 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -35,6 +35,7 @@
#include "executor/executor.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "parser/analyze.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/lmgr.h"
@@ -44,6 +45,7 @@
#include "utils/builtins.h"
#include "utils/rel.h"
+static int compare_rows(const void *a, const void *b);
static void reform_and_rewrite_tuple(HeapTuple tuple,
Relation OldHeap, Relation NewHeap,
Datum *values, bool *isnull, RewriteState rwstate);
@@ -974,10 +976,25 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
pfree(isnull);
}
+static void
+heapam_scan_analyze_beginscan(Relation onerel, AnalyzeSampleContext *context)
+{
+ context->scan = table_beginscan_analyze(onerel);
+
+ /* initialize the total number of blocks analyze can scan */
+ context->totalblocks = RelationGetNumberOfBlocks(onerel);
+
+ /* reset the per-iteration statistics */
+ context->liverows = 0;
+ context->deadrows = 0;
+ context->ordered = true;
+}
+
static bool
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
- BufferAccessStrategy bstrategy)
+heapam_scan_analyze_next_block(BlockNumber blockno,
+ AnalyzeSampleContext *context)
{
+ TableScanDesc scan = context->scan;
HeapScanDesc hscan = (HeapScanDesc) scan;
/*
@@ -992,7 +1009,7 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
hscan->rs_cblock = blockno;
hscan->rs_cindex = FirstOffsetNumber;
hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
- blockno, RBM_NORMAL, bstrategy);
+ blockno, RBM_NORMAL, context->bstrategy);
LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
/* in heap all blocks can contain tuples, so always return true */
@@ -1000,14 +1017,14 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
}
static bool
-heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
- double *liverows, double *deadrows,
- TupleTableSlot *slot)
+heapam_scan_analyze_next_tuple(TransactionId OldestXmin, AnalyzeSampleContext *context)
{
+ TableScanDesc scan = context->scan;
HeapScanDesc hscan = (HeapScanDesc) scan;
Page targpage;
OffsetNumber maxoffset;
BufferHeapTupleTableSlot *hslot;
+ TupleTableSlot *slot = AnalyzeGetSampleSlot(context, scan->rs_rd, ANALYZE_SAMPLE_DATA);
Assert(TTS_IS_BUFFERTUPLE(slot));
@@ -1033,7 +1050,7 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
if (!ItemIdIsNormal(itemid))
{
if (ItemIdIsDead(itemid))
- *deadrows += 1;
+ context->deadrows += 1;
continue;
}
@@ -1048,13 +1065,13 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
{
case HEAPTUPLE_LIVE:
sample_it = true;
- *liverows += 1;
+ context->liverows += 1;
break;
case HEAPTUPLE_DEAD:
case HEAPTUPLE_RECENTLY_DEAD:
/* Count dead and recently-dead rows */
- *deadrows += 1;
+ context->deadrows += 1;
break;
case HEAPTUPLE_INSERT_IN_PROGRESS:
@@ -1080,7 +1097,7 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
{
sample_it = true;
- *liverows += 1;
+ context->liverows += 1;
}
break;
@@ -1109,11 +1126,11 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
* concurrent transaction never commits.
*/
if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(targtuple->t_data)))
- *deadrows += 1;
+ context->deadrows += 1;
else
{
sample_it = true;
- *liverows += 1;
+ context->liverows += 1;
}
break;
@@ -1142,6 +1159,71 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
return false;
}
+static void
+heapam_scan_analyze_sample_tuple(int pos, bool replace, AnalyzeSampleContext *context)
+{
+ TupleTableSlot *slot;
+ Relation onerel = context->scan->rs_rd;
+
+ Assert(pos >= 0);
+ /*
+ * heapam_scan_analyze_next_tuple has already put the tuple in
+ * the sample slot; just record it into the array of sample
+ * rows.
+ */
+ slot = AnalyzeGetSampleSlot(context, onerel, ANALYZE_SAMPLE_DATA);
+ AnalyzeRecordSampleRow(context, slot, NULL, ANALYZE_SAMPLE_DATA, pos, replace, true);
+
+ /*
+ * If a replacement happened, the sample rows are no longer
+ * ordered by physical position.
+ */
+ if (replace)
+ context->ordered = false;
+}
+
+static void
+heapam_scan_analyze_endscan(AnalyzeSampleContext *context)
+{
+ HeapTuple *rows = AnalyzeGetSampleRows(context, ANALYZE_SAMPLE_DATA, context->totalsampledrows);
+
+ /*
+ * If no replacement happened, the rows were collected in
+ * physical order and no sort is needed.
+ *
+ * Otherwise we need to sort the collected tuples by position
+ * (itempointer).
+ */
+ if (!context->ordered)
+ qsort((void *)rows, context->targrows, sizeof(HeapTuple), compare_rows);
+
+ table_endscan(context->scan);
+}
+
+/*
+ * qsort comparator for sorting rows[] array
+ */
+static int
+compare_rows(const void *a, const void *b)
+{
+ HeapTuple ha = *(const HeapTuple *) a;
+ HeapTuple hb = *(const HeapTuple *) b;
+ BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
+ OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
+ BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
+ OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
+
+ if (ba < bb)
+ return -1;
+ if (ba > bb)
+ return 1;
+ if (oa < ob)
+ return -1;
+ if (oa > ob)
+ return 1;
+ return 0;
+}
+
static double
heapam_index_build_range_scan(Relation heapRelation,
Relation indexRelation,
@@ -2529,8 +2611,11 @@ static const TableAmRoutine heapam_methods = {
.relation_copy_data = heapam_relation_copy_data,
.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
.relation_vacuum = heap_vacuum_rel,
+ .scan_analyze_beginscan = heapam_scan_analyze_beginscan,
.scan_analyze_next_block = heapam_scan_analyze_next_block,
.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
+ .scan_analyze_sample_tuple = heapam_scan_analyze_sample_tuple,
+ .scan_analyze_endscan = heapam_scan_analyze_endscan,
.index_build_range_scan = heapam_index_build_range_scan,
.index_validate_scan = heapam_index_validate_scan,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index b9ed336..d40eff0 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -23,7 +23,9 @@
#include "access/heapam.h" /* for ss_* */
#include "access/tableam.h"
+#include "access/tupconvert.h"
#include "access/xact.h"
+#include "catalog/pg_type.h"
#include "optimizer/plancat.h"
#include "storage/bufmgr.h"
#include "storage/shmem.h"
@@ -650,3 +652,210 @@ table_block_relation_estimate_size(Relation rel, int32 *attr_widths,
else
*allvisfrac = (double) relallvisible / curpages;
}
+
+/* Create the analyze sample context to acquire sample rows */
+AnalyzeSampleContext *
+CreateAnalyzeSampleContext(Relation onerel,
+ List *anl_cols,
+ int totaltargrows,
+ BufferAccessStrategy strategy)
+{
+ AnalyzeSampleContext *context;
+
+ context = (AnalyzeSampleContext *) palloc(sizeof(AnalyzeSampleContext));
+ context->parent = onerel;
+ context->anl_cols = anl_cols;
+ context->bstrategy = strategy;
+ context->totaltargrows = totaltargrows;
+ context->targrows = totaltargrows;
+ context->scan = NULL;
+ context->totalblocks = 0;
+ context->totalrows = 0;
+ context->totaldeadrows = 0;
+ context->totalsampledrows = 0;
+ context->liverows = 0;
+ context->deadrows = 0;
+ context->ordered = false;
+ context->tup_convert_map = NULL;
+
+ /* zero out all sample types */
+ memset(context->sample_slots, 0, MAX_ANALYZE_SAMPLE * sizeof(TupleTableSlot *));
+ memset(context->sample_rows, 0, MAX_ANALYZE_SAMPLE * sizeof(HeapTuple *));
+
+ return context;
+}
+
+/* Destroy analyze sample context */
+void
+DestroyAnalyzeSampleContext(AnalyzeSampleContext *context)
+{
+ for (int i = 0; i < MAX_ANALYZE_SAMPLE; i++)
+ {
+ TupleTableSlot *slot = context->sample_slots[i];
+ if (slot)
+ ExecDropSingleTupleTableSlot(slot);
+ }
+}
+
+/*
+ * To acquire sample rows from an inherited table, all child
+ * relations share the same analyze sample context; this function
+ * must be called before starting to analyze a new child relation.
+ */
+void
+InitAnalyzeSampleContextForChild(AnalyzeSampleContext *context,
+ Relation child,
+ int childtargrows)
+{
+ /* Set targrows to childtargrows */
+ context->targrows = childtargrows;
+
+ /* We may need to convert from child's rowtype to parent's */
+ if (!equalTupleDescs(RelationGetDescr(child),
+ RelationGetDescr(context->parent)))
+ {
+ if (context->tup_convert_map)
+ free_conversion_map(context->tup_convert_map);
+ /* Create a conversion map for use when recording sample rows */
+ context->tup_convert_map =
+ convert_tuples_by_name(RelationGetDescr(child),
+ RelationGetDescr(context->parent));
+
+ /* We also cannot use the previous sample slot anymore */
+ if (context->sample_slots[ANALYZE_SAMPLE_DATA])
+ {
+ ExecDropSingleTupleTableSlot(context->sample_slots[ANALYZE_SAMPLE_DATA]);
+ context->sample_slots[ANALYZE_SAMPLE_DATA] = NULL;
+ }
+ }
+}
+
+void
+AnalyzeGetSampleStats(AnalyzeSampleContext *context,
+ int *totalsampledrows,
+ double *totalrows,
+ double *totaldeadrows)
+{
+ if (totalsampledrows)
+ *totalsampledrows = context->totalsampledrows;
+ if (totalrows)
+ *totalrows = context->totalrows;
+ if (totaldeadrows)
+ *totaldeadrows = context->totaldeadrows;
+}
+
+
+/*
+ * Get or initialize a sample slot to hold a sample tuple. Normally
+ * the tuple in the slot will be copied into sample_rows[type]
+ * by AnalyzeRecordSampleRow().
+ */
+TupleTableSlot *
+AnalyzeGetSampleSlot(AnalyzeSampleContext *context,
+ Relation onerel,
+ AnalyzeSampleType type)
+{
+ TupleDesc tupdesc;
+ int attr_cnt = onerel->rd_att->natts;
+
+ if (context->sample_slots[type])
+ return context->sample_slots[type];
+
+ switch (type)
+ {
+ case ANALYZE_SAMPLE_DATA:
+ tupdesc = RelationGetDescr(onerel);
+ break;
+ case ANALYZE_SAMPLE_DISKSIZE:
+ tupdesc = CreateTemplateTupleDesc(attr_cnt);
+ for (int i = 1; i <= attr_cnt; i++)
+ TupleDescInitEntry(tupdesc, i, "", FLOAT8OID, -1, 0);
+ break;
+ default:
+ elog(ERROR, "unknown analyze sample type");
+ }
+
+ context->sample_slots[type] =
+ MakeSingleTupleTableSlot(tupdesc, table_slot_callbacks(onerel));
+ return context->sample_slots[type];
+}
+
+HeapTuple *
+AnalyzeGetSampleRows(AnalyzeSampleContext *context,
+ AnalyzeSampleType type,
+ int offset)
+{
+ Assert(offset < context->totaltargrows);
+ if (!context->sample_rows[type])
+ context->sample_rows[type] =
+ (HeapTuple *) palloc(context->totaltargrows * sizeof(HeapTuple));
+
+ return context->sample_rows[type] + offset;
+}
+
+/*
+ * Record a sample tuple into sample_rows[type].
+ *
+ * sample_tuple:
+ * Input sample tuple. Sometimes the caller has already
+ * formed the sample tuple in its memory context, so we
+ * can record it directly.
+ * sample_slot:
+ * Slot which contains the sample tuple. We need to copy
+ * the sample tuple and then record it.
+ * pos:
+ * The position in sample_rows[type].
+ * replace:
+ * Replace the old sample tuple at the specified position.
+ * withtid:
+ * Set the tid of the sample tuple; this is only valid
+ * when sample_slot is set.
+ *
+ * We prefer sample_slot if both sample_tuple and
+ * sample_slot are set; sample_slot is the most common case.
+ */
+void
+AnalyzeRecordSampleRow(AnalyzeSampleContext *context,
+ TupleTableSlot *sample_slot,
+ HeapTuple sample_tuple,
+ AnalyzeSampleType type,
+ int pos,
+ bool replace,
+ bool withtid)
+{
+ HeapTuple tuple;
+ HeapTuple *rows;
+
+ rows = AnalyzeGetSampleRows(context, type, context->totalsampledrows);
+
+ /* We need to free the old tuple if replace is true */
+ if (replace)
+ heap_freetuple(rows[pos]);
+
+ Assert(sample_slot || sample_tuple);
+ if (sample_slot)
+ tuple = ExecCopySlotHeapTuple(sample_slot);
+ else
+ tuple = sample_tuple;
+
+ /* We may need to convert from child's rowtype to parent's */
+ if (context->tup_convert_map != NULL)
+ {
+ HeapTuple newtup;
+ newtup = execute_attr_map_tuple(tuple, context->tup_convert_map);
+ heap_freetuple(tuple);
+ tuple = newtup;
+ }
+
+ if (withtid && sample_slot)
+ tuple->t_self = sample_slot->tts_tid;
+
+ /* store the tuple at the right position */
+ rows[pos] = tuple;
+}
+
+bool
+AnalyzeSampleIsValid(AnalyzeSampleContext *context, AnalyzeSampleType type)
+{
+ return context->sample_rows[type] != NULL;
+}
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index e2033f9..cc1649d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -83,7 +83,6 @@ int default_statistics_target = 100;
static MemoryContext anl_context = NULL;
static BufferAccessStrategy vac_strategy;
-
static void do_analyze_rel(Relation onerel,
VacuumParams *params, List *va_cols,
AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
@@ -94,13 +93,10 @@ static void compute_index_stats(Relation onerel, double totalrows,
MemoryContext col_context);
static VacAttrStats *examine_attribute(Relation onerel, int attnum,
Node *index_expr);
-static int acquire_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows);
-static int compare_rows(const void *a, const void *b);
-static int acquire_inherited_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows);
+static void acquire_sample_rows(Relation onerel, int elevel,
+ AnalyzeSampleContext *context);
+static void acquire_inherited_sample_rows(Relation onerel, int elevel,
+ AnalyzeSampleContext *context);
static void update_attstats(Oid relid, bool inh,
int natts, VacAttrStats **vacattrstats);
static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
@@ -318,6 +314,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ AnalyzeSampleContext *sample_context;
if (inh)
ereport(elevel,
@@ -502,18 +499,21 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
if (targrows < minrows)
targrows = minrows;
+ /* create context for acquiring sample rows */
+ sample_context = CreateAnalyzeSampleContext(onerel, va_cols, targrows,
+ vac_strategy);
+
/*
* Acquire the sample rows
*/
- rows = (HeapTuple *) palloc(targrows * sizeof(HeapTuple));
if (inh)
- numrows = acquire_inherited_sample_rows(onerel, elevel,
- rows, targrows,
- &totalrows, &totaldeadrows);
+ acquire_inherited_sample_rows(onerel, elevel, sample_context);
else
- numrows = (*acquirefunc) (onerel, elevel,
- rows, targrows,
- &totalrows, &totaldeadrows);
+ (*acquirefunc) (onerel, elevel, sample_context);
+
+ /* Get the sample statistics */
+ AnalyzeGetSampleStats(sample_context, &numrows, &totalrows, &totaldeadrows);
+ rows = AnalyzeGetSampleRows(sample_context, ANALYZE_SAMPLE_DATA, 0);
/*
* Compute the statistics. Temporary results during the calculations for
@@ -592,7 +592,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
* not for relations representing inheritance trees.
*/
if (!inh)
- BuildRelationExtStatistics(onerel, totalrows, numrows, rows,
+ BuildRelationExtStatistics(onerel, totalrows, numrows,
+ rows,
attr_cnt, vacattrstats);
}
@@ -690,6 +691,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
pg_rusage_show(&ru0))));
}
+ DestroyAnalyzeSampleContext(sample_context);
+
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
@@ -1018,26 +1021,26 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
* block. The previous sampling method put too much credence in the row
* density near the start of the table.
*/
-static int
+static void
acquire_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows)
+ AnalyzeSampleContext *context)
{
int numrows = 0; /* # rows now in reservoir */
+ int targrows = context->targrows;
double samplerows = 0; /* total # rows collected */
- double liverows = 0; /* # live rows seen */
- double deadrows = 0; /* # dead rows seen */
double rowstoskip = -1; /* -1 means not set yet */
+ double totalrows = 0;
+ double totaldeadrows = 0;
BlockNumber totalblocks;
TransactionId OldestXmin;
BlockSamplerData bs;
ReservoirStateData rstate;
- TupleTableSlot *slot;
- TableScanDesc scan;
Assert(targrows > 0);
- totalblocks = RelationGetNumberOfBlocks(onerel);
+ table_scan_analyze_beginscan(onerel, context);
+
+ totalblocks = context->totalblocks;
/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
OldestXmin = GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM);
@@ -1047,9 +1050,6 @@ acquire_sample_rows(Relation onerel, int elevel,
/* Prepare for sampling rows */
reservoir_init_selection_state(&rstate, targrows);
- scan = table_beginscan_analyze(onerel);
- slot = table_slot_create(onerel, NULL);
-
/* Outer loop over blocks to sample */
while (BlockSampler_HasMore(&bs))
{
@@ -1057,10 +1057,10 @@ acquire_sample_rows(Relation onerel, int elevel,
vacuum_delay_point();
- if (!table_scan_analyze_next_block(scan, targblock, vac_strategy))
+ if (!table_scan_analyze_next_block(targblock, context))
continue;
- while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+ while (table_scan_analyze_next_tuple(OldestXmin, context))
{
/*
* The first targrows sample rows are simply copied into the
@@ -1076,8 +1076,8 @@ acquire_sample_rows(Relation onerel, int elevel,
*/
if (numrows < targrows)
{
- rows[numrows] = ExecCopySlotHeapTuple(slot);
- rows[numrows]->t_self = slot->tts_tid;
+ table_scan_analyze_sample_tuple(numrows, false, context);
+
numrows++;
}
else
@@ -1099,9 +1099,8 @@ acquire_sample_rows(Relation onerel, int elevel,
int k = (int) (targrows * sampler_random_fract(rstate.randstate));
Assert(k >= 0 && k < targrows);
- heap_freetuple(rows[k]);
- rows[k] = ExecCopySlotHeapTuple(slot);
- rows[k]->t_self = slot->tts_tid;
+
+ table_scan_analyze_sample_tuple(k, true, context);
}
rowstoskip -= 1;
@@ -1111,19 +1110,7 @@ acquire_sample_rows(Relation onerel, int elevel,
}
}
- ExecDropSingleTupleTableSlot(slot);
- table_endscan(scan);
-
- /*
- * If we didn't find as many tuples as we wanted then we're done. No sort
- * is needed, since they're already in order.
- *
- * Otherwise we need to sort the collected tuples by position
- * (itempointer). It's not worth worrying about corner cases where the
- * tuples are already sorted.
- */
- if (numrows == targrows)
- qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
+ table_scan_analyze_endscan(context);
/*
* Estimate total numbers of live and dead rows in relation, extrapolating
@@ -1134,13 +1121,13 @@ acquire_sample_rows(Relation onerel, int elevel,
*/
if (bs.m > 0)
{
- *totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
- *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
+ totalrows = floor((context->liverows / bs.m) * totalblocks + 0.5);
+ totaldeadrows = floor((context->deadrows / bs.m) * totalblocks + 0.5);
}
else
{
- *totalrows = 0.0;
- *totaldeadrows = 0.0;
+ totalrows = 0.0;
+ totaldeadrows = 0.0;
}
/*
@@ -1152,34 +1139,13 @@ acquire_sample_rows(Relation onerel, int elevel,
"%d rows in sample, %.0f estimated total rows",
RelationGetRelationName(onerel),
bs.m, totalblocks,
- liverows, deadrows,
- numrows, *totalrows)));
+ context->liverows,
+ context->deadrows,
+ numrows, totalrows)));
- return numrows;
-}
-
-/*
- * qsort comparator for sorting rows[] array
- */
-static int
-compare_rows(const void *a, const void *b)
-{
- HeapTuple ha = *(const HeapTuple *) a;
- HeapTuple hb = *(const HeapTuple *) b;
- BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
- OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
- BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
- OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
-
- if (ba < bb)
- return -1;
- if (ba > bb)
- return 1;
- if (oa < ob)
- return -1;
- if (oa > ob)
- return 1;
- return 0;
+ context->totalrows += totalrows;
+ context->totaldeadrows += totaldeadrows;
+ context->totalsampledrows += numrows;
}
@@ -1191,18 +1157,16 @@ compare_rows(const void *a, const void *b)
* We fail and return zero if there are no inheritance children, or if all
* children are foreign tables that don't support ANALYZE.
*/
-static int
+static void
acquire_inherited_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows)
+ AnalyzeSampleContext *context)
{
List *tableOIDs;
Relation *rels;
AcquireSampleRowsFunc *acquirefuncs;
double *relblocks;
double totalblocks;
- int numrows,
- nrels,
+ int nrels,
i;
ListCell *lc;
bool has_child;
@@ -1230,7 +1194,7 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
(errmsg("skipping analyze of \"%s.%s\" inheritance tree --- this inheritance tree contains no child tables",
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel))));
- return 0;
+ return;
}
/*
@@ -1328,7 +1292,7 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
(errmsg("skipping analyze of \"%s.%s\" inheritance tree --- this inheritance tree contains no analyzable child tables",
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel))));
- return 0;
+ return;
}
/*
@@ -1337,9 +1301,6 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
* rels have radically different free-space percentages, but it's not
* clear that it's worth working harder.)
*/
- numrows = 0;
- *totalrows = 0;
- *totaldeadrows = 0;
for (i = 0; i < nrels; i++)
{
Relation childrel = rels[i];
@@ -1350,49 +1311,15 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
{
int childtargrows;
- childtargrows = (int) rint(targrows * childblocks / totalblocks);
+ childtargrows = (int) rint(context->totaltargrows * childblocks / totalblocks);
/* Make sure we don't overrun due to roundoff error */
- childtargrows = Min(childtargrows, targrows - numrows);
+ childtargrows = Min(childtargrows, context->totaltargrows - context->totalsampledrows);
if (childtargrows > 0)
{
- int childrows;
- double trows,
- tdrows;
+ InitAnalyzeSampleContextForChild(context, childrel, childtargrows);
/* Fetch a random sample of the child's rows */
- childrows = (*acquirefunc) (childrel, elevel,
- rows + numrows, childtargrows,
- &trows, &tdrows);
-
- /* We may need to convert from child's rowtype to parent's */
- if (childrows > 0 &&
- !equalTupleDescs(RelationGetDescr(childrel),
- RelationGetDescr(onerel)))
- {
- TupleConversionMap *map;
-
- map = convert_tuples_by_name(RelationGetDescr(childrel),
- RelationGetDescr(onerel));
- if (map != NULL)
- {
- int j;
-
- for (j = 0; j < childrows; j++)
- {
- HeapTuple newtup;
-
- newtup = execute_attr_map_tuple(rows[numrows + j], map);
- heap_freetuple(rows[numrows + j]);
- rows[numrows + j] = newtup;
- }
- free_conversion_map(map);
- }
- }
-
- /* And add to counts */
- numrows += childrows;
- *totalrows += trows;
- *totaldeadrows += tdrows;
+ (*acquirefunc) (childrel, elevel, context);
}
}
@@ -1402,8 +1329,6 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
*/
table_close(childrel, NoLock);
}
-
- return numrows;
}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0b882dc..90d2375 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -37,6 +37,66 @@ struct SampleScanState;
struct TBMIterateResult;
struct VacuumParams;
struct ValidateIndexState;
+struct TupleConversionMap;
+
+typedef enum AnalyzeSampleType
+{
+ ANALYZE_SAMPLE_DATA = 0, /* real data per column */
+ ANALYZE_SAMPLE_DISKSIZE, /* physical size per column */
+ MAX_ANALYZE_SAMPLE /* must be last */
+} AnalyzeSampleType;
+
+typedef struct AnalyzeSampleContext
+{
+ /* Filled when context is created */
+ int totaltargrows;
+ List *anl_cols;
+ Relation parent;
+ BufferAccessStrategy bstrategy;
+
+ /* Filled by table AM analyze routines */
+ BlockNumber totalblocks;
+ TableScanDesc scan;
+
+ /*
+ * Acquiring sample rows from an inherited table invokes
+ * one sampling iteration per child relation, so the fields
+ * below hold the statistics for the current iteration.
+ */
+ int targrows; /* target number of sample rows */
+ double liverows;
+ double deadrows;
+ bool ordered; /* are sample rows ordered physically */
+
+ /*
+ * Statistics accumulated across all sampling iterations.
+ */
+ int totalsampledrows; /* total number of sample rows stored */
+ double totalrows;
+ double totaldeadrows;
+
+ /*
+ * If a child relation has a different rowtype than the
+ * parent, we need to convert its sample tuples to the
+ * parent's rowtype.
+ */
+ struct TupleConversionMap *tup_convert_map;
+
+ /*
+ * Used by table AM analyze routines to store the
+ * temporary tuple for each type of sample row; the
+ * tuple is finally stored into sample_rows[] if it
+ * is randomly selected.
+ */
+ TupleTableSlot* sample_slots[MAX_ANALYZE_SAMPLE];
+
+ /*
+ * Stores the final sample rows that will be used
+ * to compute statistics.
+ */
+ HeapTuple* sample_rows[MAX_ANALYZE_SAMPLE];
+} AnalyzeSampleContext;
/*
* Bitmask values for the flags argument to the scan_begin callback.
@@ -532,9 +592,10 @@ typedef struct TableAmRoutine
* clear what a good interface for non block based AMs would be, so there
* isn't one yet.
*/
- bool (*scan_analyze_next_block) (TableScanDesc scan,
- BlockNumber blockno,
- BufferAccessStrategy bstrategy);
+ void (*scan_analyze_beginscan) (Relation onerel, AnalyzeSampleContext *context);
+
+ bool (*scan_analyze_next_block) (BlockNumber blockno,
+ AnalyzeSampleContext *context);
/*
* See table_scan_analyze_next_tuple().
@@ -544,11 +605,13 @@ typedef struct TableAmRoutine
* influence autovacuum scheduling (see comment for relation_vacuum
* callback).
*/
- bool (*scan_analyze_next_tuple) (TableScanDesc scan,
- TransactionId OldestXmin,
- double *liverows,
- double *deadrows,
- TupleTableSlot *slot);
+ bool (*scan_analyze_next_tuple) (TransactionId OldestXmin,
+ AnalyzeSampleContext *context);
+
+ void (*scan_analyze_sample_tuple) (int pos, bool replace,
+ AnalyzeSampleContext *context);
+
+ void (*scan_analyze_endscan) (AnalyzeSampleContext *context);
/* see table_index_build_range_scan for reference about parameters */
double (*index_build_range_scan) (Relation table_rel,
@@ -1474,6 +1537,12 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
}
+static inline void
+table_scan_analyze_beginscan(Relation rel, struct AnalyzeSampleContext *context)
+{
+ rel->rd_tableam->scan_analyze_beginscan(rel, context);
+}
+
/*
* Prepare to analyze block `blockno` of `scan`. The scan needs to have been
* started with table_beginscan_analyze(). Note that this routine might
@@ -1483,11 +1552,10 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
* Returns false if block is unsuitable for sampling, true otherwise.
*/
static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
- BufferAccessStrategy bstrategy)
+table_scan_analyze_next_block(BlockNumber blockno,
+ struct AnalyzeSampleContext *context)
{
- return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
- bstrategy);
+ return context->scan->rs_rd->rd_tableam->scan_analyze_next_block(blockno, context);
}
/*
@@ -1501,13 +1569,21 @@ table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
* tuples.
*/
static inline bool
-table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
- double *liverows, double *deadrows,
- TupleTableSlot *slot)
+table_scan_analyze_next_tuple(TransactionId OldestXmin, AnalyzeSampleContext *context)
+{
+ return context->scan->rs_rd->rd_tableam->scan_analyze_next_tuple(OldestXmin, context);
+}
+
+static inline void
+table_scan_analyze_sample_tuple(Index sample, bool replace, AnalyzeSampleContext *context)
+{
+ context->scan->rs_rd->rd_tableam->scan_analyze_sample_tuple(sample, replace, context);
+}
+
+static inline void
+table_scan_analyze_endscan(AnalyzeSampleContext *context)
{
- return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
- liverows, deadrows,
- slot);
+ context->scan->rs_rd->rd_tableam->scan_analyze_endscan(context);
}
/*
@@ -1783,6 +1859,32 @@ extern void table_block_relation_estimate_size(Relation rel,
Size usable_bytes_per_page);
/* ----------------------------------------------------------------------------
+ * Helper functions to implement analyze scan.
+ * ----------------------------------------------------------------------------
+ */
+extern AnalyzeSampleContext *
+CreateAnalyzeSampleContext(Relation onerel, List *cols, int targrows,
+ BufferAccessStrategy strategy);
+extern void DestroyAnalyzeSampleContext(AnalyzeSampleContext *context);
+extern TupleTableSlot * AnalyzeGetSampleSlot(AnalyzeSampleContext *context,
+ Relation onerel, AnalyzeSampleType type);
+extern void AnalyzeRecordSampleRow(AnalyzeSampleContext *context,
+ TupleTableSlot *sample_slot,
+ HeapTuple sample_tuple,
+ AnalyzeSampleType type, int pos,
+ bool replace, bool withtid);
+extern void InitAnalyzeSampleContextForChild(AnalyzeSampleContext *context,
+ Relation child,
+ int childtargrows);
+extern void AnalyzeGetSampleStats(AnalyzeSampleContext *context,
+ int *totalsampledrows,
+ double *totalrows,
+ double *totaldeadrows);
+extern HeapTuple *
+AnalyzeGetSampleRows(AnalyzeSampleContext *context, AnalyzeSampleType type, int offset);
+extern bool AnalyzeSampleIsValid(AnalyzeSampleContext *context, AnalyzeSampleType type);
+
+/* ----------------------------------------------------------------------------
* Functions in tableamapi.c
* ----------------------------------------------------------------------------
*/
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 8226860..e0da119 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -18,6 +18,7 @@
/* To avoid including explain.h here, reference ExplainState thus: */
struct ExplainState;
+struct AnalyzeSampleContext;
/*
@@ -139,10 +140,8 @@ typedef void (*ExplainForeignModify_function) (ModifyTableState *mtstate,
typedef void (*ExplainDirectModify_function) (ForeignScanState *node,
struct ExplainState *es);
-typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows,
- double *totaldeadrows);
+typedef void (*AcquireSampleRowsFunc) (Relation relation, int elevel,
+ struct AnalyzeSampleContext *context);
typedef bool (*AnalyzeForeignTable_function) (Relation relation,
AcquireSampleRowsFunc *func,
--
1.8.3.1
Attachment: 0002-Planner-can-estimate-the-pages-based-on-the-columns-.patch
From f347347cff55b7e12d7031be10b7d1fd4f4f3ea0 Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Wed, 20 Nov 2019 06:43:33 -0500
Subject: [PATCH 2/3] Planner can estimate the pages based on the columns
selected
The planner used to assume that all pages must be scanned even when a
query needs only one or two columns. That is correct for heap tables;
however, with a column store like zedstore, we can estimate the number
of pages from the selected columns alone, which reduces the I/O cost
and, in some cases, the number of parallel workers.
To do this, this commit adds a new field `stadiskfrac` to the
`pg_statistic` catalog; it records the fraction of the table's
physical size that a column occupies. The planner calculates a page
selectivity based on the targetlist and baserestrictinfo, then scales
the rel->pages obtained from estimate_rel_size() by it.
---
src/backend/commands/analyze.c | 52 +++++++++++++++++++++++++++++++++++
src/backend/optimizer/path/allpaths.c | 45 ++++++++++++++++++++++++++++++
src/include/catalog/catversion.h | 2 +-
src/include/catalog/pg_statistic.h | 3 ++
src/include/commands/vacuum.h | 6 ++++
src/include/nodes/parsenodes.h | 1 +
6 files changed, 108 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index cc1649d..9a8ae36 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -87,6 +87,9 @@ static void do_analyze_rel(Relation onerel,
VacuumParams *params, List *va_cols,
AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
bool inh, bool in_outer_xact, int elevel);
+static void compute_disk_stats(VacAttrStats **stats, int natts,
+ TupleDesc desc, HeapTuple *rows,
+ int numrows);
static void compute_index_stats(Relation onerel, double totalrows,
AnlIndexData *indexdata, int nindexes,
HeapTuple *rows, int numrows,
@@ -560,6 +563,19 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
MemoryContextResetAndDeleteChildren(col_context);
}
+ /* compute disksize ratio stats if any */
+ if (AnalyzeSampleIsValid(sample_context, ANALYZE_SAMPLE_DISKSIZE))
+ {
+ TupleTableSlot *slot =
+ AnalyzeGetSampleSlot(sample_context, onerel, ANALYZE_SAMPLE_DISKSIZE);
+ HeapTuple *rows =
+ AnalyzeGetSampleRows(sample_context, ANALYZE_SAMPLE_DISKSIZE, 0);
+
+ compute_disk_stats(vacattrstats, attr_cnt,
+ slot->tts_tupleDescriptor,
+ rows, numrows);
+ }
+
if (hasindex)
compute_index_stats(onerel, totalrows,
indexdata, nindexes,
@@ -705,6 +721,41 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
anl_context = NULL;
}
+static void
+compute_disk_stats(VacAttrStats **stats, int natts,
+ TupleDesc desc, HeapTuple *rows,
+ int numrows)
+{
+ int i, j;
+ float8 attr_size = 0;
+ float8 total = 0;
+ bool isNull;
+
+ for (i = 0; i < numrows; i++)
+ {
+ HeapTuple tup = rows[i];
+
+ for (j = 0; j < natts; j++)
+ {
+ VacAttrStats *vac = stats[j];
+ Datum dat = heap_getattr(tup, j + 1, desc, &isNull);
+
+ if (!isNull)
+ {
+ attr_size = DatumGetFloat8(dat);
+ vac->disksize += attr_size;
+ total += attr_size;
+ }
+ }
+ }
+
+ for (j = 0; j < natts; j++)
+ {
+ VacAttrStats *vac = stats[j];
+ vac->stadiskfrac = vac->disksize / total;
+ }
+}
+
/*
* Compute statistics about indexes of a relation
*/
@@ -1394,6 +1445,7 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
values[Anum_pg_statistic_staattnum - 1] = Int16GetDatum(stats->attr->attnum);
values[Anum_pg_statistic_stainherit - 1] = BoolGetDatum(inh);
values[Anum_pg_statistic_stanullfrac - 1] = Float4GetDatum(stats->stanullfrac);
+ values[Anum_pg_statistic_stadiskfrac - 1] = Float4GetDatum(stats->stadiskfrac);
values[Anum_pg_statistic_stawidth - 1] = Int32GetDatum(stats->stawidth);
values[Anum_pg_statistic_stadistinct - 1] = Float4GetDatum(stats->stadistinct);
i = Anum_pg_statistic_stakind1 - 1;
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index db3a68a..debb116 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -23,6 +23,7 @@
#include "catalog/pg_class.h"
#include "catalog/pg_operator.h"
#include "catalog/pg_proc.h"
+#include "catalog/pg_statistic.h"
#include "foreign/fdwapi.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
@@ -47,6 +48,7 @@
#include "partitioning/partbounds.h"
#include "partitioning/partprune.h"
#include "rewrite/rewriteManip.h"
+#include "utils/syscache.h"
#include "utils/lsyscache.h"
@@ -80,6 +82,9 @@ static void set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
Index rti, RangeTblEntry *rte);
static void set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel,
RangeTblEntry *rte);
+static void set_plain_rel_page_estimates(PlannerInfo *root,
+ RelOptInfo *rel,
+ RangeTblEntry *rte);
static void create_plain_partial_paths(PlannerInfo *root, RelOptInfo *rel);
static void set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel,
RangeTblEntry *rte);
@@ -581,6 +586,46 @@ set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
/* Mark rel with estimated output rows, width, etc */
set_baserel_size_estimates(root, rel);
+
+ /* Estimate the pages based on the selected columns */
+ set_plain_rel_page_estimates(root, rel, rte);
+}
+
+static void
+set_plain_rel_page_estimates(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
+{
+ double pages;
+ HeapTuple tp;
+ AttrNumber attno;
+ Selectivity sel = 0;
+
+ if (!rte->scanCols)
+ return;
+
+ attno = -1;
+ while ((attno = bms_next_member(rte->scanCols, attno)) >= 0)
+ {
+ tp = SearchSysCache3(STATRELATTINH,
+ ObjectIdGetDatum(rte->relid),
+ Int16GetDatum(attno),
+ BoolGetDatum(rte->inh));
+
+ if (HeapTupleIsValid(tp))
+ {
+ sel += ((Form_pg_statistic) GETSTRUCT(tp))->stadiskfrac;
+ ReleaseSysCache(tp);
+ }
+ }
+
+ if (sel > 0)
+ {
+ pages = rel->pages * sel;
+
+ if (pages <= 1.0)
+ rel->pages = 1;
+ else
+ rel->pages = rint(pages);
+ }
}
/*
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 304d136..12d2494 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201912061
+#define CATALOG_VERSION_NO 202002141
#endif
diff --git a/src/include/catalog/pg_statistic.h b/src/include/catalog/pg_statistic.h
index 207be54..66029f6 100644
--- a/src/include/catalog/pg_statistic.h
+++ b/src/include/catalog/pg_statistic.h
@@ -36,6 +36,9 @@ CATALOG(pg_statistic,2619,StatisticRelationId)
/* the fraction of the column's entries that are NULL: */
float4 stanullfrac;
+ /* fraction of the total on-disk size of all columns taken by this column */
+ float4 stadiskfrac;
+
/*
* stawidth is the average width in bytes of non-null entries. For
* fixed-width datatypes this is of course the same as the typlen, but for
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 128f7ae..077a3c1 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -114,6 +114,12 @@ typedef struct VacAttrStats
Datum *stavalues[STATISTIC_NUM_SLOTS];
/*
+ * These fields are to be filled in compute_disk_stats
+ */
+ float4 stadiskfrac; /* fraction of the physical size */
+ float8 disksize; /* value of the physical size */
+
+ /*
* These fields describe the stavalues[n] element types. They will be
* initialized to match attrtypid, but a custom typanalyze function might
* want to store an array of something other than the analyzed column's
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ff626cb..ddb0b7d 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1100,6 +1100,7 @@ typedef struct RangeTblEntry
Bitmapset *updatedCols; /* columns needing UPDATE permission */
Bitmapset *extraUpdatedCols; /* generated columns being updated */
List *securityQuals; /* security barrier quals to apply, if any */
+ Bitmapset *scanCols;
} RangeTblEntry;
/*
--
1.8.3.1
Attachment: 0003-ZedStore-use-extended-ANAlYZE-API.patch
From 471b1ba4bb704aac3d6128263ac2dbab103c13e8 Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Wed, 20 Nov 2019 06:59:22 -0500
Subject: [PATCH 3/3] ZedStore: use extended ANALYZE API
1) use the logical block ID in ANALYZE
2) provide per-column disk-size information during ANALYZE, so the
planner can estimate the number of pages to scan based on the
selected columns
3) allow analyzing only the specified columns
---
src/backend/access/zedstore/zedstore_attstream.c | 7 +-
src/backend/access/zedstore/zedstoream_handler.c | 118 ++++++++++++++++++++---
src/include/access/zedstore_internal.h | 4 +
3 files changed, 115 insertions(+), 14 deletions(-)
diff --git a/src/backend/access/zedstore/zedstore_attstream.c b/src/backend/access/zedstore/zedstore_attstream.c
index b659c95..7a1a1a9 100644
--- a/src/backend/access/zedstore/zedstore_attstream.c
+++ b/src/backend/access/zedstore/zedstore_attstream.c
@@ -167,6 +167,7 @@ decode_attstream_begin(attstream_decoder *decoder, ZSAttStream *attstream)
attstream->t_size - SizeOfZSAttStreamHeader,
attstream->t_decompressed_bufsize);
decoder->chunks_len = attstream->t_decompressed_size;
+ decoder->compression_ratio = ((float8) buf_size_needed) / attstream->t_size;
}
else
{
@@ -174,6 +175,7 @@ decode_attstream_begin(attstream_decoder *decoder, ZSAttStream *attstream)
((char *) attstream) + SizeOfZSAttStreamHeader,
attstream->t_size - SizeOfZSAttStreamHeader);
decoder->chunks_len = attstream->t_size - SizeOfZSAttStreamHeader;
+ decoder->compression_ratio = 1.0;
}
decoder->firsttid = get_chunk_first_tid(decoder->attlen, decoder->chunks_buf);
decoder->lasttid = attstream->t_lasttid;
@@ -182,6 +184,7 @@ decode_attstream_begin(attstream_decoder *decoder, ZSAttStream *attstream)
decoder->prevtid = 0;
decoder->num_elements = 0;
+ decoder->avg_elements_size = 0;
}
/*
@@ -227,6 +230,7 @@ decode_attstream_cont(attstream_decoder *decoder)
zstid lasttid;
int total_decoded;
char *p;
+ char *lastp;
char *pend;
MemoryContext oldcxt;
@@ -237,7 +241,7 @@ decode_attstream_cont(attstream_decoder *decoder)
MemoryContextSwitchTo(decoder->tmpcxt);
}
- p = decoder->chunks_buf + decoder->pos;
+ lastp = p = decoder->chunks_buf + decoder->pos;
pend = decoder->chunks_buf + decoder->chunks_len;
total_decoded = 0;
@@ -262,6 +266,7 @@ decode_attstream_cont(attstream_decoder *decoder)
Assert(p <= pend);
decoder->num_elements = total_decoded;
+ decoder->avg_elements_size = ((p - lastp) / total_decoded) / decoder->compression_ratio;
decoder->pos = p - decoder->chunks_buf;
if (total_decoded > 0)
{
diff --git a/src/backend/access/zedstore/zedstoream_handler.c b/src/backend/access/zedstore/zedstoream_handler.c
index 0b59191..e844a31 100644
--- a/src/backend/access/zedstore/zedstoream_handler.c
+++ b/src/backend/access/zedstore/zedstoream_handler.c
@@ -35,6 +35,7 @@
#include "miscadmin.h"
#include "optimizer/plancat.h"
#include "pgstat.h"
+#include "parser/parse_relation.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "storage/procarray.h"
@@ -2420,34 +2421,110 @@ zedstoream_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
zsbt_tuplebuffer_flush(NewHeap);
}
+static void
+zedstoream_scan_analyze_beginscan(Relation onerel, AnalyzeSampleContext *context)
+{
+ zstid tid;
+ List *va_cols = context->anl_cols;
+ Bitmapset *project_columns = NULL;
+
+ /* zedstore can sample rows on specified columns only */
+ if (!va_cols)
+ context->scan = table_beginscan_analyze(onerel);
+ else
+ {
+ ListCell *le;
+
+ foreach(le, va_cols)
+ {
+ char *col = strVal(lfirst(le));
+
+ project_columns =
+ bms_add_member(project_columns, attnameAttNum(onerel, col, false));
+ }
+
+ context->scan =
+ zedstoream_beginscan_with_column_projection(onerel, NULL, 0, NULL,
+ NULL, SO_TYPE_ANALYZE,
+ project_columns);
+ }
+
+ /* zedstore uses a logical block number to acquire sample rows */
+ tid = zsbt_get_last_tid(onerel);
+ context->totalblocks = ZSTidGetBlockNumber(tid) + 1;
+}
+
/*
- * FIXME: The ANALYZE API is problematic for us. acquire_sample_rows() calls
- * RelationGetNumberOfBlocks() directly on the relation, and chooses the
- * block numbers to sample based on that. But the logical block numbers
- * have little to do with physical ones in zedstore.
+ * Get next logical block.
*/
static bool
-zedstoream_scan_analyze_next_block(TableScanDesc sscan, BlockNumber blockno,
- BufferAccessStrategy bstrategy)
+zedstoream_scan_analyze_next_block(BlockNumber blockno,
+ AnalyzeSampleContext *context)
{
- return zs_blkscan_next_block(sscan, blockno, NULL, -1, false);
+ return zs_blkscan_next_block(context->scan, blockno, NULL, -1, false);
}
static bool
-zedstoream_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin,
- double *liverows, double *deadrows,
- TupleTableSlot *slot)
+zedstoream_scan_analyze_next_tuple(TransactionId OldestXmin, AnalyzeSampleContext *context)
{
- bool result;
+ int i;
+ bool result;
+ AttrNumber attno;
+ TableScanDesc scan = context->scan;
+ ZedStoreDesc sscan = (ZedStoreDesc) scan;
+ ZSAttrTreeScan *attr_scan;
+ TupleTableSlot *slot = AnalyzeGetSampleSlot(context, scan->rs_rd, ANALYZE_SAMPLE_DATA);
- result = zs_blkscan_next_tuple(sscan, slot);
+ result = zs_blkscan_next_tuple(scan, slot);
if (result)
- (*liverows)++;
+ {
+ /* provide extra disk info when analyzing on full columns */
+ if (!context->anl_cols)
+ {
+ slot = AnalyzeGetSampleSlot(context, scan->rs_rd, ANALYZE_SAMPLE_DISKSIZE);
+
+ for (i = 1; i < sscan->proj_data.num_proj_atts; i++)
+ {
+ attr_scan = &sscan->proj_data.attr_scans[i - 1];
+ attno = sscan->proj_data.proj_atts[i];
+
+ slot->tts_values[attno - 1] =
+ Float8GetDatum(attr_scan->decoder.avg_elements_size);
+ slot->tts_isnull[attno - 1] = false;
+ slot->tts_flags &= ~TTS_FLAG_EMPTY;
+ }
+ }
+
+ context->liverows++;
+ }
return result;
}
+static void
+zedstoream_scan_analyze_sample_tuple(int pos, bool replace, AnalyzeSampleContext *context)
+{
+ TupleTableSlot *slot;
+ Relation onerel = context->scan->rs_rd;
+
+ slot = AnalyzeGetSampleSlot(context, onerel, ANALYZE_SAMPLE_DATA);
+ AnalyzeRecordSampleRow(context, slot, NULL, ANALYZE_SAMPLE_DATA, pos, replace, false);
+
+ /* record the disk-size sample only when analyzing all columns */
+ if (!context->anl_cols)
+ {
+ slot = AnalyzeGetSampleSlot(context, onerel, ANALYZE_SAMPLE_DISKSIZE);
+ AnalyzeRecordSampleRow(context, slot, NULL, ANALYZE_SAMPLE_DISKSIZE, pos, replace, false);
+ }
+}
+
+static void
+zedstoream_scan_analyze_endscan(AnalyzeSampleContext *context)
+{
+ table_endscan(context->scan);
+}
+
/* ------------------------------------------------------------------------
* Miscellaneous callbacks for the heap AM
* ------------------------------------------------------------------------
@@ -2713,6 +2790,18 @@ zs_blkscan_next_tuple(TableScanDesc sscan, TupleTableSlot *slot)
if (scan->bmscan_nexttuple >= scan->bmscan_ntuples)
return false;
+
+ /*
+ * Initialize the slot.
+ *
+ * We initialize all columns to NULL. The values for columns that are projected
+ * will be set to the actual values below, but it's important that non-projected
+ * columns are NULL.
+ */
+ ExecClearTuple(slot);
+ for (int i = 0; i < sscan->rs_rd->rd_att->natts; i++)
+ slot->tts_isnull[i] = true;
+
/*
* projection attributes were created based on Relation tuple descriptor
* it better match TupleTableSlot.
@@ -2935,8 +3024,11 @@ static const TableAmRoutine zedstoream_methods = {
.relation_copy_data = zedstoream_relation_copy_data,
.relation_copy_for_cluster = zedstoream_relation_copy_for_cluster,
.relation_vacuum = zedstoream_vacuum_rel,
+ .scan_analyze_beginscan = zedstoream_scan_analyze_beginscan,
.scan_analyze_next_block = zedstoream_scan_analyze_next_block,
.scan_analyze_next_tuple = zedstoream_scan_analyze_next_tuple,
+ .scan_analyze_sample_tuple = zedstoream_scan_analyze_sample_tuple,
+ .scan_analyze_endscan = zedstoream_scan_analyze_endscan,
.index_build_range_scan = zedstoream_index_build_range_scan,
.index_validate_scan = zedstoream_index_validate_scan,
diff --git a/src/include/access/zedstore_internal.h b/src/include/access/zedstore_internal.h
index 21ea504..58227bd 100644
--- a/src/include/access/zedstore_internal.h
+++ b/src/include/access/zedstore_internal.h
@@ -78,6 +78,9 @@ typedef struct
char *chunks_buf;
int chunks_buf_size;
+ /* attstream compression ratio */
+ float8 compression_ratio;
+
/* information about the current attstream in the buffer */
int chunks_len;
zstid firsttid;
@@ -96,6 +99,7 @@ typedef struct
Datum datums[DECODER_MAX_ELEMS];
bool isnulls[DECODER_MAX_ELEMS];
int num_elements;
+ float8 avg_elements_size; /* avg physical size of elements */
} attstream_decoder;
/*
--
1.8.3.1
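The decoder bookkeeping added in patch 3 (compression_ratio plus
avg_elements_size) can be sketched standalone as follows. This is a
simplified model with hypothetical names, not the actual zedstore code.

```c
/*
 * Simplified model of the attstream decoder bookkeeping: the decoder
 * remembers how much the stream expanded on decompression, and derives the
 * average *physical* (on-disk) size of an element by dividing the decoded
 * bytes per element back by that compression ratio.
 */
static double
avg_element_disk_size(int decoded_bytes, int nelements,
                      int stored_bytes, int decompressed_bytes)
{
    /* decompressed bytes per stored byte; 1.0 for uncompressed streams */
    double compression_ratio = (double) decompressed_bytes / stored_bytes;

    return ((double) decoded_bytes / nelements) / compression_ratio;
}
```

For example, 100 elements decoded from 400 bytes of a stream that expanded
from 500 stored bytes to 1000 decompressed bytes average 2 on-disk bytes
per element; these averages are what feed stadiskfrac during ANALYZE.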
On Fri, Jan 31, 2020 at 9:52 AM Melanie Plageman <melanieplageman@gmail.com>
wrote:
I'm bumping this to the next commitfest because I haven't had a chance
to address the questions posed by Dmitry. I'm sure I'll get to it in
the next few weeks.

I believe it would be beneficial to add this potential API extension
patch into the thread (as an example of an interface defining how
scanCols could be used) and review them together.
As for including some code that uses the scanCols, after discussing
off-list with a few folks, there are three options I would like to
pursue for doing this.

One option I will pursue is using the scanCols to inform the columns
needed to be spilled for memory-bounded hashagg (mentioned by Jeff
here [1]).

The third is exercising it with a test only but providing an example
of how a table AM API user like Zedstore uses the columns during
planning.
Outside of the use case that Pengzhou has provided in [1], we started
looking into using scanCols for extracting the subset of columns
needed in two cases:
1) columns required to be spilled for memory-bounded hashagg
2) referenced CTE columns which must be materialized into tuplestore
However, these optimizations wouldn't work with the scanCols patch's
current implementation.
The scanCols are extracted from PlannerInfo->simple_rel_array and
PlannerInfo->simple_rte_array, at which point, we have no way of
knowing if the column was aggregated or if it was part of a CTE or
anything else about how it is used in the query.
We could solve this by creating multiple bitmaps at the time that we
create the scanCols field -- one for aggregated columns, one for
unaggregated columns, one for CTEs. However, that seems like it would
add a lot of extra complexity to the common code path during planning.
Basically, scanCols are simply columns that need to be scanned. It is
probably okay if it is only used by table access method API users, as
Pengzhou's patch illustrates.
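To make that concrete, here is a minimal standalone sketch of how a table
AM might build and iterate such a set of columns to scan. The names are
hypothetical, and a uint64_t stands in for PostgreSQL's Bitmapset, mirroring
the bms_add_member()/bms_next_member() pattern used in the patches.

```c
#include <stdint.h>

/*
 * One bit per attribute number; attribute numbers are limited to 0..63 in
 * this toy representation.
 */
typedef uint64_t scancols_t;

static void
scancols_add(scancols_t *cols, int attno)
{
    *cols |= (scancols_t) 1 << attno;
}

/* return the next member greater than prevattno, or -1 when exhausted */
static int
scancols_next(scancols_t cols, int prevattno)
{
    for (int attno = prevattno + 1; attno < 64; attno++)
        if (cols & ((scancols_t) 1 << attno))
            return attno;
    return -1;
}
```

An AM scan would loop `for (attno = scancols_next(cols, -1); attno >= 0;
attno = scancols_next(cols, attno))` and fetch only those columns, which is
exactly the shape of the bms_next_member() loop in patch 0002.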
Given that we have addressed the feedback about showing a use case,
this patch is probably ready for a once-over from Dmitry again. (It is
registered for the March fest).
[1]: /messages/by-id/CAG4reAQc9vYdmQXh=1D789x8XJ=gEkV+E+fT9+s9tOWDXX3L9Q@mail.gmail.com
--
Melanie Plageman
On Tue, Feb 18, 2020 at 03:26:16PM -0800, Melanie Plageman wrote:
I believe it would be beneficial to add this potential API extension
patch into the thread (as an example of an interface defining how
scanCols could be used) and review them together.

As for including some code that uses the scanCols, after discussing
off-list with a few folks, there are three options I would like to
pursue for doing this.

One option I will pursue is using the scanCols to inform the columns
needed to be spilled for memory-bounded hashagg (mentioned by Jeff
here [1]).

The third is exercising it with a test only but providing an example
of how a table AM API user like Zedstore uses the columns during
planning.

Basically, scanCols are simply columns that need to be scanned. It is
probably okay if it is only used by table access method API users, as
Pengzhou's patch illustrates.
Thanks for update. Sure, that would be fine. At the moment I have couple
of intermediate commentaries.
In general implemented functionality looks good. I've checked how it
works on the existing tests, almost everywhere required columns were not
missing in scanCols (which is probably the most important part).
Sometimes expressions were checked multiple times, which could
potentially introduce some overhead, but I believe this effect is
negligible. Just to mention some counterintuitive examples, for this
kind of query
SELECT min(q1) FROM INT8_TBL;
the same column was checked 5 times in my tests, since it's present also
in baserestrictinfo, and build_minmax_path does one extra round of
planning and invoking make_one_rel. I've also noticed that for
partitioned tables every partition is evaluated separately. IIRC they
structure cannot differ, does it makes sense then? Another interesting
example is Values Scan (e.g. in an insert statements with multiple
records), can an abstract table AM user leverage information about
columns in it?
One case, where I believe columns were missing, is statements with
returning:
INSERT INTO foo (col1)
VALUES ('col1'), ('col2'), ('col3')
RETURNING *;
Looks like in this situation the only expression in the reltarget is
for col1, but the returning result contains all columns.
And just out of curiosity, what do you think about table AM specific
columns e.g. ctid, xmin/xmax etc? I mean, they're not included into
scanCols and should not be since they're heap AM related. But is there a
chance that there would be some AM specific columns relevant to e.g.
the columnar storage that would also make sense to put into scanCols?
On Fri, Mar 13, 2020 at 12:09 PM Dmitry Dolgov <9erthalion6@gmail.com>
wrote:
In general implemented functionality looks good. I've checked how it
works on the existing tests, almost everywhere required columns were not
missing in scanCols (which is probably the most important part).
Sometimes expressions were checked multiple times, which could
potentially introduce some overhead, but I believe this effect is
negligible. Just to mention some counterintuitive examples, for this
kind of query

SELECT min(q1) FROM INT8_TBL;
the same column was checked 5 times in my tests, since it's present also
in baserestrictinfo, and build_minmax_path does one extra round of
planning and invoking make_one_rel.
Thanks so much for the very thorough review, Dmitry.
These extra calls to extract_scan_cols() should be okay in this case.
As you mentioned, for min/max queries, planner attempts an optimization
with an indexscan and, to do this, it modifies the querytree and then
calls query_planner() on it.
It tries it with NULLs first and then NULLs last -- each of which
invokes query_planner(), so that is two out of three calls. The
third is the normal invocation. I'm not sure how you would get five,
though.
Another interesting
example is Values Scan (e.g. in an insert statement with multiple
records); can an abstract table AM user leverage information about
columns in it?

One case, where I believe columns were missing, is statements with
returning:

INSERT INTO foo (col1)
VALUES ('col1'), ('col2'), ('col3')
RETURNING *;

Looks like in this situation the only expression in the reltarget is
for col1, but the returning result contains all columns.
This relates to both of your above points:
For this RETURNING query, it is a ValuesScan, so no columns have to be
scanned. We actually do add the reference to col1 to the scanCols
bitmap, though. I'm not sure we want to do so, since we don't want to
scan col1 in this case. I wonder what cases we would miss if we special
cased RTEKind RTE_VALUES when doing extract_scan_cols().
--
Melanie
On Fri, Mar 13, 2020 at 12:09 PM Dmitry Dolgov <9erthalion6@gmail.com>
wrote:
I've also noticed that for partitioned tables every partition is
evaluated separately. IIRC their structure cannot differ, so does it
make sense then?
Good point. At some point, we had discussed only collecting the columns
for one of the child partitions and then using that for all partitions.
It is definitely worthwhile to try that optimization.
Another interesting
example is Values Scan (e.g. in an insert statement with multiple
records); can an abstract table AM user leverage information about
columns in it?
For ValuesScan, I can't see a use case yet for including those in
scanCols, since it is not required to scan the existing columns in the
table, but I am open to a use case.
One case, where I believe columns were missing, is statements with
returning:

INSERT INTO foo (col1)
VALUES ('col1'), ('col2'), ('col3')
RETURNING *;

Looks like in this situation the only expression in the reltarget is
for col1, but the returning result contains all columns.
So, you are correct, for INSERT and DELETE queries with RETURNING, the
scanCols should only include columns needed to INSERT or DELETE data.
For DELETE RETURNING, the RETURNING expression is executed separately in
ExecDelete, so scanCols should basically reflect only what needs to be
scanned to evaluate the qual.
For INSERT RETURNING, the scanCols should only reflect those columns
needing to be scanned to perform the INSERT.
There are several reasons why different kinds of RETURNING queries have
too many scanCols. Soumyadeep and I did some investigation into this.
Given:
t1(a, b)
t2(a, b, c)
For:
INSERT INTO t1(a) VALUES (1) RETURNING *;
There is no need to scan t1(a) in order to satisfy the RETURNING
expression here. Because this INSERT doesn't go through make_one_rel(),
scanCols shouldn't be populated.
For:
INSERT INTO t2 SELECT a,a,a FROM t1 RETURNING *;
For this INSERT, the patch correctly identifies that no columns from t2
need be scanned and only t1(a) needs be scanned.
For:
DELETE FROM t1 WHERE a = 2 RETURNING *;
scanCols correctly has only t1(a) which is needed to evaluate the qual.
For:
DELETE FROM t1 USING t2 where a = a RETURNING *;
scanCols should have t1(a) and t2(a), however, it has t1(a) and t2(a,b,c).
This is because in preprocess_targetlist(), all of the returningList
vars from t2 are added to the query tree processed_tlist to make sure
the RETURNING expression can be evaluated later.
However, the query tree processed_tlist items for each rel seem to be
added to the RelOptInfo for t2 later, so, in extract_scan_columns(), we
mistakenly add these to the scanCols.
This is tricky to change because we don't want to change what gets added
to the RelOptInfo->reltarget->exprs (I think), so, we'd need some way to
know which columns are from the RETURNING expression, which are from the
qual, etc. And, we'd need to know the query was a DELETE (since an
UPDATE would need all the columns anyway with the current API, for
example). This is pretty different than the current logic in
extract_scan_cols().
A separate issue with DELETE USING RETURNING queries scanCols arises
with partition tables.
Given this additional table:
t(a, b, c) partition table with partitions
tp1(a, b, c) and
tp2(a, b, c)
the same query above with different relations
DELETE FROM t USING t1 WHERE a = a RETURNING *;
scanCols will say it requires t(a,b,c) and t1(a,b) (all of the columns
in both relations).
t1 columns are wrong for the same reason as in the non-partition example
query described above.
However, the partition table scanCols are wrong for a related but
different reason. This, however, is much more easily fixed. The same
code that adds the returningList to the processed_tlist for a querytree
in preprocess_targetlist() doesn't handle the case of an inherited
table or partition child (since their result_relation is set to imitate
a SELECT instead of an UPDATE/DELETE/INSERT). I started a different
thread on this topic [1].
Lastly is UPDATE...RETURNING queries.
UPDATE queries will require a scan of all of the columns with the
current table_tuple_lock() API (mentioned upthread).
And just out of curiosity, what do you think about table AM specific
columns e.g. ctid, xmin/xmax etc? I mean, they're not included into
scanCols and should not be since they're heap AM related. But is there a
chance that there would be some AM specific columns relevant to e.g.
the columnar storage that would also make sense to put into scanCols?
Maybe tid? But, I'm not sure. What do you think?
[1]: /messages/by-id/CAAKRu_Y+7qX4JzvfovsBE9T9_2c4kK1Bdda139oQ6cA5B-LTZA@mail.gmail.com
Hello,
Melanie and I will be posting a separate thread with the scanCols patch
listed here, a patch to capture the set of cols in RETURNING, and a
group of patches to pass down the list of cols to various table AM
functions together as a patch set. This will take some time. Thus, we
are deregistering this patch from the July commitfest and will come
back.
Regards,
Soumyadeep (VMware)