[Proposal] Table partition + join pushdown

Started by Taiki Kondo, over 10 years ago · 20 messages
#1 Taiki Kondo
tai-kondo@yk.jp.nec.com
4 attachment(s)

Hi all,

I saw the email about the idea from KaiGai-san [1],
and I worked to implement this idea.

I have now implemented part of this idea,
so I would like to propose this feature.

The attached patch just shows my concept of this feature.
It works fine for EXPLAIN but, sadly, returns wrong results for other operations.

Table partition + join pushdown
===============================

Motivation
----------
To make join logic work more effectively,
it is important to make the relations involved smaller.

Especially in hash join, it is worthwhile to make the inner relation smaller,
because a smaller inner relation fits in a smaller hash table.
This means that memory usage can be reduced when joining with big tables.

Design
------
This design was described in the email from KaiGai-san [1],
so I quote it below.

---- begin quotation ---
Let's assume a table which is partitioned to four portions,
and individual child relations have constraint by hash-value
of its ID field.

  tbl_parent
   + tbl_child_0 ... CHECK(hash_func(id) % 4 = 0)
   + tbl_child_1 ... CHECK(hash_func(id) % 4 = 1)
   + tbl_child_2 ... CHECK(hash_func(id) % 4 = 2)
   + tbl_child_3 ... CHECK(hash_func(id) % 4 = 3)

If someone tried to join another relation with tbl_parent
using an equivalence condition, like X = tbl_parent.ID, we
know that inner tuples which do not satisfy the condition
hash_func(X) % 4 = 0
can never be joined to the tuples in tbl_child_0.
So, we can omit loading these tuples into the inner hash
table beforehand, which potentially allows us to split the
inner hash table.

Current typical plan structure is below:

HashJoin
-> Append
-> SeqScan on tbl_child_0
-> SeqScan on tbl_child_1
-> SeqScan on tbl_child_2
-> SeqScan on tbl_child_3
-> Hash
-> SeqScan on other_table

It may be rewritable to:

Append
-> HashJoin
-> SeqScan on tbl_child_0
-> Hash ... Filter: hash_func(X) % 4 = 0
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_1
-> Hash ... Filter: hash_func(X) % 4 = 1
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_2
-> Hash ... Filter: hash_func(X) % 4 = 2
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_3
-> Hash ... Filter: hash_func(X) % 4 = 3
-> SeqScan on other_table
---- end quotation ---

In the quotation above, the filter is placed on the Hash node.
But I implemented it so that the filter is placed on the SeqScan node under the Hash node.
In my opinion, filtering tuples is the scanner's job.

Append
-> HashJoin
-> SeqScan on tbl_child_0
-> Hash
-> SeqScan on other_table ... Filter: hash_func(X) % 4 = 0
-> HashJoin
-> SeqScan on tbl_child_1
-> Hash
-> SeqScan on other_table ... Filter: hash_func(X) % 4 = 1
-> HashJoin
-> SeqScan on tbl_child_2
-> Hash
-> SeqScan on other_table ... Filter: hash_func(X) % 4 = 2
-> HashJoin
-> SeqScan on tbl_child_3
-> Hash
-> SeqScan on other_table ... Filter: hash_func(X) % 4 = 3

API
---
Three new internal (static) functions implement this feature.
try_hashjoin_pushdown(), the main function of this feature,
is called from try_hashjoin_path() and tries to push the HashPath
down under the AppendPath.

To do so, this function performs the following steps.

1. Check whether this hash join can be pushed down under the AppendPath.
2. To avoid influencing other path-making operations,
copy the inner path's RelOptInfo and make a new SeqScan path from it.
Here, get the CHECK() constraints from the OUTER path and convert their
Var nodes according to the join condition; also convert the Var nodes
in the join condition itself.
3. Create new HashPath nodes between each sub-path of the AppendPath and
the inner path made above.
4. When steps 1 to 3 are done for each sub-path,
create a new AppendPath whose sub-paths are the HashPath nodes made above.

get_replaced_clause_constr() is called from try_hashjoin_pushdown(),
and get_var_nodes_recurse() is called from get_replaced_clause_constr().
These two functions support the operations above.
(I may revise this part to use expression_tree_walker() and
expression_tree_mutator().)

The attached patch has the following limitations.
o It only works for hash join.
(I want to support not only hash join but also other join logic.)
o Join conditions must use the "=" operator on int4 columns.
o The inner path must be a SeqScan.
(I want to support other path nodes.)
o For now, the planner may not choose this plan,
because the estimated costs are usually larger than those of the original (non-pushdown) plan.

In addition, one internal (static) function, get_relation_constraints() defined in
plancat.c, is changed to global. This function is called from
get_replaced_clause_constr() to get the CHECK() constraints.

Usage
-----
To use this feature, create a partitioned table and a small table to join with it,
and run a SELECT that joins these tables.

For your convenience, I attach DDL and DML scripts.
I also attach the resulting EXPLAIN output.

Any comments are welcome. But first of all, I need your advice
on correcting this patch's behavior.

At the least, I think it has to expand the array of RangeTblEntry and other arrays defined
in PlannerInfo to register new RelOptInfos for the new Path nodes mentioned above.
Or is it a better choice to modify the query parser to take this feature further?

Remarks :
[1]: /messages/by-id/9A28C8860F777E439AA12E8AEA7694F8010F672B@BPXM15GP.gisp.nec.co.jp

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

Attachments:

EXPLAIN (Original).txt
EXPLAIN (Pushdown).txt
hashjoin_pushdown.v0.patch
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index a35c881..a3ef94c 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -21,6 +21,10 @@
 #include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "nodes/relation.h"
+#include "optimizer/clauses.h"
+#include "optimizer/plancat.h"
+#include "optimizer/restrictinfo.h"
 
 /* Hook for plugins to get control in add_paths_to_joinrel() */
 set_join_pathlist_hook_type set_join_pathlist_hook = NULL;
@@ -37,6 +41,15 @@ static void match_unsorted_outer(PlannerInfo *root, RelOptInfo *joinrel,
 static void hash_inner_and_outer(PlannerInfo *root, RelOptInfo *joinrel,
 					 RelOptInfo *outerrel, RelOptInfo *innerrel,
 					 JoinType jointype, JoinPathExtraData *extra);
+
+static List *get_var_nodes_recurse(Expr *one_check_constr, List *var_nodes);
+static List *get_replaced_clause_constr(PlannerInfo *root, OpExpr *one_joinclause,
+				List *restrictinfo, RelOptInfo *relinfo);
+static void try_hashjoin_pushdown(PlannerInfo *root, RelOptInfo *joinrel,
+				  Path *outer_path, Path *inner_path,
+				  List *hashclauses, JoinType jointype,
+				  JoinPathExtraData *extra);
+
 static List *select_mergejoin_clauses(PlannerInfo *root,
 						 RelOptInfo *joinrel,
 						 RelOptInfo *outerrel,
@@ -512,6 +525,384 @@ try_mergejoin_path(PlannerInfo *root,
 	}
 }
 
+static List *
+get_var_nodes_recurse(Expr *one_check_constr,
+			List *var_nodes)
+{
+	if (IsA(one_check_constr, OpExpr))
+	{
+		OpExpr *op = (OpExpr *) one_check_constr;
+		ListCell *l;
+		foreach(l, op->args)
+		{
+			var_nodes = get_var_nodes_recurse((Expr *) lfirst(l), var_nodes);
+		}
+	}
+	else
+	if (IsA(one_check_constr, Var))
+	{
+		var_nodes = lappend(var_nodes, one_check_constr);
+	}
+
+	return var_nodes;
+}
+
+static List *
+get_replaced_clause_constr(PlannerInfo *root,
+				OpExpr *one_joinclause,
+				List *restrictinfo,
+				RelOptInfo *relinfo)
+{
+	List *result = restrictinfo;
+	RangeTblEntry *childRTE = root->simple_rte_array[relinfo->relid];
+	List *check_constr =
+		get_relation_constraints(root, childRTE->relid, relinfo, false);
+	ListCell *l_vn, *l_constr;
+	Var *var_args[2];
+	int var_index;
+
+	Assert(one_joinclause->opno == 96); /* "96" means "=" for int4 */
+	Assert(list_length(one_joinclause->args) == 2);
+
+	for (var_index = 0; var_index < 2; var_index++)
+	{
+		var_args[var_index] = (Var *) list_nth(one_joinclause->args, var_index);
+	}
+	var_index = -1;
+
+	if (list_length(check_constr) == 0)
+	{
+		ListCell *l_vn;
+
+		foreach(l_vn, relinfo->reltargetlist)
+		{
+			Var *vn_temp = (Var *) lfirst(l_vn);
+			ListCell *l_app;
+
+			foreach (l_app, root->append_rel_list)
+			{
+				AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(l_app);
+
+				if (appinfo->child_relid != vn_temp->varno)
+					continue;
+
+				if (var_args[0]->varno == appinfo->parent_relid)
+					var_index = 0;
+				else
+				if (var_args[1]->varno == appinfo->parent_relid)
+					var_index = 1;
+
+				if (var_index >= 0)
+					break;
+			}
+
+			Assert(var_index >= 0);
+
+			if (var_args[var_index]->varattno != vn_temp->varattno)
+				continue;
+
+			var_args[var_index]->varno = vn_temp->varno;
+			var_args[var_index]->varattno = vn_temp->varattno;
+			break;
+		}
+		return restrictinfo;
+	}
+
+	foreach (l_constr, check_constr)
+	{
+		List *var_nodes =
+			get_var_nodes_recurse((Expr *) lfirst(l_constr), NIL);
+
+		foreach (l_vn, var_nodes)
+		{
+			Var *vn_temp = (Var *) lfirst(l_vn);
+			ListCell *l_app;
+
+			foreach (l_app, root->append_rel_list)
+			{
+				AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(l_app);
+
+				if (appinfo->child_relid != vn_temp->varno)
+					continue;
+
+				if (var_index < 0)
+				{
+					if (appinfo->parent_relid == var_args[0]->varno)
+						var_index = 1;
+					else
+					if (appinfo->parent_relid == var_args[1]->varno)
+						var_index = 0;
+				}
+
+				if (var_args[1 - var_index]->varno == appinfo->parent_relid)
+				{
+					var_args[1 - var_index]->varno = vn_temp->varno;
+					var_args[1 - var_index]->varattno = vn_temp->varattno;
+				}
+
+				vn_temp->varno = var_args[var_index]->varno;
+				vn_temp->varattno = var_args[var_index]->varattno;
+			}
+		}
+	}
+
+	result = list_concat(result, make_restrictinfos_from_actual_clauses(root, check_constr));
+
+	return result;
+}
+
+/*
+ * Try to push down HashPath under AppendPath.
+ */
+static void
+try_hashjoin_pushdown(PlannerInfo *root,
+				  RelOptInfo *joinrel,
+				  Path *outer_path,
+				  Path *inner_path,
+				  List *hashclauses,
+				  JoinType jointype,
+				  JoinPathExtraData *extra)
+{
+	AppendPath	*append_path;
+	Path		*other_path;
+	ListCell	*l;
+
+	List		*new_subpaths = NIL;
+
+	Assert(outer_path != inner_path);
+
+	if (IS_OUTER_JOIN(jointype))
+	{
+		/* TODO : Not supported yet, but we must support this pattern... */
+		elog(DEBUG1, "This join type is not supported... : %d", jointype);
+		return;
+	}
+
+	if (!IsA(outer_path, AppendPath))
+	{
+		/* Outer path must be AppendPath */
+		elog(DEBUG1, "Outer path must be AppendPath.");
+		return;
+	}
+
+	/* Check join clauses */
+	foreach (l, hashclauses)
+	{
+		RestrictInfo *rinfo = lfirst(l);
+		OpExpr *opexpr;
+		ListCell *ll;
+
+		if (!is_opclause(rinfo->clause))
+		{
+			elog(DEBUG1, "This join clause is not supported... : %d", (int) rinfo->clause->type);
+			return;
+		}
+
+		opexpr = (OpExpr *)(rinfo->clause);
+		if (opexpr->opno != 96) /* "96" means "=" for int4 */
+		{
+			elog(DEBUG1, "This operator is not supported... : %d", (int) opexpr->opno);
+			return;
+		}
+
+		foreach (ll, opexpr->args)
+		{
+			if (!IsA(lfirst(ll), Var))
+			{
+				elog(DEBUG1, "This expression is not supported... : %d", (int) ((Expr *) lfirst(ll))->type);
+				return;
+			}
+		}
+	}
+
+	append_path = (AppendPath *) outer_path;
+	other_path = inner_path;
+
+	if (other_path->pathtype != T_SeqScan)
+	{
+		/* TODO : Not supported yet, but we must support this pattern... */
+		elog(DEBUG1, "This pathtype is not supported yet... : %d", (int) other_path->pathtype);
+		return;
+	}
+
+	foreach(l, append_path->subpaths)
+	{
+		Path *one_of_subpaths = (Path *) lfirst(l);
+		if (one_of_subpaths->pathtype != T_SeqScan)
+		{
+			elog(DEBUG1, "This pathtype is not supported yet... : %d", (int) one_of_subpaths->pathtype);
+			return;
+		}
+	}
+
+	foreach(l, append_path->subpaths)
+	{
+		Path		*one_of_subpaths = (Path *) lfirst(l);
+		Path		*rewritten_opath;
+		RelOptInfo	*rel_sp = one_of_subpaths->parent;
+		RelOptInfo	*rel_op = other_path->parent;
+		RelOptInfo	*rel_op_rw;
+
+		Relids		new_joinrelids;
+		Relids		new_required_outer;
+		RelOptInfo	*new_joinrel;
+		SpecialJoinInfo	sjinfo_data;
+		List		*new_restrictinfos;
+		List		*new_restrictlist;
+		List		*new_hashclauses = NIL;
+		ListCell	*l_clause;
+		JoinCostWorkspace	workspace;
+
+		List	*old_join_rel_list = root->join_rel_list;
+		List	**old_join_rel_level = root->join_rel_level;
+
+		/* Create new RelOptInfo for inner path's SeqScan */
+		rel_op_rw = makeNode(RelOptInfo);
+		rel_op_rw->reloptkind = rel_op->reloptkind;
+		rel_op_rw->relids = bms_copy(rel_op->relids);
+		/* rows and width will be changed. Do set_baserel_estimates() later. */
+		rel_op_rw->consider_startup = rel_op->consider_startup;
+		rel_op_rw->consider_param_startup = false;
+		rel_op_rw->reltargetlist = copyObject(rel_op->reltargetlist);
+		rel_op_rw->pathlist = NIL;
+		rel_op_rw->ppilist = NIL;
+		rel_op_rw->cheapest_startup_path = NULL;
+		rel_op_rw->cheapest_total_path = NULL;
+		rel_op_rw->cheapest_unique_path = NULL;
+		rel_op_rw->cheapest_parameterized_paths = NIL;
+		rel_op_rw->relid = rel_op->relid;
+		rel_op_rw->rtekind = rel_op->rtekind;
+		rel_op_rw->min_attr = rel_op->min_attr;
+		rel_op_rw->max_attr = rel_op->max_attr;
+		rel_op_rw->attr_needed = (Relids *)
+			palloc0((rel_op->max_attr - rel_op->min_attr + 1) * sizeof(Relids));
+		rel_op_rw->attr_widths = (int32 *)
+			palloc0((rel_op->max_attr - rel_op->min_attr + 1) * sizeof(int32));
+		{
+			int i;
+			for (i = 0; i <= (rel_op->max_attr - rel_op->min_attr); i++)
+			{
+				rel_op_rw->attr_needed[i] = bms_copy(rel_op->attr_needed[i]);
+				rel_op_rw->attr_widths[i] = rel_op->attr_widths[i];
+			}
+		}
+		rel_op_rw->lateral_vars = copyObject(rel_op->lateral_vars);
+		rel_op_rw->lateral_relids = bms_copy(rel_op->lateral_relids);
+		rel_op_rw->lateral_referencers = bms_copy(rel_op->lateral_referencers);
+		rel_op_rw->indexlist = copyObject(rel_op->indexlist);
+		rel_op_rw->pages = rel_op->pages;
+		rel_op_rw->tuples = rel_op->tuples;
+		rel_op_rw->allvisfrac = rel_op->allvisfrac;
+		rel_op_rw->subplan = NULL;
+		rel_op_rw->subroot = NULL;
+		rel_op_rw->subplan_params = NIL;
+		rel_op_rw->serverid = rel_op->serverid;
+		rel_op_rw->fdwroutine = NULL;
+		rel_op_rw->fdw_private = NULL;
+		/* Base restrict List/Cost will be set later. */
+		rel_op_rw->joininfo = copyObject(rel_op->joininfo);
+		rel_op_rw->has_eclass_joins = rel_op->has_eclass_joins;
+
+		/* Estimate rows, width, and costs for inner path's SeqScan */
+		new_restrictinfos = copyObject(rel_op->baserestrictinfo);
+		foreach (l_clause, hashclauses)
+		{
+			OpExpr *new_clause = copyObject(((RestrictInfo *) lfirst(l_clause))->clause);
+
+			/* Create filters of inner path from CHECK() constraints of outer path */
+			new_restrictinfos =
+				get_replaced_clause_constr(root, new_clause,
+					new_restrictinfos, rel_sp);
+			new_hashclauses = lappend(new_hashclauses,
+				make_restrictinfo(new_clause, true, false, false, NULL, NULL, NULL));
+		}
+		rel_op_rw->baserestrictinfo = new_restrictinfos;
+		set_baserel_size_estimates(root, rel_op_rw);
+
+		sjinfo_data.type = T_SpecialJoinInfo;
+		sjinfo_data.min_lefthand = rel_sp->relids;
+		sjinfo_data.min_righthand = rel_op_rw->relids;
+		sjinfo_data.syn_lefthand = rel_sp->relids;
+		sjinfo_data.syn_righthand = rel_op_rw->relids;
+		sjinfo_data.jointype = JOIN_INNER;
+		/* we don't bother trying to make the remaining fields valid */
+		sjinfo_data.lhs_strict = false;
+		sjinfo_data.delay_upper_joins = false;
+		sjinfo_data.semi_can_btree = false;
+		sjinfo_data.semi_can_hash = false;
+		sjinfo_data.semi_operators = NIL;
+		sjinfo_data.semi_rhs_exprs = NIL;
+
+		/* Create NEW SeqScan path for inner path */
+		rewritten_opath = create_seqscan_path(root, rel_op_rw, rel_op_rw->lateral_relids);
+		add_path(rel_op_rw, rewritten_opath);
+
+		/* Create New HashPath between path under AppendPath and inner SeqScan path */
+		new_joinrelids = bms_union(rel_sp->relids, rel_op_rw->relids);
+
+		/*
+		 * For avoidance of failing assertion in allpaths.c.
+		 * We must keep root->join_rel_level[cur_level]->length == 1.
+		 */
+		root->join_rel_list = NIL;
+		root->join_rel_level = NULL;
+
+		new_joinrel = build_join_rel(root, new_joinrelids, rel_sp, rel_op_rw,
+							&sjinfo_data, &new_restrictlist);
+		list_free(root->join_rel_list);
+
+		root->join_rel_list = old_join_rel_list;
+		root->join_rel_level = old_join_rel_level;
+
+		new_required_outer = calc_non_nestloop_required_outer(one_of_subpaths, rewritten_opath);
+
+		initial_cost_hashjoin(root, &workspace, JOIN_INNER, new_hashclauses,
+					one_of_subpaths, rewritten_opath, &sjinfo_data, NULL);
+		if (add_path_precheck(joinrel,
+			workspace.startup_cost, workspace.total_cost,
+			NIL, new_required_outer))
+		{
+			HashPath *new_hj_path;
+
+			elog(DEBUG1, "add_path_precheck() returned TRUE.");
+			new_hj_path = create_hashjoin_path(root,
+							new_joinrel,
+							JOIN_INNER,
+							&workspace,
+							&sjinfo_data,
+							NULL,
+							one_of_subpaths,
+							rewritten_opath,
+							new_restrictlist,
+							new_required_outer,
+							new_hashclauses);
+
+			add_path(new_joinrel, (Path *)new_hj_path);
+			/* new_hj_path can be clobbered above. */
+			if (IsA((Node *)new_hj_path, HashPath))
+				new_subpaths = lappend(new_subpaths, new_hj_path);
+		}
+		else
+		{
+			bms_free(new_required_outer);
+		}
+	}
+
+	if (list_length(new_subpaths) > 0)
+	{
+		elog(DEBUG1, "Pushdown succeeded.");
+		add_path(joinrel, (Path *)
+			create_append_path(joinrel, new_subpaths, NULL));
+		set_cheapest(joinrel);
+	}
+	else
+	{
+		elog(DEBUG1, "Pushdown failed.");
+	}
+
+	return;
+}
+
 /*
  * try_hashjoin_path
  *	  Consider a hash join path; if it appears useful, push it into
@@ -543,6 +934,10 @@ try_hashjoin_path(PlannerInfo *root,
 		return;
 	}
 
+	/* Try to push HashJoin down under Append */
+	try_hashjoin_pushdown(root, joinrel,
+			outer_path, inner_path, hashclauses, jointype, extra);
+
 	/*
 	 * Independently of that, add parameterization needed for any
 	 * PlaceHolderVars that need to be computed at the join.
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 9442e5f..c137b09 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -54,9 +54,6 @@ get_relation_info_hook_type get_relation_info_hook = NULL;
 static bool infer_collation_opclass_match(InferenceElem *elem, Relation idxRel,
 							  List *idxExprs);
 static int32 get_rel_data_width(Relation rel, int32 *attr_widths);
-static List *get_relation_constraints(PlannerInfo *root,
-						 Oid relationObjectId, RelOptInfo *rel,
-						 bool include_notnull);
 static List *build_index_tlist(PlannerInfo *root, IndexOptInfo *index,
 				  Relation heapRelation);
 
@@ -1022,7 +1019,7 @@ get_relation_data_width(Oid relid, int32 *attr_widths)
  * run, and in many cases it won't be invoked at all, so there seems no
  * point in caching the data in RelOptInfo.
  */
-static List *
+List *
 get_relation_constraints(PlannerInfo *root,
 						 Oid relationObjectId, RelOptInfo *rel,
 						 bool include_notnull)
diff --git a/src/include/optimizer/plancat.h b/src/include/optimizer/plancat.h
index 11e7d4d..5246a6c 100644
--- a/src/include/optimizer/plancat.h
+++ b/src/include/optimizer/plancat.h
@@ -28,6 +28,10 @@ extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;
 extern void get_relation_info(PlannerInfo *root, Oid relationObjectId,
 				  bool inhparent, RelOptInfo *rel);
 
+extern List *get_relation_constraints(PlannerInfo *root,
+                                                 Oid relationObjectId, RelOptInfo *rel,
+                                                 bool include_notnull);
+
 extern List *infer_arbiter_indexes(PlannerInfo *root);
 
 extern void estimate_rel_size(Relation rel, int32 *attr_widths,
test_queries.sql
#2 Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Taiki Kondo (#1)
Re: [Proposal] Table partition + join pushdown

Hello Kondo-san,

I briefly checked your patch. Let me offer some comments on
its design and implementation, even though I have no arguments
against its concept. :-)

* Construction of RelOptInfo

In your patch, try_hashjoin_pushdown(), called by try_hashjoin_path(),
constructs the RelOptInfo of the join-rel between the inner-rel and a
subpath of the Append node. That is an entirely wrong implementation.

I can understand that we (may) have no RelOptInfo for the joinrel between
tbl_child_0 and other_table when the planner investigates a join path
joining the Append path with other_table.

HashJoin
-> Append
-> SeqScan on tbl_child_0
-> SeqScan on tbl_child_1
-> SeqScan on tbl_child_2
-> SeqScan on tbl_child_3
-> Hash
-> SeqScan on other_table

How about these alternatives?

- Call make_join_rel() on the pair of tbl_child_X and other_table
from try_hashjoin_pushdown() or somewhere. make_join_rel() internally
constructs a RelOptInfo for the supplied pair of relations, so the
relevant RelOptInfo shall be properly constructed.
- make_join_rel() also calls add_paths_to_joinrel() for all the
join logic, so it makes it easier to support pushing down other join
logic, including nested-loop or custom-join.
- It may be an idea to add an extra argument to make_join_rel() to
supply the expressions to be applied for tuple filtering on
construction of the inner hash table.

* Why only SeqScan is supported

I think it is the role of the Hash node to filter out inner tuples
obviously unrelated to the join (when the CHECK constraint of the outer
relation gives that information), because this join pushdown may be able
to support multi-stacked pushdown.

For example, what if the planner considers a path joining this Append path
with yet another relation, and the join clause contains a reference to X?

Append
-> HashJoin
-> SeqScan on tbl_child_0
-> Hash ... Filter: hash_func(X) % 4 = 0
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_1
-> Hash ... Filter: hash_func(X) % 4 = 1
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_2
-> Hash ... Filter: hash_func(X) % 4 = 2
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_3
-> Hash ... Filter: hash_func(X) % 4 = 3
-> SeqScan on other_table

It may be a good challenge to consider an additional join pushdown
even when the subpaths of Append are HashJoins rather than SeqScans, like:

Append
-> HashJoin
-> HashJoin
-> SeqScan on tbl_child_0
-> Hash ... Filter: hash_func(X) % 4 = 0
-> SeqScan on other_table
-> Hash ... Filter: hash_func(X) % 4 = 0
-> SeqScan on another_table
-> HashJoin
-> HashJoin
-> SeqScan on tbl_child_1
-> Hash ... Filter: hash_func(X) % 4 = 1
-> SeqScan on other_table
-> Hash ... Filter: hash_func(X) % 4 = 1
-> SeqScan on another_table
-> HashJoin
-> HashJoin
-> SeqScan on tbl_child_2
-> Hash ... Filter: hash_func(X) % 4 = 2
-> SeqScan on other_table
-> Hash ... Filter: hash_func(X) % 4 = 2
-> SeqScan on another_table
-> HashJoin
-> HashJoin
-> SeqScan on tbl_child_3
-> Hash ... Filter: hash_func(X) % 4 = 3
-> SeqScan on other_table
-> Hash ... Filter: hash_func(X) % 4 = 3
-> SeqScan on another_table

In this case, the underlying nodes are not always SeqScans, so only
the Hash node can carry the filter clauses.

* Way to handle expression nodes

All this patch supports is CHECK() constraints with the equality operator
on the INT4 data type. You can learn about various useful pieces of
PostgreSQL infrastructure. For example:
- expression_tree_mutator() is useful for making a copy of an expression
node with small modifications
- pull_varnos() is useful for checking which relations are referenced
by an expression node.
- RestrictInfo->can_join is useful for checking whether the clause is
a binary operator or not.

Anyway, reusing existing infrastructure is the best way to build a
reliable feature and to keep the implementation simple.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#3 Taiki Kondo
tai-kondo@yk.jp.nec.com
In reply to: Kouhei Kaigai (#2)
1 attachment(s)
Re: [Proposal] Table partition + join pushdown

Hello, KaiGai-san.

Thank you for your comment, and sorry for late response.

The attached patch is completely rewritten from the previous patch [1], following your suggestion [2].
Please find it attached.

This patch contains the following implementation decisions, but I cannot determine whether they are correct or wrong.

1. Cost estimation
In this patch, an additional row filter is implemented for hash join, merge join and nested loop.
I implemented the cost estimation for this filter by imitating the other filtering code,
but I am not sure this implementation is correct.

2. Workaround for a failing assertion in allpaths.c
In standard_join_search(), we expect to have a single rel at the final level.
But this expectation is broken by the join pushdown feature, because it
searches combinations at the final level that the original standard_join_search()
does not. Therefore, once join pushdown succeeds, the assertion in allpaths.c fails.

So I implemented a workaround that temporarily sets root->join_rel_level to NULL while
trying join pushdown, but I think this implementation may be wrong.

3. Searching pathkeys for merge join
When the join pushdown feature chooses merge join for the pushed-down join,
the planner fails to create the merge join node because it is unable to find
pathkeys for it. I found this is caused by skipping child tables when finding
pathkeys.

I expect that this skipping is done to make the planner faster, so I changed it
so that the planner does not skip child tables when finding pathkeys for merge join.
But I am not sure this expectation is correct.

Any comments/suggestions are welcome.

Remarks :
[1]: /messages/by-id/12A9442FBAE80D4E8953883E0B84E0885C01FD@BPXM01GP.gisp.nec.co.jp
[2]: /messages/by-id/9A28C8860F777E439AA12E8AEA7694F8011345B6@BPXM15GP.gisp.nec.co.jp

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

-----Original Message-----
From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com]
Sent: Tuesday, August 18, 2015 5:47 PM
To: Kondo Taiki(近藤 太樹); pgsql-hackers@postgresql.org
Cc: Iwaasa Akio(岩浅 晃郎)
Subject: RE: [Proposal] Table partition + join pushdown

Hello Kondo-san,

I briefly checked your patch. Let me offer some comments about its design and implementation, even though I have no objection to its concept. :-)

* Construction of RelOptInfo

In your patch, try_hashjoin_pushdown(), called by try_hashjoin_path(), constructs the RelOptInfo of the join-rel between the inner rel and a subpath of the Append node. This is an entirely wrong implementation.

I can understand that we (may) have no RelOptInfo for the joinrel between
tbl_child_0 and other_table when the planner investigates a join path between the Append path and other_table.

HashJoin
-> Append
-> SeqScan on tbl_child_0
-> SeqScan on tbl_child_1
-> SeqScan on tbl_child_2
-> SeqScan on tbl_child_3
-> Hash
-> SeqScan on other_table

How about these alternatives?

- calls make_join_rel() on the pair of tbl_child_X and other_table
from try_hashjoin_pushdown() or somewhere. make_join_rel() internally
constructs a RelOptInfo for the supplied pair of relations, so
the relevant RelOptInfo shall be properly constructed.
- make_join_rel() also calls add_paths_to_joinrel() for all the
join logic, so it makes it easier to support pushing down other join
logic, including nested-loop or custom-join.
- It may be an idea to add an extra argument to make_join_rel() to
pass the expressions to be applied for tuple filtering during
construction of the inner hash table.

* Why only SeqScan is supported

I think it is the role of the Hash node to filter out inner tuples obviously unrelated to the join (if the CHECK constraint of the outer relation provides that information), because this join pushdown may be able to support multi-stacked pushdowns.

For example, what if the planner considers a path to join this Append path with another relation, and the join clause contains a reference to X?

Append
-> HashJoin
-> SeqScan on tbl_child_0
-> Hash ... Filter: hash_func(X) % 4 = 0
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_1
-> Hash ... Filter: hash_func(X) % 4 = 1
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_2
-> Hash ... Filter: hash_func(X) % 4 = 2
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_3
-> Hash ... Filter: hash_func(X) % 4 = 3
-> SeqScan on other_table

It may be a good challenge to consider additional join pushdown, even if the subpaths of Append are HashJoins, not SeqScans, like:

Append
-> HashJoin
-> HashJoin
-> SeqScan on tbl_child_0
-> Hash ... Filter: hash_func(X) % 4 = 0
-> SeqScan on other_table
-> Hash ... Filter: hash_func(X) % 4 = 0
-> SeqScan on another_table
-> HashJoin
-> HashJoin
-> SeqScan on tbl_child_1
-> Hash ... Filter: hash_func(X) % 4 = 1
-> SeqScan on other_table
-> Hash ... Filter: hash_func(X) % 4 = 1
-> SeqScan on another_table
-> HashJoin
-> HashJoin
-> SeqScan on tbl_child_2
-> Hash ... Filter: hash_func(X) % 4 = 2
-> SeqScan on other_table
-> Hash ... Filter: hash_func(X) % 4 = 2
-> SeqScan on another_table
-> HashJoin
-> HashJoin
-> SeqScan on tbl_child_3
-> Hash ... Filter: hash_func(X) % 4 = 3
-> SeqScan on other_table
-> Hash ... Filter: hash_func(X) % 4 = 3
-> SeqScan on another_table

In this case, the underlying nodes are not always SeqScans, so only the Hash node can carry the filter clauses.

* Way to handle expression nodes

All this patch supports is CHECK() constraints using the equality operator on the INT4 data type. You can find various useful pieces of PostgreSQL infrastructure here. For example, ...
- expression_tree_mutator() is useful to make a copy of an expression
node with small modifications
- pull_varnos() is useful to check which relations are referenced
by an expression node.
- RestrictInfo->can_join is useful to check whether the clause is
a binary operator, or not.

Anyway, reusing existing infrastructure is the best way to build a reliable feature and to keep the implementation simple.

Thanks,
--
NEC Business Creation Division / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com>


-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
Sent: Thursday, August 13, 2015 6:30 PM
To: pgsql-hackers@postgresql.org
Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎)
Subject: [HACKERS] [Proposal] Table partition + join pushdown

Hi all,

I saw the email about the idea from KaiGai-san[1], and I worked to
implement this idea.

Now, I have implemented a part of this idea, so I want to propose this
feature.

Patch attached just shows my concept of this feature.
It works fine for EXPLAIN, but it returns wrong result for other operations, sadly.

Table partition + join pushdown
===============================

Motivation
----------
To make join logic work more effectively, it is important to make
the relations smaller.

Especially in Hash-join, it is meaningful to make the inner relation
smaller, because a smaller inner relation can be stored in a smaller hash table.
This means that memory usage can be reduced when joining big tables.

Design
------
It was described in the email from KaiGai-san,
so I quote it below...

---- begin quotation ---
Let's assume a table which is partitioned into four portions, where
the individual child relations have a constraint on the hash value of
the ID field.

tbl_parent
+ tbl_child_0 ... CHECK(hash_func(id) % 4 = 0)
+ tbl_child_1 ... CHECK(hash_func(id) % 4 = 1)
+ tbl_child_2 ... CHECK(hash_func(id) % 4 = 2)
+ tbl_child_3 ... CHECK(hash_func(id) % 4 = 3)

If someone tried to join another relation with tbl_parent using an
equivalence condition, like X = tbl_parent.ID, we know inner tuples
that do not satisfy the condition
hash_func(X) % 4 = 0
will never be joined to the tuples in tbl_child_0.
So we can avoid loading these tuples into the inner hash table in advance,
which potentially allows us to split the inner hash table.

Current typical plan structure is below:

HashJoin
-> Append
-> SeqScan on tbl_child_0
-> SeqScan on tbl_child_1
-> SeqScan on tbl_child_2
-> SeqScan on tbl_child_3
-> Hash
-> SeqScan on other_table

It may be rewritable to:

Append
-> HashJoin
-> SeqScan on tbl_child_0
-> Hash ... Filter: hash_func(X) % 4 = 0
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_1
-> Hash ... Filter: hash_func(X) % 4 = 1
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_2
-> Hash ... Filter: hash_func(X) % 4 = 2
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_3
-> Hash ... Filter: hash_func(X) % 4 = 3
-> SeqScan on other_table
---- end quotation ---

In the quotation above, it was written that the filter is set on the Hash node.
But I implemented it so the filter is set on the SeqScan node under the Hash node.
In my opinion, filtering tuples is the Scanner's job.

Append
-> HashJoin
-> SeqScan on tbl_child_0
-> Hash
-> SeqScan on other_table ... Filter: hash_func(X) % 4 = 0
-> HashJoin
-> SeqScan on tbl_child_1
-> Hash
-> SeqScan on other_table ... Filter: hash_func(X) % 4 = 1
-> HashJoin
-> SeqScan on tbl_child_2
-> Hash
-> SeqScan on other_table ... Filter: hash_func(X) % 4 = 2
-> HashJoin
-> SeqScan on tbl_child_3
-> Hash
-> SeqScan on other_table ... Filter: hash_func(X) % 4 = 3

API
---
There are 3 new internal (static) functions to implement this feature.
try_hashjoin_pushdown(), the main function of this feature, is
called from try_hashjoin_path() and tries to push the HashPath down under
the AppendPath.

To do so, this function performs the following operations.

1. Check if this Hash-join can be pushed down under the AppendPath.
2. To avoid influencing other path-making operations,
copy the inner path's RelOptInfo and make a new SeqScan path from it.
Here, get the CHECK() constraints from the OUTER path and convert their
Var nodes according to the join condition. Also convert the Var nodes
in the join condition itself.
3. Create a new HashPath node between each sub-path of the AppendPath and
the inner path made above.
4. When operations 1 to 3 are done for all sub-paths,
create a new AppendPath whose sub-paths are the HashPath nodes made above.

get_replaced_clause_constr() is called from try_hashjoin_pushdown(),
and get_var_nodes_recurse() is called from get_replaced_clause_constr().
These two functions help with the operations above.
(I may revise this part to use expression_tree_walker() and
expression_tree_mutator().)

The attached patch has the following limitations.
o It only works for hash-join operations.
(I want to support not only hash-join but also other join logic.)
o Join conditions must use the "=" operator on int4 variables.
o The inner path must be a SeqScan.
(I want to support other path nodes.)
o For now, the planner may not choose this plan,
because its estimated cost is usually larger than the original (non-pushdown) plan's.

Also, one internal (static) function, get_relation_constraints(),
defined in plancat.c, has been made global. This function will be called
from get_replaced_clause_constr() to get the CHECK() constraints.

Usage
-----
To use this feature, create partitioned tables and a small table to join,
and run a SELECT that joins these tables.

For your convenience, I attach DDL and DML script.
And I also attach the result of EXPLAIN.

Any comments are welcome. But first of all, I need your advice to
correct this patch's behavior.

At least, I think it has to expand the array of RangeTblEntrys and the other
arrays defined in PlannerInfo to register new RelOptInfos for the new Path nodes mentioned above.
Or is it a better choice to modify the query parser to implement this
feature further?

Remarks :
[1]: /messages/by-id/9A28C8860F777E439AA12E8AEA7694F8010F672B@BPXM15GP.gisp.nec.co.jp

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

Attachments:

join_pushdown.v1.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f0d9e94..995bc6c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1340,9 +1340,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			if (((NestLoop *) plan)->join.joinqual)
 				show_instrumentation_count("Rows Removed by Join Filter", 1,
 										   planstate, es);
-			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
-			if (plan->qual)
-				show_instrumentation_count("Rows Removed by Filter", 2,
+			show_upper_qual(plan->qual, "Other Filter", planstate, ancestors, es);
+			show_upper_qual(((NestLoop *) plan)->join.filterqual,
+							"Inner Filter", planstate, ancestors, es);
+			if (plan->qual || ((NestLoop *) plan)->join.filterqual)
+				show_instrumentation_count("Rows Removed by Inner/Other Filter", 2,
 										   planstate, es);
 			break;
 		case T_MergeJoin:
@@ -1353,9 +1355,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			if (((MergeJoin *) plan)->join.joinqual)
 				show_instrumentation_count("Rows Removed by Join Filter", 1,
 										   planstate, es);
-			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
-			if (plan->qual)
-				show_instrumentation_count("Rows Removed by Filter", 2,
+			show_upper_qual(plan->qual, "Other Filter", planstate, ancestors, es);
+			show_upper_qual(((MergeJoin *) plan)->join.filterqual,
+							"Inner Filter", planstate, ancestors, es);
+			if (plan->qual || ((MergeJoin *) plan)->join.filterqual)
+				show_instrumentation_count("Rows Removed by Inner/Other Filters", 2,
 										   planstate, es);
 			break;
 		case T_HashJoin:
@@ -1366,9 +1370,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			if (((HashJoin *) plan)->join.joinqual)
 				show_instrumentation_count("Rows Removed by Join Filter", 1,
 										   planstate, es);
-			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+			show_upper_qual(plan->qual, "Other Filter", planstate, ancestors, es);
 			if (plan->qual)
-				show_instrumentation_count("Rows Removed by Filter", 2,
+				show_instrumentation_count("Rows Removed by Other Filter", 2,
 										   planstate, es);
 			break;
 		case T_Agg:
@@ -1407,6 +1411,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			break;
 		case T_Hash:
 			show_hash_info((HashState *) planstate, es);
+			show_upper_qual(((Hash *) plan)->filterqual, "Inner Filter",
+							planstate, ancestors, es);
+			if (((Hash *) plan)->filterqual)
+				show_instrumentation_count("Rows Removed by Inner Filter", 2,
+										   planstate, es);
 			break;
 		default:
 			break;
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 0b2c139..87a71a0 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -75,6 +75,7 @@ MultiExecHash(HashState *node)
 {
 	PlanState  *outerNode;
 	List	   *hashkeys;
+	List	   *filterqual;
 	HashJoinTable hashtable;
 	TupleTableSlot *slot;
 	ExprContext *econtext;
@@ -95,6 +96,7 @@ MultiExecHash(HashState *node)
 	 */
 	hashkeys = node->hashkeys;
 	econtext = node->ps.ps_ExprContext;
+	filterqual = node->filterqual;
 
 	/*
 	 * get all inner tuples and insert into the hash table (or temp files)
@@ -104,8 +106,31 @@ MultiExecHash(HashState *node)
 		slot = ExecProcNode(outerNode);
 		if (TupIsNull(slot))
 			break;
+
+		/*
+		 * Sub node is connected to this node as "OUTER",
+		 * so we temporarily specify slot as the outer tuple during ExecQual.
+		 */
+		econtext->ecxt_outertuple = slot;
+
+		/*
+		 * Now, we filter with filterqual.
+		 */
+		if (filterqual == NIL || ExecQual(filterqual, econtext, false))
+		{
+			/* Nothing to do. No-op */
+		}
+		else
+		{
+			/* filterqual is neither scanqual nor joinqual. */
+			InstrCountFiltered2(node, 1);
+			continue;
+		}
+
 		/* We have to compute the hash value */
 		econtext->ecxt_innertuple = slot;
+		econtext->ecxt_outertuple = NULL;
+
 		if (ExecHashGetHashValue(hashtable, econtext, hashkeys,
 								 false, hashtable->keepNulls,
 								 &hashvalue))
@@ -206,6 +231,9 @@ ExecInitHash(Hash *node, EState *estate, int eflags)
 	hashstate->ps.qual = (List *)
 		ExecInitExpr((Expr *) node->plan.qual,
 					 (PlanState *) hashstate);
+	hashstate->filterqual = (List *)
+		ExecInitExpr((Expr *) node->filterqual,
+					  (PlanState *) hashstate);
 
 	/*
 	 * initialize child nodes
@@ -273,7 +301,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	outerNode = outerPlan(node);
 
-	ExecChooseHashTableSize(outerNode->plan_rows, outerNode->plan_width,
+	ExecChooseHashTableSize(node->plan.plan_rows, outerNode->plan_width,
 							OidIsValid(node->skewTable),
 							&nbuckets, &nbatch, &num_skew_mcvs);
 
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 1d78cdf..eb2f250 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -472,6 +472,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate->js.joinqual = (List *)
 		ExecInitExpr((Expr *) node->join.joinqual,
 					 (PlanState *) hjstate);
+	 /* filterqual is not needed here; it is needed in Hash instead */
+	hjstate->js.filterqual = NIL;
 	hjstate->hashclauses = (List *)
 		ExecInitExpr((Expr *) node->hashclauses,
 					 (PlanState *) hjstate);
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 34b6cf6..cd858f5 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -355,6 +355,22 @@ MJEvalInnerValues(MergeJoinState *mergestate, TupleTableSlot *innerslot)
 
 	econtext->ecxt_innertuple = innerslot;
 
+	/*
+	 * We filter inner tuple with filterqual here.
+	 */
+	if (mergestate->js.filterqual == NIL ||
+			ExecQual(mergestate->js.filterqual, econtext, false))
+	{
+		/* Nothing to do. No-op */
+	}
+	else
+	{
+		/* Filtered. */
+		MemoryContextSwitchTo(oldContext);
+		InstrCountFiltered2(mergestate, 1);
+		return MJEVAL_NONMATCHABLE;
+	}
+
 	for (i = 0; i < mergestate->mj_NumClauses; i++)
 	{
 		MergeJoinClause clause = &mergestate->mj_Clauses[i];
@@ -1514,6 +1530,9 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 	mergestate->js.joinqual = (List *)
 		ExecInitExpr((Expr *) node->join.joinqual,
 					 (PlanState *) mergestate);
+	mergestate->js.filterqual = (List *)
+		ExecInitExpr((Expr *) node->join.filterqual,
+					 (PlanState *) mergestate);
 	mergestate->mj_ConstFalseJoin = false;
 	/* mergeclauses are handled below */
 
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index e66bcda..93ed171 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -65,6 +65,7 @@ ExecNestLoop(NestLoopState *node)
 	TupleTableSlot *outerTupleSlot;
 	TupleTableSlot *innerTupleSlot;
 	List	   *joinqual;
+	List	   *filterqual;
 	List	   *otherqual;
 	ExprContext *econtext;
 	ListCell   *lc;
@@ -76,6 +77,7 @@ ExecNestLoop(NestLoopState *node)
 
 	nl = (NestLoop *) node->js.ps.plan;
 	joinqual = node->js.joinqual;
+	filterqual = node->js.filterqual;
 	otherqual = node->js.ps.qual;
 	outerPlan = outerPlanState(node);
 	innerPlan = innerPlanState(node);
@@ -174,6 +176,20 @@ ExecNestLoop(NestLoopState *node)
 		innerTupleSlot = ExecProcNode(innerPlan);
 		econtext->ecxt_innertuple = innerTupleSlot;
 
+		/*
+		 * We filter inner tuple with filterqual here.
+		 */
+		if (filterqual == NIL || ExecQual(filterqual, econtext, false))
+		{
+			/* Nothing to do. No-op */
+		}
+		else
+		{
+			/* Filtered. */
+			InstrCountFiltered2(node, 1);
+			continue;
+		}
+
 		if (TupIsNull(innerTupleSlot))
 		{
 			ENL1_printf("no inner tuple, need new outer tuple");
@@ -330,6 +346,9 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	nlstate->js.joinqual = (List *)
 		ExecInitExpr((Expr *) node->join.joinqual,
 					 (PlanState *) nlstate);
+	nlstate->js.filterqual = (List *)
+		ExecInitExpr((Expr *) node->join.filterqual,
+					 (PlanState *) nlstate);
 
 	/*
 	 * initialize child nodes
diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index d9a20da..faadb69 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -261,7 +261,8 @@ merge_clump(PlannerInfo *root, List *clumps, Clump *new_clump, bool force)
 			 */
 			joinrel = make_join_rel(root,
 									old_clump->joinrel,
-									new_clump->joinrel);
+									new_clump->joinrel,
+									NIL);
 
 			/* Keep searching if join order is not valid */
 			if (joinrel)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d107d76..e9bd7ec 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1995,6 +1995,7 @@ final_cost_nestloop(PlannerInfo *root, NestPath *path,
 
 	/* CPU costs */
 	cost_qual_eval(&restrict_qual_cost, path->joinrestrictinfo, root);
+	cost_qual_eval(&restrict_qual_cost, path->filterrestrictinfo, root);
 	startup_cost += restrict_qual_cost.startup;
 	cpu_per_tuple = cpu_tuple_cost + restrict_qual_cost.per_tuple;
 	run_cost += cpu_per_tuple * ntuples;
@@ -2306,6 +2307,7 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
 	 */
 	cost_qual_eval(&merge_qual_cost, mergeclauses, root);
 	cost_qual_eval(&qp_qual_cost, path->jpath.joinrestrictinfo, root);
+	cost_qual_eval(&qp_qual_cost, path->jpath.filterrestrictinfo, root);
 	qp_qual_cost.startup -= merge_qual_cost.startup;
 	qp_qual_cost.per_tuple -= merge_qual_cost.per_tuple;
 
@@ -2547,6 +2549,7 @@ void
 initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 					  JoinType jointype,
 					  List *hashclauses,
+					  List *added_restrictinfo,
 					  Path *outer_path, Path *inner_path,
 					  SpecialJoinInfo *sjinfo,
 					  SemiAntiJoinFactors *semifactors)
@@ -2565,6 +2568,18 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	run_cost += outer_path->total_cost - outer_path->startup_cost;
 	startup_cost += inner_path->total_cost;
 
+	/* estimate nrows of inner_path filtered by added_restrictinfo */
+	if (added_restrictinfo != NIL)
+	{
+		inner_path_rows *=
+				clauselist_selectivity(root,
+									   added_restrictinfo,
+									   0,
+									   JOIN_INNER,
+									   NULL);
+		inner_path_rows = clamp_row_est(inner_path_rows);
+	}
+
 	/*
 	 * Cost of computing hash function: must do it once per input tuple. We
 	 * charge one cpu_operator_cost for each column's hash function.  Also,
@@ -2655,6 +2670,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	Cost		cpu_per_tuple;
 	QualCost	hash_qual_cost;
 	QualCost	qp_qual_cost;
+	QualCost	filter_qual_cost;
 	double		hashjointuples;
 	double		virtualbuckets;
 	Selectivity innerbucketsize;
@@ -2674,6 +2690,18 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	if (!enable_hashjoin)
 		startup_cost += disable_cost;
 
+	/* estimate nrows of inner_path filtered by filter restrict info */
+	if (path->jpath.filterrestrictinfo != NIL)
+	{
+		inner_path_rows *=
+				clauselist_selectivity(root,
+									   path->jpath.filterrestrictinfo,
+									   0,
+									   JOIN_INNER,
+									   NULL);
+		inner_path_rows = clamp_row_est(inner_path_rows);
+	}
+
 	/* mark the path with estimated # of batches */
 	path->num_batches = numbatches;
 
@@ -2753,6 +2781,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	 */
 	cost_qual_eval(&hash_qual_cost, hashclauses, root);
 	cost_qual_eval(&qp_qual_cost, path->jpath.joinrestrictinfo, root);
+	cost_qual_eval(&filter_qual_cost, path->jpath.filterrestrictinfo, root);
 	qp_qual_cost.startup -= hash_qual_cost.startup;
 	qp_qual_cost.per_tuple -= hash_qual_cost.per_tuple;
 
@@ -2835,6 +2864,8 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	 * not all of the quals may get evaluated at each tuple.)
 	 */
 	startup_cost += qp_qual_cost.startup;
+	startup_cost += filter_qual_cost.startup +
+			filter_qual_cost.per_tuple * inner_path_rows;
 	cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
 	run_cost += cpu_per_tuple * hashjointuples;
 
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index a35c881..23de7f2 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -18,9 +18,22 @@
 
 #include "executor/executor.h"
 #include "foreign/fdwapi.h"
+#include "nodes/nodeFuncs.h"
+#include "nodes/nodes.h"
+#include "optimizer/clauses.h"
 #include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "optimizer/plancat.h"
+#include "optimizer/restrictinfo.h"
+#include "rewrite/rewriteManip.h"
+#include "utils/lsyscache.h"
+
+typedef struct
+{
+	List	*joininfo;
+	bool	 is_mutated;
+} check_constraint_mutator_context;
 
 /* Hook for plugins to get control in add_paths_to_joinrel() */
 set_join_pathlist_hook_type set_join_pathlist_hook = NULL;
@@ -45,6 +58,11 @@ static List *select_mergejoin_clauses(PlannerInfo *root,
 						 JoinType jointype,
 						 bool *mergejoin_allowed);
 
+static void try_join_pushdown(PlannerInfo *root,
+						  RelOptInfo *joinrel, RelOptInfo *outer_rel,
+						  RelOptInfo *inner_rel,
+						  List *restrictlist);
+
 
 /*
  * add_paths_to_joinrel
@@ -76,13 +94,33 @@ add_paths_to_joinrel(PlannerInfo *root,
 					 RelOptInfo *innerrel,
 					 JoinType jointype,
 					 SpecialJoinInfo *sjinfo,
-					 List *restrictlist)
+					 List *restrictlist,
+					 List *added_restrictlist,
+					 bool  added_rinfo_for_outer)
 {
 	JoinPathExtraData extra;
 	bool		mergejoin_allowed = true;
 	ListCell   *lc;
 
-	extra.restrictlist = restrictlist;
+	/*
+	 * Try to push Join down under Append
+	 */
+	if (!IS_OUTER_JOIN(jointype))
+	{
+		try_join_pushdown(root, joinrel, outerrel, innerrel, restrictlist);
+	}
+
+	if (added_restrictlist != NIL && added_rinfo_for_outer)
+	{
+		extra.restrictlist =
+				list_concat(list_copy(restrictlist), added_restrictlist);
+		extra.added_restrictlist = NIL;
+	}
+	else
+	{
+		extra.restrictlist = restrictlist;
+		extra.added_restrictlist = added_restrictlist;
+	}
 	extra.mergeclause_list = NIL;
 	extra.sjinfo = sjinfo;
 	extra.param_source_rels = NULL;
@@ -417,6 +455,7 @@ try_nestloop_path(PlannerInfo *root,
 									  outer_path,
 									  inner_path,
 									  extra->restrictlist,
+									  extra->added_restrictlist,
 									  pathkeys,
 									  required_outer));
 	}
@@ -499,6 +538,7 @@ try_mergejoin_path(PlannerInfo *root,
 									   outer_path,
 									   inner_path,
 									   extra->restrictlist,
+									   extra->added_restrictlist,
 									   pathkeys,
 									   required_outer,
 									   mergeclauses,
@@ -554,6 +594,7 @@ try_hashjoin_path(PlannerInfo *root,
 	 * never have any output pathkeys, per comments in create_hashjoin_path.
 	 */
 	initial_cost_hashjoin(root, &workspace, jointype, hashclauses,
+						  extra->added_restrictlist,
 						  outer_path, inner_path,
 						  extra->sjinfo, &extra->semifactors);
 
@@ -571,6 +612,7 @@ try_hashjoin_path(PlannerInfo *root,
 									  outer_path,
 									  inner_path,
 									  extra->restrictlist,
+									  extra->added_restrictlist,
 									  required_outer,
 									  hashclauses));
 	}
@@ -1474,3 +1516,250 @@ select_mergejoin_clauses(PlannerInfo *root,
 
 	return result_list;
 }
+
+static Node *
+check_constraint_mutator(Node *node, check_constraint_mutator_context *context)
+{
+	/* Failed to mutate. Abort. */
+	if (!context->is_mutated)
+		return (Node *) copyObject(node);
+
+	if (node == NULL)
+		return NULL;
+
+	if (IsA(node, Var))
+	{
+		List		*l = context->joininfo;
+		ListCell	*lc;
+
+		Assert(list_length(l) > 0);
+
+		foreach (lc, l)
+		{
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+			Expr *expr = rinfo->clause;
+
+			if (!rinfo->can_join ||
+				!IsA(expr, OpExpr) ||
+				!op_hashjoinable(((OpExpr *) expr)->opno,
+								exprType(get_leftop(expr))))
+				continue;
+
+			if (equal(get_leftop(expr), node))
+			{
+				/*
+				 * This node is equal to LEFT of join clause,
+				 * thus will be replaced with RIGHT clause.
+				 */
+				return (Node *) copyObject(get_rightop(expr));
+			}
+			else
+			if (equal(get_rightop(expr), node))
+			{
+				/*
+				 * This node is equal to RIGHT of join clause,
+				 * thus will be replaced with LEFT clause.
+				 */
+				return (Node *) copyObject(get_leftop(expr));
+			}
+		}
+
+		/* Unfortunately, mutation failed. */
+		context->is_mutated = false;
+		return (Node *) copyObject(node);
+	}
+
+	return expression_tree_mutator(node, check_constraint_mutator, context);
+}
+
+/*
+ * Make RestrictInfo_List from CHECK() constraints.
+ */
+static List *
+make_restrictinfos_from_check_constr(PlannerInfo *root,
+									List *joininfo, RelOptInfo *outer_rel)
+{
+	List			*result = NIL;
+	RangeTblEntry	*childRTE = root->simple_rte_array[outer_rel->relid];
+	List			*check_constr =
+						get_relation_constraints(root, childRTE->relid,
+													outer_rel, false);
+	ListCell		*lc;
+
+	check_constraint_mutator_context	context;
+
+	context.joininfo = joininfo;
+	context.is_mutated = true;
+
+	/*
+	 * Try to change CHECK() constraints to filter expressions.
+	 */
+	foreach(lc, check_constr)
+	{
+		Node *mutated =
+				expression_tree_mutator((Node *) lfirst(lc),
+										check_constraint_mutator,
+										(void *) &context);
+
+		if (context.is_mutated)
+			result = lappend(result, mutated);
+	}
+
+	Assert(list_length(check_constr) == list_length(result));
+	list_free_deep(check_constr);
+
+	return make_restrictinfos_from_actual_clauses(root, result);
+}
+
+/*
+ * Mutate parent's relid to child one.
+ */
+static List *
+mutate_parent_relid_to_child(PlannerInfo *root, List *join_clauses,
+								RelOptInfo *outer_rel)
+{
+	Index		parent_relid =
+					find_childrel_appendrelinfo(root, outer_rel)->parent_relid;
+	List		*old_clauses = get_actual_clauses(join_clauses);
+	List		*new_clauses = NIL;
+	ListCell	*lc;
+
+	foreach(lc, old_clauses)
+	{
+		Node	*new_clause = (Node *) copyObject(lfirst(lc));
+
+		ChangeVarNodes(new_clause, parent_relid, outer_rel->relid, 0);
+		new_clauses = lappend(new_clauses, new_clause);
+	}
+
+	return make_restrictinfos_from_actual_clauses(root, new_clauses);
+}
+
+static inline List *
+extract_join_clauses(List *restrictlist, RelOptInfo *outer_prel,
+						RelOptInfo *inner_rel)
+{
+	List		*result = NIL;
+	ListCell	*lc;
+
+	foreach (lc, restrictlist)
+	{
+		RestrictInfo	*rinfo = (RestrictInfo *) lfirst(lc);
+
+		if (clause_sides_match_join(rinfo, outer_prel, inner_rel))
+			result = lappend(result, rinfo);
+	}
+
+	return result;
+}
+
+/*
+ * Try to push JoinPath down under AppendPath.
+ */
+static void
+try_join_pushdown(PlannerInfo *root,
+				  RelOptInfo *joinrel, RelOptInfo *outer_rel,
+				  RelOptInfo *inner_rel,
+				  List *restrictlist)
+{
+	AppendPath	*outer_path;
+	ListCell	*lc;
+	List		*old_joinclauses;
+	List		*new_append_subpaths = NIL;
+
+	Assert(outer_rel->cheapest_total_path != NULL);
+
+	/* When specified outer path is not an AppendPath, nothing to do here. */
+	if (!IsA(outer_rel->cheapest_total_path, AppendPath))
+	{
+		elog(DEBUG1, "Outer path is not an AppendPath. Do nothing.");
+		return;
+	}
+
+	outer_path = (AppendPath *) outer_rel->cheapest_total_path;
+
+	/*
+	 * Extract join clauses to mutate CHECK() constraints.
+	 * We don't have to clobber this list to mutate CHECK() constraints,
+	 * so we need to do only once.
+	 */
+	old_joinclauses = extract_join_clauses(restrictlist, outer_rel, inner_rel);
+
+	/*
+	 * Make a new joinrel between each of the outer path's sub-paths and the inner path.
+	 */
+	foreach(lc, outer_path->subpaths)
+	{
+		RelOptInfo	*old_outer_rel = ((Path *) lfirst(lc))->parent;
+		RelOptInfo	*new_outer_rel;
+		List		*new_joinclauses;
+		List		*added_restrictlist;
+		List		**join_rel_level;
+
+		Assert(!IS_DUMMY_REL(old_outer_rel));
+
+		/*
+		 * Join clause points parent's relid,
+		 * so we must change it to child's one.
+		 */
+		new_joinclauses = mutate_parent_relid_to_child(root, old_joinclauses,
+													old_outer_rel);
+
+		/*
+		 * Make RestrictInfo list from CHECK() constraints of outer table.
+		 */
+		added_restrictlist =
+				make_restrictinfos_from_check_constr(root, new_joinclauses,
+													old_outer_rel);
+
+		/* XXX This is workaround for failing assertion at allpaths.c */
+		join_rel_level = root->join_rel_level;
+		root->join_rel_level = NULL;
+
+		/*
+		 * Create new joinrel with restriction made above.
+		 */
+		new_outer_rel =
+				make_join_rel(root, old_outer_rel, inner_rel,
+						added_restrictlist);
+
+		root->join_rel_level = join_rel_level;
+
+		Assert(new_outer_rel != NULL);
+
+		if (IS_DUMMY_REL(new_outer_rel))
+		{
+			pfree(new_outer_rel);
+			continue;
+		}
+
+		/*
+		 * We must check that each new joinrel has at least one path;
+		 * add_path() sometimes refuses to add a new path to the parent RelOptInfo.
+		 */
+		if (list_length(new_outer_rel->pathlist) <= 0)
+		{
+			/*
+			 * Sadly, no paths were added. This means the pushdown failed,
+			 * thus clean up here.
+			 */
+			list_free_deep(new_append_subpaths);
+			pfree(new_outer_rel);
+			list_free(old_joinclauses);
+			elog(DEBUG1, "Join pushdown failed.");
+			return;
+		}
+
+		set_cheapest(new_outer_rel);
+		Assert(new_outer_rel->cheapest_total_path != NULL);
+		new_append_subpaths = lappend(new_append_subpaths,
+									new_outer_rel->cheapest_total_path);
+	}
+
+	/* Join pushdown succeeded. Add the path to the original joinrel. */
+	add_path(joinrel,
+			(Path *) create_append_path(joinrel, new_append_subpaths, NULL));
+
+	list_free(old_joinclauses);
+	elog(DEBUG1, "Join pushdown succeeded.");
+}
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index b2cc9f0..7075552 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -33,7 +33,6 @@ static void mark_dummy_rel(RelOptInfo *rel);
 static bool restriction_is_constant_false(List *restrictlist,
 							  bool only_pushed_down);
 
-
 /*
  * join_search_one_level
  *	  Consider ways to produce join relations containing exactly 'level'
@@ -170,7 +169,7 @@ join_search_one_level(PlannerInfo *root, int level)
 					if (have_relevant_joinclause(root, old_rel, new_rel) ||
 						have_join_order_restriction(root, old_rel, new_rel))
 					{
-						(void) make_join_rel(root, old_rel, new_rel);
+						(void) make_join_rel(root, old_rel, new_rel, NIL);
 					}
 				}
 			}
@@ -271,7 +270,7 @@ make_rels_by_clause_joins(PlannerInfo *root,
 			(have_relevant_joinclause(root, old_rel, other_rel) ||
 			 have_join_order_restriction(root, old_rel, other_rel)))
 		{
-			(void) make_join_rel(root, old_rel, other_rel);
+			(void) make_join_rel(root, old_rel, other_rel, NIL);
 		}
 	}
 }
@@ -303,7 +302,7 @@ make_rels_by_clauseless_joins(PlannerInfo *root,
 
 		if (!bms_overlap(other_rel->relids, old_rel->relids))
 		{
-			(void) make_join_rel(root, old_rel, other_rel);
+			(void) make_join_rel(root, old_rel, other_rel, NIL);
 		}
 	}
 }
@@ -589,7 +588,8 @@ join_is_legal(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
  * turned into joins.
  */
 RelOptInfo *
-make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2)
+make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
+			  List *added_restrictlist)
 {
 	Relids		joinrelids;
 	SpecialJoinInfo *sjinfo;
@@ -691,10 +691,14 @@ make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2)
 			}
 			add_paths_to_joinrel(root, joinrel, rel1, rel2,
 								 JOIN_INNER, sjinfo,
-								 restrictlist);
+								 restrictlist,
+								 added_restrictlist,
+								 false);
 			add_paths_to_joinrel(root, joinrel, rel2, rel1,
 								 JOIN_INNER, sjinfo,
-								 restrictlist);
+								 restrictlist,
+								 added_restrictlist,
+								 true);
 			break;
 		case JOIN_LEFT:
 			if (is_dummy_rel(rel1) ||
@@ -708,10 +712,14 @@ make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2)
 				mark_dummy_rel(rel2);
 			add_paths_to_joinrel(root, joinrel, rel1, rel2,
 								 JOIN_LEFT, sjinfo,
-								 restrictlist);
+								 restrictlist,
+								 added_restrictlist,
+								 false);
 			add_paths_to_joinrel(root, joinrel, rel2, rel1,
 								 JOIN_RIGHT, sjinfo,
-								 restrictlist);
+								 restrictlist,
+								 added_restrictlist,
+								 true);
 			break;
 		case JOIN_FULL:
 			if ((is_dummy_rel(rel1) && is_dummy_rel(rel2)) ||
@@ -722,10 +730,14 @@ make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2)
 			}
 			add_paths_to_joinrel(root, joinrel, rel1, rel2,
 								 JOIN_FULL, sjinfo,
-								 restrictlist);
+								 restrictlist,
+								 added_restrictlist,
+								 false);
 			add_paths_to_joinrel(root, joinrel, rel2, rel1,
 								 JOIN_FULL, sjinfo,
-								 restrictlist);
+								 restrictlist,
+								 added_restrictlist,
+								 true);
 
 			/*
 			 * If there are join quals that aren't mergeable or hashable, we
@@ -758,7 +770,9 @@ make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2)
 				}
 				add_paths_to_joinrel(root, joinrel, rel1, rel2,
 									 JOIN_SEMI, sjinfo,
-									 restrictlist);
+									 restrictlist,
+									 added_restrictlist,
+									 false);
 			}
 
 			/*
@@ -781,10 +795,14 @@ make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2)
 				}
 				add_paths_to_joinrel(root, joinrel, rel1, rel2,
 									 JOIN_UNIQUE_INNER, sjinfo,
-									 restrictlist);
+									 restrictlist,
+									 added_restrictlist,
+									 false);
 				add_paths_to_joinrel(root, joinrel, rel2, rel1,
 									 JOIN_UNIQUE_OUTER, sjinfo,
-									 restrictlist);
+									 restrictlist,
+									 added_restrictlist,
+									 true);
 			}
 			break;
 		case JOIN_ANTI:
@@ -799,7 +817,9 @@ make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2)
 				mark_dummy_rel(rel2);
 			add_paths_to_joinrel(root, joinrel, rel1, rel2,
 								 JOIN_ANTI, sjinfo,
-								 restrictlist);
+								 restrictlist,
+								 added_restrictlist,
+								 false);
 			break;
 		default:
 			/* other values not expected here */
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 404c6f5..8cbf86e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -135,7 +135,8 @@ static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 static BitmapAnd *make_bitmap_and(List *bitmapplans);
 static BitmapOr *make_bitmap_or(List *bitmapplans);
 static NestLoop *make_nestloop(List *tlist,
-			  List *joinclauses, List *otherclauses, List *nestParams,
+			  List *joinclauses,  List *filterclauses,
+			  List *otherclauses, List *nestParams,
 			  Plan *lefttree, Plan *righttree,
 			  JoinType jointype);
 static HashJoin *make_hashjoin(List *tlist,
@@ -144,14 +145,15 @@ static HashJoin *make_hashjoin(List *tlist,
 			  Plan *lefttree, Plan *righttree,
 			  JoinType jointype);
 static Hash *make_hash(Plan *lefttree,
+		  List *filterclauses,
 		  Oid skewTable,
 		  AttrNumber skewColumn,
 		  bool skewInherit,
 		  Oid skewColType,
 		  int32 skewColTypmod);
 static MergeJoin *make_mergejoin(List *tlist,
-			   List *joinclauses, List *otherclauses,
-			   List *mergeclauses,
+			   List *joinclauses, List *filterclauses,
+			   List *otherclauses, List *mergeclauses,
 			   Oid *mergefamilies,
 			   Oid *mergecollations,
 			   int *mergestrategies,
@@ -2239,6 +2241,7 @@ create_nestloop_plan(PlannerInfo *root,
 	List	   *tlist = build_path_tlist(root, &best_path->path);
 	List	   *joinrestrictclauses = best_path->joinrestrictinfo;
 	List	   *joinclauses;
+	List	   *filterclauses;
 	List	   *otherclauses;
 	Relids		outerrelids;
 	List	   *nestParams;
@@ -2248,6 +2251,7 @@ create_nestloop_plan(PlannerInfo *root,
 
 	/* Sort join qual clauses into best execution order */
 	joinrestrictclauses = order_qual_clauses(root, joinrestrictclauses);
+	filterclauses = order_qual_clauses(root, best_path->filterrestrictinfo);
 
 	/* Get the join qual clauses (in plain expression form) */
 	/* Any pseudoconstant clauses are ignored here */
@@ -2263,6 +2267,8 @@ create_nestloop_plan(PlannerInfo *root,
 		otherclauses = NIL;
 	}
 
+	filterclauses = extract_actual_clauses(filterclauses, false);
+
 	/* Replace any outer-relation variables with nestloop params */
 	if (best_path->path.param_info)
 	{
@@ -2309,6 +2315,7 @@ create_nestloop_plan(PlannerInfo *root,
 
 	join_plan = make_nestloop(tlist,
 							  joinclauses,
+							  filterclauses,
 							  otherclauses,
 							  nestParams,
 							  outer_plan,
@@ -2328,6 +2335,7 @@ create_mergejoin_plan(PlannerInfo *root,
 {
 	List	   *tlist = build_path_tlist(root, &best_path->jpath.path);
 	List	   *joinclauses;
+	List	   *filterclauses;
 	List	   *otherclauses;
 	List	   *mergeclauses;
 	List	   *outerpathkeys;
@@ -2346,6 +2354,7 @@ create_mergejoin_plan(PlannerInfo *root,
 	/* Sort join qual clauses into best execution order */
 	/* NB: do NOT reorder the mergeclauses */
 	joinclauses = order_qual_clauses(root, best_path->jpath.joinrestrictinfo);
+	filterclauses = order_qual_clauses(root, best_path->jpath.filterrestrictinfo);
 
 	/* Get the join qual clauses (in plain expression form) */
 	/* Any pseudoconstant clauses are ignored here */
@@ -2361,6 +2370,7 @@ create_mergejoin_plan(PlannerInfo *root,
 		otherclauses = NIL;
 	}
 
+	filterclauses = extract_actual_clauses(filterclauses, false);
 	/*
 	 * Remove the mergeclauses from the list of join qual clauses, leaving the
 	 * list of quals that must be checked as qpquals.
@@ -2599,6 +2609,7 @@ create_mergejoin_plan(PlannerInfo *root,
 	 */
 	join_plan = make_mergejoin(tlist,
 							   joinclauses,
+							   filterclauses,
 							   otherclauses,
 							   mergeclauses,
 							   mergefamilies,
@@ -2623,6 +2634,7 @@ create_hashjoin_plan(PlannerInfo *root,
 {
 	List	   *tlist = build_path_tlist(root, &best_path->jpath.path);
 	List	   *joinclauses;
+	List	   *filterclauses;
 	List	   *otherclauses;
 	List	   *hashclauses;
 	Oid			skewTable = InvalidOid;
@@ -2635,6 +2647,7 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	/* Sort join qual clauses into best execution order */
 	joinclauses = order_qual_clauses(root, best_path->jpath.joinrestrictinfo);
+	filterclauses = order_qual_clauses(root, best_path->jpath.filterrestrictinfo);
 	/* There's no point in sorting the hash clauses ... */
 
 	/* Get the join qual clauses (in plain expression form) */
@@ -2720,8 +2733,11 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	/*
 	 * Build the hash node and hash join node.
+	 *
+	 * For hash join, filterclauses is needed by the Hash node, not HashJoin.
 	 */
 	hash_plan = make_hash(inner_plan,
+						  filterclauses,
 						  skewTable,
 						  skewColumn,
 						  skewInherit,
@@ -3862,6 +3878,7 @@ make_bitmap_or(List *bitmapplans)
 static NestLoop *
 make_nestloop(List *tlist,
 			  List *joinclauses,
+			  List *filterclauses,
 			  List *otherclauses,
 			  List *nestParams,
 			  Plan *lefttree,
@@ -3878,6 +3895,7 @@ make_nestloop(List *tlist,
 	plan->righttree = righttree;
 	node->join.jointype = jointype;
 	node->join.joinqual = joinclauses;
+	node->join.filterqual = filterclauses;
 	node->nestParams = nestParams;
 
 	return node;
@@ -3903,12 +3921,15 @@ make_hashjoin(List *tlist,
 	node->hashclauses = hashclauses;
 	node->join.jointype = jointype;
 	node->join.joinqual = joinclauses;
+	/* filterqual is not needed for HashJoin node */
+	node->join.filterqual = NIL;
 
 	return node;
 }
 
 static Hash *
 make_hash(Plan *lefttree,
+		  List *filterclauses,
 		  Oid skewTable,
 		  AttrNumber skewColumn,
 		  bool skewInherit,
@@ -3919,6 +3940,27 @@ make_hash(Plan *lefttree,
 	Plan	   *plan = &node->plan;
 
 	copy_plan_costsize(plan, lefttree);
+	/*
+	 * Estimate the number of rows of the 'lefttree' rel after filtering
+	 * by 'filterclauses'.  All selectivity values should already be
+	 * cached, because clauselist_selectivity() has been called by this
+	 * point; thus we need not pass a real PlannerInfo here.
+	 */
+	if (filterclauses != NIL)
+	{
+#ifdef USE_ASSERT_CHECKING
+		ListCell *lc;
+
+		foreach (lc, filterclauses)
+		{
+			Assert(IsA(lfirst(lc), RestrictInfo));
+			Assert(((RestrictInfo *)lfirst(lc))->norm_selec != -1);
+		}
+#endif
+		plan->plan_rows *=
+				clauselist_selectivity(NULL, filterclauses, 0, JOIN_INNER, NULL);
+		plan->plan_rows = clamp_row_est(plan->plan_rows);
+	}
 
 	/*
 	 * For plausibility, make startup & total costs equal total cost of input
@@ -3930,6 +3972,7 @@ make_hash(Plan *lefttree,
 	plan->lefttree = lefttree;
 	plan->righttree = NULL;
 
+	node->filterqual = extract_actual_clauses(filterclauses, false);
 	node->skewTable = skewTable;
 	node->skewColumn = skewColumn;
 	node->skewInherit = skewInherit;
@@ -3942,6 +3985,7 @@ make_hash(Plan *lefttree,
 static MergeJoin *
 make_mergejoin(List *tlist,
 			   List *joinclauses,
+			   List *filterclauses,
 			   List *otherclauses,
 			   List *mergeclauses,
 			   Oid *mergefamilies,
@@ -3967,6 +4011,7 @@ make_mergejoin(List *tlist,
 	node->mergeNullsFirst = mergenullsfirst;
 	node->join.jointype = jointype;
 	node->join.joinqual = joinclauses;
+	node->join.filterqual = filterclauses;
 
 	return node;
 }
@@ -4011,6 +4056,21 @@ make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
 	return node;
 }
 
+static inline bool
+should_ignore_ec_member(EquivalenceMember *em, Relids relids)
+{
+	/*
+	 * If this is called from make_sort_from_pathkeys, relids may be NULL.
+	 * In this case, we must not ignore child members, because the inner/outer
+	 * plan of a pushed-down merge join is always a child table.
+	 */
+	if (!relids)
+		return false;
+
+	return (em->em_is_child &&
+		!bms_equal(em->em_relids, relids));
+}
+
 /*
  * prepare_sort_from_pathkeys
  *	  Prepare to sort according to given pathkeys
@@ -4190,8 +4250,7 @@ prepare_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
 				 * Ignore child members unless they match the rel being
 				 * sorted.
 				 */
-				if (em->em_is_child &&
-					!bms_equal(em->em_relids, relids))
+				if (should_ignore_ec_member(em, relids))
 					continue;
 
 				sortexpr = em->em_expr;
@@ -4304,8 +4363,7 @@ find_ec_member_for_tle(EquivalenceClass *ec,
 		/*
 		 * Ignore child members unless they match the rel being sorted.
 		 */
-		if (em->em_is_child &&
-			!bms_equal(em->em_relids, relids))
+		if (should_ignore_ec_member(em, relids))
 			continue;
 
 		/* Match if same expression (after stripping relabel) */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index daeb584..3d9eb00 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -104,6 +104,7 @@ static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
 static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
 static void set_join_references(PlannerInfo *root, Join *join, int rtoffset);
 static void set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset);
+static void set_hash_references(PlannerInfo *root, Hash *hash, int rtoffset);
 static void set_dummy_tlist_references(Plan *plan, int rtoffset);
 static indexed_tlist *build_tlist_index(List *tlist);
 static Var *search_indexed_tlist_for_var(Var *var,
@@ -598,6 +599,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 			break;
 
 		case T_Hash:
+			/*
+			 * For a Hash node, we need to fix the varnos in its filterqual.
+			 */
+			set_hash_references(root, (Hash *) plan, rtoffset);
+			/* FALL THRU */
+
 		case T_Material:
 		case T_Sort:
 		case T_Unique:
@@ -1493,6 +1500,14 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 								   (Index) 0,
 								   rtoffset);
 
+	/* This is not needed for HashJoin */
+	if (!IsA(join, HashJoin) || join->filterqual != NIL)
+		join->filterqual = fix_upper_expr(root,
+										  (Node *) join->filterqual,
+										  inner_itlist,
+										  INNER_VAR,
+										  rtoffset);
+
 	/* Now do join-type-specific stuff */
 	if (IsA(join, NestLoop))
 	{
@@ -1655,6 +1670,30 @@ set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset)
 	pfree(subplan_itlist);
 }
 
+static void
+set_hash_references(PlannerInfo *root, Hash *hash, int rtoffset)
+{
+	Plan	   *subplan = hash->plan.lefttree;
+	indexed_tlist *subplan_itlist;
+	List	   *output_targetlist;
+	ListCell   *l;
+
+	subplan_itlist = build_tlist_index(subplan->targetlist);
+
+	/*
+	 * The subplan node is connected to this plan node as "OUTER",
+	 * so we specify OUTER_VAR instead of INNER_VAR.
+	 */
+	hash->filterqual =
+		(List *) fix_upper_expr(root,
+								(Node *) hash->filterqual,
+								subplan_itlist,
+								OUTER_VAR,
+								rtoffset);
+
+	pfree(subplan_itlist);
+}
+
 /*
  * set_dummy_tlist_references
  *	  Replace the targetlist of an upper-level plan node with a simple
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 935bc2b..4d8fe4b 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1562,6 +1562,7 @@ create_nestloop_path(PlannerInfo *root,
 					 Path *outer_path,
 					 Path *inner_path,
 					 List *restrict_clauses,
+					 List *filtering_clauses,
 					 List *pathkeys,
 					 Relids required_outer)
 {
@@ -1609,6 +1610,7 @@ create_nestloop_path(PlannerInfo *root,
 	pathnode->outerjoinpath = outer_path;
 	pathnode->innerjoinpath = inner_path;
 	pathnode->joinrestrictinfo = restrict_clauses;
+	pathnode->filterrestrictinfo = filtering_clauses;
 
 	final_cost_nestloop(root, pathnode, workspace, sjinfo, semifactors);
 
@@ -1643,6 +1645,7 @@ create_mergejoin_path(PlannerInfo *root,
 					  Path *outer_path,
 					  Path *inner_path,
 					  List *restrict_clauses,
+					  List *filtering_clauses,
 					  List *pathkeys,
 					  Relids required_outer,
 					  List *mergeclauses,
@@ -1666,6 +1669,7 @@ create_mergejoin_path(PlannerInfo *root,
 	pathnode->jpath.outerjoinpath = outer_path;
 	pathnode->jpath.innerjoinpath = inner_path;
 	pathnode->jpath.joinrestrictinfo = restrict_clauses;
+	pathnode->jpath.filterrestrictinfo = filtering_clauses;
 	pathnode->path_mergeclauses = mergeclauses;
 	pathnode->outersortkeys = outersortkeys;
 	pathnode->innersortkeys = innersortkeys;
@@ -1702,6 +1706,7 @@ create_hashjoin_path(PlannerInfo *root,
 					 Path *outer_path,
 					 Path *inner_path,
 					 List *restrict_clauses,
+					 List *filtering_clauses,
 					 Relids required_outer,
 					 List *hashclauses)
 {
@@ -1734,6 +1739,7 @@ create_hashjoin_path(PlannerInfo *root,
 	pathnode->jpath.outerjoinpath = outer_path;
 	pathnode->jpath.innerjoinpath = inner_path;
 	pathnode->jpath.joinrestrictinfo = restrict_clauses;
+	pathnode->jpath.filterrestrictinfo = filtering_clauses;
 	pathnode->path_hashclauses = hashclauses;
 	/* final_cost_hashjoin will fill in pathnode->num_batches */
 
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 9442e5f..c137b09 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -54,9 +54,6 @@ get_relation_info_hook_type get_relation_info_hook = NULL;
 static bool infer_collation_opclass_match(InferenceElem *elem, Relation idxRel,
 							  List *idxExprs);
 static int32 get_rel_data_width(Relation rel, int32 *attr_widths);
-static List *get_relation_constraints(PlannerInfo *root,
-						 Oid relationObjectId, RelOptInfo *rel,
-						 bool include_notnull);
 static List *build_index_tlist(PlannerInfo *root, IndexOptInfo *index,
 				  Relation heapRelation);
 
@@ -1022,7 +1019,7 @@ get_relation_data_width(Oid relid, int32 *attr_widths)
  * run, and in many cases it won't be invoked at all, so there seems no
  * point in caching the data in RelOptInfo.
  */
-static List *
+List *
 get_relation_constraints(PlannerInfo *root,
 						 Oid relationObjectId, RelOptInfo *rel,
 						 bool include_notnull)
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 68a93a1..d6717db 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -496,19 +496,24 @@ build_joinrel_tlist(PlannerInfo *root, RelOptInfo *joinrel,
 {
 	Relids		relids = joinrel->relids;
 	ListCell   *vars;
+	int			nth = 0;
 
 	foreach(vars, input_rel->reltargetlist)
 	{
 		Var		   *var = (Var *) lfirst(vars);
 		RelOptInfo *baserel;
 		int			ndx;
+		bool		is_needed = false;
 
 		/*
 		 * Ignore PlaceHolderVars in the input tlists; we'll make our own
 		 * decisions about whether to copy them.
 		 */
 		if (IsA(var, PlaceHolderVar))
+		{
+			nth++;
 			continue;
+		}
 
 		/*
 		 * Otherwise, anything in a baserel or joinrel targetlist ought to be
@@ -521,15 +526,83 @@ build_joinrel_tlist(PlannerInfo *root, RelOptInfo *joinrel,
 
 		/* Get the Var's original base rel */
 		baserel = find_base_rel(root, var->varno);
+		ndx = var->varattno - baserel->min_attr;
+
+		/*
+		 * We must handle the join-pushdown case.
+		 */
+		if (input_rel->reloptkind == RELOPT_OTHER_MEMBER_REL)
+		{
+			/* Get the Var's PARENT base rel */
+			Index	parent_relid =
+						find_childrel_appendrelinfo(root, input_rel)->parent_relid;
+			RelOptInfo *parent_rel = find_base_rel(root, parent_relid);
+			Var		*parent_var =
+						(Var *) list_nth(parent_rel->reltargetlist, nth);
+			int		parent_ndx = parent_var->varattno - parent_rel->min_attr;
+			/* Relids have included parent_rel's instead of input_rel's. */
+			Relids	relids_tmp =
+					bms_del_members(bms_copy(relids), input_rel->relids);
+
+			relids_tmp = bms_union(relids_tmp, parent_rel->relids);
+
+			Assert(ndx == parent_ndx);
+			is_needed =
+					(bms_nonempty_difference(
+							parent_rel->attr_needed[parent_ndx],
+							relids_tmp));
+
+			bms_free(relids_tmp);
+		}
+		else
+		{
+			Relids	relids_tmp =
+					bms_del_members(bms_copy(relids), input_rel->relids);
+			Index	another_relid = -1;
+
+			/* Try to detect Inner relation of pushed-down join. */
+			if (bms_get_singleton_member(relids_tmp, &another_relid))
+			{
+				RelOptInfo	*another_rel =
+						find_base_rel(root, another_relid);
+
+				if (another_rel->reloptkind == RELOPT_OTHER_MEMBER_REL)
+				{
+					/* This may be inner relation of pushed-down join. */
+					Index	parent_relid =
+								find_childrel_appendrelinfo(root, another_rel)->parent_relid;
+					RelOptInfo *parent_rel = find_base_rel(root, parent_relid);
+
+					bms_free(relids_tmp);
+					relids_tmp =
+							bms_union(input_rel->relids, parent_rel->relids);
+				}
+			}
+
+			if (!bms_is_subset(input_rel->relids, relids_tmp))
+			{
+				/* Can't detect inner relation of pushed-down join */
+				bms_free(relids_tmp);
+				relids_tmp = bms_copy(relids);
+			}
+
+			is_needed =
+					(bms_nonempty_difference(
+							baserel->attr_needed[ndx],
+							relids_tmp));
+
+			bms_free(relids_tmp);
+		}
 
 		/* Is it still needed above this joinrel? */
-		ndx = var->varattno - baserel->min_attr;
-		if (bms_nonempty_difference(baserel->attr_needed[ndx], relids))
+		if (is_needed)
 		{
 			/* Yup, add it to the output */
 			joinrel->reltargetlist = lappend(joinrel->reltargetlist, var);
 			joinrel->width += baserel->attr_widths[ndx];
 		}
+
+		nth++;
 	}
 }
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4ae2f3e..f2b56d0 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1644,6 +1644,7 @@ typedef struct JoinState
 	PlanState	ps;
 	JoinType	jointype;
 	List	   *joinqual;		/* JOIN quals (in addition to ps.qual) */
+	List	   *filterqual;		/* FILTER quals (in addition to ps.qual) */
 } JoinState;
 
 /* ----------------
@@ -1957,6 +1958,7 @@ typedef struct UniqueState
 typedef struct HashState
 {
 	PlanState	ps;				/* its first field is NodeTag */
+	List	   *filterqual;		/* FILTER quals (in addition to ps.qual) */
 	HashJoinTable hashtable;	/* hash table for the hashjoin */
 	List	   *hashkeys;		/* list of ExprState nodes */
 	/* hashkeys is same as parent's hj_InnerHashKeys */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index cc259f1..37cef10 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -593,6 +593,7 @@ typedef struct Join
 	Plan		plan;
 	JoinType	jointype;
 	List	   *joinqual;		/* JOIN quals (in addition to plan.qual) */
+	List	   *filterqual;		/* FILTER quals (in addition to plan.qual) */
 } Join;
 
 /* ----------------
@@ -764,6 +765,7 @@ typedef struct Unique
 typedef struct Hash
 {
 	Plan		plan;
+	List	   *filterqual;		/* FILTER quals (in addition to plan.qual) */
 	Oid			skewTable;		/* outer join key's table OID, or InvalidOid */
 	AttrNumber	skewColumn;		/* outer join key's column #, or zero */
 	bool		skewInherit;	/* is outer join rel an inheritance tree? */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 79bed33..83d41de 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1058,6 +1058,7 @@ typedef struct JoinPath
 	Path	   *innerjoinpath;	/* path for the inner side of the join */
 
 	List	   *joinrestrictinfo;		/* RestrictInfos to apply to join */
+	List	   *filterrestrictinfo;		/* RestrictInfos to filter at join */
 
 	/*
 	 * See the notes for RelOptInfo and ParamPathInfo to understand why
@@ -1706,6 +1707,7 @@ typedef struct JoinPathExtraData
 {
 	List	   *restrictlist;
 	List	   *mergeclause_list;
+	List	   *added_restrictlist;
 	SpecialJoinInfo *sjinfo;
 	SemiAntiJoinFactors semifactors;
 	Relids		param_source_rels;
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index dd43e45..e685a8d 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -137,6 +137,7 @@ extern void initial_cost_hashjoin(PlannerInfo *root,
 					  JoinCostWorkspace *workspace,
 					  JoinType jointype,
 					  List *hashclauses,
+					  List *added_restrictlist,
 					  Path *outer_path, Path *inner_path,
 					  SpecialJoinInfo *sjinfo,
 					  SemiAntiJoinFactors *semifactors);
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 161644c..9d3718d 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -97,6 +97,7 @@ extern NestPath *create_nestloop_path(PlannerInfo *root,
 					 Path *outer_path,
 					 Path *inner_path,
 					 List *restrict_clauses,
+					 List *filtering_clauses,
 					 List *pathkeys,
 					 Relids required_outer);
 
@@ -108,6 +109,7 @@ extern MergePath *create_mergejoin_path(PlannerInfo *root,
 					  Path *outer_path,
 					  Path *inner_path,
 					  List *restrict_clauses,
+					  List *filtering_clauses,
 					  List *pathkeys,
 					  Relids required_outer,
 					  List *mergeclauses,
@@ -123,6 +125,7 @@ extern HashPath *create_hashjoin_path(PlannerInfo *root,
 					 Path *outer_path,
 					 Path *inner_path,
 					 List *restrict_clauses,
+					 List *filtering_clauses,
 					 Relids required_outer,
 					 List *hashclauses);
 
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 87123a5..f038f5d 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -87,7 +87,8 @@ extern void create_tidscan_paths(PlannerInfo *root, RelOptInfo *rel);
 extern void add_paths_to_joinrel(PlannerInfo *root, RelOptInfo *joinrel,
 					 RelOptInfo *outerrel, RelOptInfo *innerrel,
 					 JoinType jointype, SpecialJoinInfo *sjinfo,
-					 List *restrictlist);
+					 List *restrictlist, List *added_restrictlist,
+					 bool added_rinfo_for_outer);
 
 /*
  * joinrels.c
@@ -95,7 +96,7 @@ extern void add_paths_to_joinrel(PlannerInfo *root, RelOptInfo *joinrel,
  */
 extern void join_search_one_level(PlannerInfo *root, int level);
 extern RelOptInfo *make_join_rel(PlannerInfo *root,
-			  RelOptInfo *rel1, RelOptInfo *rel2);
+			  RelOptInfo *rel1, RelOptInfo *rel2, List *added_restrictlist);
 extern bool have_join_order_restriction(PlannerInfo *root,
 							RelOptInfo *rel1, RelOptInfo *rel2);
 
diff --git a/src/include/optimizer/plancat.h b/src/include/optimizer/plancat.h
index 11e7d4d..f799a5b 100644
--- a/src/include/optimizer/plancat.h
+++ b/src/include/optimizer/plancat.h
@@ -28,6 +28,10 @@ extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;
 extern void get_relation_info(PlannerInfo *root, Oid relationObjectId,
 				  bool inhparent, RelOptInfo *rel);
 
+extern List *get_relation_constraints(PlannerInfo *root,
+						 Oid relationObjectId, RelOptInfo *rel,
+						 bool include_notnull);
+
 extern List *infer_arbiter_indexes(PlannerInfo *root);
 
 extern void estimate_rel_size(Relation rel, int32 *attr_widths,
#4Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Taiki Kondo (#3)
Re: [Proposal] Table partition + join pushdown

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
Sent: Thursday, September 24, 2015 8:06 PM
To: Kaigai Kouhei(海外 浩平)
Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

Hello, KaiGai-san.

Thank you for your comments, and sorry for the late response.

The attached patch is a complete rewrite of the previous patch [1],
following your suggestion [2].
Please find attached.

Thanks for your work. Since the last submission was in August, let me
briefly reintroduce the purpose of this work.

His work intends (1) to reduce the resource consumption of table joins
today, and (2) to provide infrastructure for one parallel-join scenario
once the Funnel node becomes capable of it.

Today, even if we construct partitioned tables, their constraints cannot
be used to filter out candidate rows on the other side of a join. As a
result, the hash table may grow larger than necessary, causing an
unnecessary increase in nBatch.

Below is the scenario this project tries to tackle. When a join takes a
partitioned table on one side, we usually have to scan the entire
partitioned table unless particular child tables can be dropped.

XXXXJoin cond (x = y)
  -> Append
     -> SeqScan on tbl_child_0 ... CHECK (hash_func(x) % 4 = 0)
     -> SeqScan on tbl_child_1 ... CHECK (hash_func(x) % 4 = 1)
     -> SeqScan on tbl_child_2 ... CHECK (hash_func(x) % 4 = 2)
     -> SeqScan on tbl_child_3 ... CHECK (hash_func(x) % 4 = 3)
  -> Hash
     -> SeqScan on other_table

However, the CHECK() constraints assigned to the child tables tell us
which rows on the other side can never be related to this join.
For example, all the rows of other_table to be joined with tbl_child_0
must satisfy hash_func(y) % 4 = 0. In this case we can omit the unrelated
rows from the hash table, which eventually reduces its size.

In case of INNER_JOIN, we can rewrite the query execution plan as below.

Append
  -> HashJoin cond (x = y)
     -> SeqScan on tbl_child_0
     -> Hash ... Filter: hash_func(y) % 4 = 0
        -> SeqScan on other_table
  -> HashJoin cond (x = y)
     -> SeqScan on tbl_child_1
     -> Hash ... Filter: hash_func(y) % 4 = 1
        -> SeqScan on other_table
  -> HashJoin cond (x = y)
     -> SeqScan on tbl_child_2
     -> Hash ... Filter: hash_func(y) % 4 = 2
        -> SeqScan on other_table
  -> HashJoin cond (x = y)
     -> SeqScan on tbl_child_3
     -> Hash ... Filter: hash_func(y) % 4 = 3
        -> SeqScan on other_table

Unrelated rows are excluded from the hash table up front, which helps
avoid splitting the hash table when its size reaches the work_mem
limitation.

This join pushdown is valuable for hash join, and for merge join when it
takes an unsorted relation and the number of rows to be sorted is a
performance factor. Also, once Funnel becomes capable of running Append
on background workers, it will also help run NestLoop in parallel.

What do third parties think? I'm a bit biased, of course.

OK, below are brief comments on the patch.

* How about focusing only on HashJoin in the first version?
This patch also adds support for NestLoop and MergeJoin; however, NestLoop
has no valuable scenario without parallel execution capability, and the
most valuable scenario for MergeJoin is reducing the number of rows prior
to the Sort. Once the input rows are sorted, filtering them out is less
attractive.

* MultiExecHash() puts the slot on outer_slot and then moves it to inner_slot
This patch adds set_hash_references() to replace the varnos in the
expression of Hash->filterqual with OUTER_VAR. Why not INNER_VAR?
If the Var nodes were initialized to reference inner_slot, you would not
need to reassign the slot.

I'll investigate this more deeply later.

This patch contains the following implementation choices, but I cannot
determine whether they are correct.

1. Cost estimation
In this patch, an additional row filter is implemented for Hash join,
Merge join, and Nested Loop.
I implemented the cost estimation for this filter by imitating the other
filter-related code, but I am not sure this implementation is correct.

@@ -2835,6 +2864,8 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
     * not all of the quals may get evaluated at each tuple.)
     */
    startup_cost += qp_qual_cost.startup;
+   startup_cost += filter_qual_cost.startup +
+           filter_qual_cost.per_tuple * inner_path_rows;
    cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
    run_cost += cpu_per_tuple * hashjointuples;

This seems to me an unfair estimation, because inner_path_rows is the row
count remaining after filtering, while filter_qual must be evaluated
against all the inner input rows.

2. Workaround for a failing assertion in allpaths.c
In standard_join_search(), we expect to have a single rel at the final
level. But join pushdown breaks this expectation, because it searches
combinations at the final level that the original standard_join_search()
does not. Therefore, once a join pushdown succeeds, the assertion in
allpaths.c fails.

So I implemented a workaround that temporarily sets root->join_rel_level
to NULL while trying the join pushdown, but I think this implementation
may be wrong.

It is my off-list suggestion. standard_join_search() expects the root of the
partitioned tables to appear, but child tables are out of its scope.
Once we try to push down the join under the Append, we need to consider
joins between the inner table and every outer child table; however,
these should not be visible to the standard_join_search() context.
From the standpoint of standard_join_search(), it gets an AppendPath that
represents a join of tables A and B, even if A contains 100 children and
the join was pushed down on behalf of the AppendPath.
So, it is a reasonable way to set root->join_rel_level to NULL to avoid
unexpected RelOptInfo additions by build_join_rel().
"To avoid the assertion" is one fact; however, the intention of the code is
to avoid pollution of the global data structure. ;-)

3. Searching pathkeys for Merge Join
When the join pushdown feature chooses merge join for the pushed-down join
operation, the planner fails to create the merge join node because it is
unable to find pathkeys for this merge join. I found this is caused by
skipping child tables when finding pathkeys.

I expect that this skipping is for making the planner faster, so I changed it
so that the planner doesn't skip child tables when finding pathkeys for merge
join. But I am not sure this expectation is correct.

I'd like to recommend omitting MergeJoin support in the first version.

Thanks,

Any comments/suggestions are welcome.

Remarks :
[1]
/messages/by-id/12A9442FBAE80D4E8953883E0B84E0885C01FD@BPXM01GP.gisp.nec.co.jp
[2]
/messages/by-id/9A28C8860F777E439AA12E8AEA7694F8011345B6@BPXM15GP.gisp.nec.co.jp

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

-----Original Message-----
From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com]
Sent: Tuesday, August 18, 2015 5:47 PM
To: Kondo Taiki(近藤 太樹); pgsql-hackers@postgresql.org
Cc: Iwaasa Akio(岩浅 晃郎)
Subject: RE: [Proposal] Table partition + join pushdown

Hello Kondo-san,

I briefly checked your patch. Let me put some comments about its design and
implementation, even though I have no arguments towards its concept. :-)

* Construction of RelOptInfo

In your patch, try_hashjoin_pushdown(), called by try_hashjoin_path(),
constructs the RelOptInfo of the join-rel between the inner-rel and a subpath
of the Append node. It is an entirely wrong implementation.

I can understand we (may) have no RelOptInfo for the joinrel between
tbl_child_0 and other_table, when the planner investigates a joinpath to join
the Append path with other_table.

HashJoin
-> Append
-> SeqScan on tbl_child_0
-> SeqScan on tbl_child_1
-> SeqScan on tbl_child_2
-> SeqScan on tbl_child_3
-> Hash
-> SeqScan on other_table

How about these alternatives?

- Call make_join_rel() on the pair of tbl_child_X and other_table
  from try_hashjoin_pushdown() or somewhere. make_join_rel() internally
  constructs a RelOptInfo for the supplied pair of relations, so the
  relevant RelOptInfo shall be properly constructed.
- make_join_rel() also calls add_paths_to_joinrel() for all the
  join logic, so it makes it easier to support pushing down other join
  logic, including nested-loop or custom-join.
- It may be an idea to add an extra argument to make_join_rel() to
  pass the expressions to be applied for tuple filtering on
  construction of the inner hash table.

* Why only SeqScan is supported

I think it is the role of the Hash node to filter out inner tuples obviously
unrelated to the join (if the CHECK constraint of the outer relation gives the
information), because this join-pushdown may be able to support multi-stacked
pushdown.

For example, what if the planner considers a path to join this Append-path
with another relation, and the join clause contains a reference to X?

Append
-> HashJoin
-> SeqScan on tbl_child_0
-> Hash ... Filter: hash_func(X) % 4 = 0
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_1
-> Hash ... Filter: hash_func(X) % 4 = 1
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_2
-> Hash ... Filter: hash_func(X) % 4 = 2
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_3
-> Hash ... Filter: hash_func(X) % 4 = 3
-> SeqScan on other_table

It may be a good challenge to consider additional join pushdown, even if the
subpaths of the Append are HashJoins, not SeqScans, like:

Append
-> HashJoin
-> HashJoin
-> SeqScan on tbl_child_0
-> Hash ... Filter: hash_func(X) % 4 = 0
-> SeqScan on other_table
-> Hash ... Filter: hash_func(X) % 4 = 0
-> SeqScan on another_table
-> HashJoin
-> HashJoin
-> SeqScan on tbl_child_1
-> Hash ... Filter: hash_func(X) % 4 = 1
-> SeqScan on other_table
-> Hash ... Filter: hash_func(X) % 4 = 1
-> SeqScan on another_table
-> HashJoin
-> HashJoin
-> SeqScan on tbl_child_2
-> Hash ... Filter: hash_func(X) % 4 = 2
-> SeqScan on other_table
-> Hash ... Filter: hash_func(X) % 4 = 2
-> SeqScan on another_table
-> HashJoin
-> HashJoin
-> SeqScan on tbl_child_3
-> Hash ... Filter: hash_func(X) % 4 = 3
-> SeqScan on other_table
-> Hash ... Filter: hash_func(X) % 4 = 3
-> SeqScan on another_table

In this case, the underlying nodes are not always SeqScan. So, only the
Hash node can have filter clauses.

* Way to handle expression nodes

All this patch supports is CHECK() constraints with an equality operation on
the INT4 data type. You can learn about various useful pieces of PostgreSQL
infrastructure. For example, ...
- expression_tree_mutator() is useful to make a copy of expression
node with small modification
- pull_varnos() is useful to check which relations are referenced
by the expression node.
- RestrictInfo->can_join is useful to check whether the clause is
binary operator, or not.

Anyway, reuse of existing infrastructure is the best way to build a reliable
feature and to keep the implementation simple.

Thanks,
--
NEC Business Creation Division / PG-Strom Project KaiGai Kohei
<kaigai@ak.jp.nec.com>

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
Sent: Thursday, August 13, 2015 6:30 PM
To: pgsql-hackers@postgresql.org
Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎)
Subject: [HACKERS] [Proposal] Table partition + join pushdown

Hi all,

I saw the email about the idea from KaiGai-san[1], and I worked to
implement this idea.

Now, I have implemented a part of this idea, so I want to propose this
feature.

Patch attached just shows my concept of this feature.
It works fine for EXPLAIN, but it returns wrong results for other operations,
sadly.

Table partition + join pushdown
===============================

Motivation
----------
To make join logic working more effectively, it is important to make
the size of relations smaller.

Especially in Hash-join, it is meaningful to make the inner relation
smaller, because smaller inner relation can be stored within smaller hash table.
This means that memory usage can be reduced when joining with big tables.

Design
------
It was mentioned by the email from KaiGai-san.
So I quote below here...

---- begin quotation ---
Let's assume a table which is partitioned to four portions, and
individual child relations have constraint by hash-value of its ID
field.

tbl_parent
+ tbl_child_0 ... CHECK(hash_func(id) % 4 = 0)
+ tbl_child_1 ... CHECK(hash_func(id) % 4 = 1)
+ tbl_child_2 ... CHECK(hash_func(id) % 4 = 2)
+ tbl_child_3 ... CHECK(hash_func(id) % 4 = 3)

If someone tried to join another relation with tbl_parent using
equivalence condition, like X = tbl_parent.ID, we know inner tuples
that does not satisfies the condition
hash_func(X) % 4 = 0
shall be never joined to the tuples in tbl_child_0.
So, we can omit to load these tuples to inner hash table preliminary,
then it potentially allows to split the inner hash-table.

Current typical plan structure is below:

HashJoin
-> Append
-> SeqScan on tbl_child_0
-> SeqScan on tbl_child_1
-> SeqScan on tbl_child_2
-> SeqScan on tbl_child_3
-> Hash
-> SeqScan on other_table

It may be rewritable to:

Append
-> HashJoin
-> SeqScan on tbl_child_0
-> Hash ... Filter: hash_func(X) % 4 = 0
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_1
-> Hash ... Filter: hash_func(X) % 4 = 1
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_2
-> Hash ... Filter: hash_func(X) % 4 = 2
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_3
-> Hash ... Filter: hash_func(X) % 4 = 3
-> SeqScan on other_table
---- end quotation ---

In the quotation above, the filter is set at the Hash node.
But I implemented it so that the filter is set at the SeqScan node under the
Hash node. In my opinion, filtering tuples is the Scanner's job.

Append
-> HashJoin
-> SeqScan on tbl_child_0
-> Hash
-> SeqScan on other_table ... Filter: hash_func(X) % 4 = 0
-> HashJoin
-> SeqScan on tbl_child_1
-> Hash
-> SeqScan on other_table ... Filter: hash_func(X) % 4 = 1
-> HashJoin
-> SeqScan on tbl_child_2
-> Hash
-> SeqScan on other_table ... Filter: hash_func(X) % 4 = 2
-> HashJoin
-> SeqScan on tbl_child_3
-> Hash
-> SeqScan on other_table ... Filter: hash_func(X) % 4 = 3

API
---
There are 3 new internal (static) functions implementing this feature.
try_hashjoin_pushdown(), the main function of this feature, is
called from try_hashjoin_path(), and tries to push the HashPath down under
the AppendPath.

To do so, this function performs the following operations.

1. Check if this Hash-join can be pushed down under the AppendPath.
2. To avoid influencing other path-making operations,
   copy the inner path's RelOptInfo and make a new SeqScan path from it.
   Here, get the CHECK() constraints from the OUTER path, and convert their
   Var nodes according to the join condition. Also convert the Var nodes
   in the join condition itself.
3. Create new HashPath nodes between each sub-path of the AppendPath and
   the inner path made above.
4. When operations 1 to 3 are done for each sub-path,
   create a new AppendPath whose sub-paths are the HashPath nodes made above.

get_replaced_clause_constr() is called from try_hashjoin_pushdown(),
and get_var_nodes_recurse() is called from get_replaced_clause_constr().
These 2 functions help with the above operations.
(I may revise this part to use expression_tree_walker() and
expression_tree_mutator().)

The attached patch has the following limitations.
o It only works for hash-join operations.
  (I want to support not only hash-join but also other join logic.)
o Join conditions must be the "=" operator with int4 variables.
o The inner path must be a SeqScan.
  (I want to support other path nodes.)
o For now, the planner may not choose this plan,
  because its estimated costs are usually larger than those of the original
  (non-pushdown) plan.

Also, 1 internal (static) function, get_relation_constraints()
defined in plancat.c, is changed to global. This function is
called from get_replaced_clause_constr() to get the CHECK() constraints.

Usage
-----
To use this feature, create partition tables and small table to join,
and run select operation with joining these tables.

For your convenience, I attach DDL and DML script.
And I also attach the result of EXPLAIN.

Any comments are welcome. But, first of all, I need your advice to
correct this patch's behavior.

At least, I think it has to expand the array of RangeTblEntry and the other
arrays defined in PlannerInfo to register new RelOptInfos for the new Path
nodes mentioned above.

Or is it a better choice to modify the query parser to implement this
feature further?

Remarks :
[1]
/messages/by-id/9A28C8860F777E439AA12E8AEA7694F8010F672B@BPXM15GP.gisp.nec.co.jp

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Taiki Kondo
tai-kondo@yk.jp.nec.com
In reply to: Kouhei Kaigai (#4)
Re: [Proposal] Table partition + join pushdown

Hello, KaiGai-san

Thank you for your introduction and comments.

* Suppose we focus on only HashJoin in the first version?
This patch also adds support for NestLoop and MergeJoin; however, NestLoop
has no valuable scenario without parallel execution capability, and the
most valuable scenario for MergeJoin is reduction of rows prior to Sort.
Once input rows get sorted, it is less attractive to filter out rows.

I agree that handling NestLoop doesn't make sense at this time.
But I think that handling MergeJoin still makes sense at this time.

In my v1 patch, I implemented the additional filter as a qualification at the
same place as the join filter, same as NestLoop.
It is indeed not useful. I agree with you on this point.

I think, as you also mentioned, that a large factor of the cost estimation for
MergeJoin is the Sort under the MergeJoin, so I believe additional filtering
at the Sort is a good choice for this situation, in the same way as at the
Hash under a HashJoin.

Furthermore, I think a better way is that the additional filtering shall be
"added" to the Scan node under each child (pushed-down) Join node, because we
don't have to implement additional qualification at the Join nodes;
we only have to concatenate the original and additional
RestrictInfos for filtering.

As a mere idea, to realize this, I think we have to implement copyObject()
for Scan path nodes and use ppi_clauses for this purpose.

What is your opinion?

* MultiExecHash() once puts the slot on outer_slot then moves it to inner_slot
This patch adds set_hash_references() to replace varno in the expression
of Hash->filterqual with OUTER_VAR. Why not INNER_VAR?
If the Var nodes were initialized to reference inner_slot, you wouldn't need
to re-assign the slot.

The node under the Hash node is connected as the OUTER node. This
implementation may come from set_dummy_tlist_references(), which is commonly
used by Material, Sort, Unique, SetOp, and Hash.

And I faced a problem when I was implementing EXPLAIN for the additional
filter. I implemented it the same way as you mentioned above, and then an
error occurred while running EXPLAIN.
I think EXPLAIN expects an expression's varno to match the position that the
underlying node is connected to; i.e. if it is connected as OUTER, the varno
must be OUTER_VAR.

It seems to me this is not a fair estimation, because inner_path_rows is the
number of rows remaining after the filter, but filter_qual shall be applied
to all the inner input rows.

OK. I'll fix it.

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
Sent: Tuesday, September 29, 2015 11:46 AM
To: Taiki Kondo
Cc: Akio Iwaasa; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
Sent: Thursday, September 24, 2015 8:06 PM
To: Kaigai Kouhei(海外 浩平)
Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

Hello, KaiGai-san.

Thank you for your comment, and sorry for late response.

The attached patch is completely rewritten from previous patch[1], at your
suggestion[2].
Please find attached.

Thanks for your work, and let me briefly introduce the purpose of the work,
because the last submission was in August.

His work intends (1) to save resource consumption on table joins right now,
and (2) to provide an infrastructure for one parallel join scenario
once the Funnel node becomes capable.

Even if we construct partitioned tables, we cannot utilize them to
filter out candidate rows of the join. As a result, the size of the Hash
table may grow more than necessary, and it causes unnecessary nBatch
increases.

Below is the scenario this project tries to tackle. When a join takes a
partitioned table on one side, we usually need to scan the
entire partitioned table unless we can drop particular child tables.

XXXXJoin cond (x = y)
-> Append
-> SeqScan on tbl_child_0 ... CHECK (hash_func(x) % 4 = 0)
-> SeqScan on tbl_child_1 ... CHECK (hash_func(x) % 4 = 1)
-> SeqScan on tbl_child_2 ... CHECK (hash_func(x) % 4 = 2)
-> SeqScan on tbl_child_3 ... CHECK (hash_func(x) % 4 = 3)
-> Hash
-> SeqScan on other_table

However, the CHECK() constraints assigned to the child tables give us a hint
about which rows on the other side can never be related to this join.
For example, all the rows in other_table to be joined with tbl_child_0
must satisfy hash_func(y) % 4 = 0. In this case we may be able to omit
the unrelated rows from the hash table, and that eventually allows us
to reduce the size of the hash table.

In case of INNER_JOIN, we can rewrite the query execution plan as below.

Append
-> HashJoin cond (x = y)
-> SeqScan on tbl_child_0
-> Hash ... Filter: hash_func(y) % 4 = 0
-> SeqScan on other_table
-> HashJoin cond (x = y)
-> SeqScan on tbl_child_1
-> Hash ... Filter: hash_func(y) % 4 = 1
-> SeqScan on other_table
-> HashJoin cond (x = y)
-> SeqScan on tbl_child_2
-> Hash ... Filter: hash_func(y) % 4 = 2
-> SeqScan on other_table
-> HashJoin cond (x = y)
-> SeqScan on tbl_child_3
-> Hash ... Filter: hash_func(y) % 4 = 3
-> SeqScan on other_table

Unrelated rows are filtered out of the hash table preliminarily, which allows
us to avoid hash table splits when its size reaches the work_mem limitation.

This join-pushdown is valuable on hash-join, and on merge-join if the MJ
takes an unsorted relation and the number of rows to be sorted is a
performance factor. Also, once Funnel becomes capable of running Append on
background workers, it is also helpful to run NestLoop in parallel.

How about the opinion from third parties? I'm a bit biased, of course.

OK, below is the brief comment to patch.

* Suppose we focus on only HashJoin in the first version?
This patch also adds support for NestLoop and MergeJoin; however, NestLoop
has no valuable scenario without parallel execution capability, and the
most valuable scenario for MergeJoin is reduction of rows prior to Sort.
Once input rows get sorted, it is less attractive to filter out rows.

* MultiExecHash() once puts the slot on outer_slot then moves it to inner_slot
This patch adds set_hash_references() to replace varno in the expression
of Hash->filterqual with OUTER_VAR. Why not INNER_VAR?
If the Var nodes were initialized to reference inner_slot, you wouldn't need
to re-assign the slot.

I'll try to have deeper investigation, later.

This patch contains the following implementation, but I can't determine
whether it is correct or wrong.

1. Cost estimation
In this patch, an additional row filter is implemented for Hash join, Merge
join and Nested Loop.
I implemented the cost estimation for this filter by imitating other
filtering code, but I am not sure this implementation is correct.

@@ -2835,6 +2864,8 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
     * not all of the quals may get evaluated at each tuple.)
     */
    startup_cost += qp_qual_cost.startup;
+   startup_cost += filter_qual_cost.startup +
+           filter_qual_cost.per_tuple * inner_path_rows;
    cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
    run_cost += cpu_per_tuple * hashjointuples;

It seems to me this is not a fair estimation, because inner_path_rows is the
number of rows remaining after the filter, but filter_qual shall be applied
to all the inner input rows.

2. Workaround for failing assertion at allpaths.c
In standard_join_search(), we expect to have a single rel at the final level.
But this expectation is broken by the join pushdown feature, because it
searches combinations at the final level that the original
standard_join_search() never examines. Therefore, once join pushdown
succeeds, an assertion failure occurs in allpaths.c.

So I implemented a workaround that temporarily sets root->join_rel_level to
NULL while trying join pushdown, but I think this implementation may be wrong.

It is my off-list suggestion. standard_join_search() expects the root of the
partitioned tables to appear, but child tables are out of its scope.
Once we try to push down the join under the Append, we need to consider
joins between the inner table and every outer child table; however,
these should not be visible to the standard_join_search() context.
From the standpoint of standard_join_search(), it gets an AppendPath that
represents a join of tables A and B, even if A contains 100 children and
the join was pushed down on behalf of the AppendPath.
So, it is a reasonable way to set root->join_rel_level to NULL to avoid
unexpected RelOptInfo additions by build_join_rel().
"To avoid the assertion" is one fact; however, the intention of the code is
to avoid pollution of the global data structure. ;-)

3. Searching pathkeys for Merge Join
When the join pushdown feature chooses merge join for the pushed-down join
operation, the planner fails to create the merge join node because it is
unable to find pathkeys for this merge join. I found this is caused by
skipping child tables when finding pathkeys.

I expect that this skipping is for making the planner faster, so I changed it
so that the planner doesn't skip child tables when finding pathkeys for merge
join. But I am not sure this expectation is correct.

I'd like to recommend omitting MergeJoin support in the first version.

Thanks,

Any comments/suggestions are welcome.

Remarks :
[1]
/messages/by-id/12A9442FBAE80D4E8953883E0B84E0885C01FD@BPXM01GP.gisp.nec.co.jp
[2]
/messages/by-id/9A28C8860F777E439AA12E8AEA7694F8011345B6@BPXM15GP.gisp.nec.co.jp

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

-----Original Message-----
From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com]
Sent: Tuesday, August 18, 2015 5:47 PM
To: Kondo Taiki(近藤 太樹); pgsql-hackers@postgresql.org
Cc: Iwaasa Akio(岩浅 晃郎)
Subject: RE: [Proposal] Table partition + join pushdown

Hello Kondo-san,

I briefly checked your patch. Let me put some comments about its design and
implementation, even though I have no arguments towards its concept. :-)

* Construction of RelOptInfo

In your patch, try_hashjoin_pushdown(), called by try_hashjoin_path(),
constructs the RelOptInfo of the join-rel between the inner-rel and a subpath
of the Append node. It is an entirely wrong implementation.

I can understand we (may) have no RelOptInfo for the joinrel between
tbl_child_0 and other_table, when the planner investigates a joinpath to join
the Append path with other_table.

HashJoin
-> Append
-> SeqScan on tbl_child_0
-> SeqScan on tbl_child_1
-> SeqScan on tbl_child_2
-> SeqScan on tbl_child_3
-> Hash
-> SeqScan on other_table

How about these alternatives?

- Call make_join_rel() on the pair of tbl_child_X and other_table
  from try_hashjoin_pushdown() or somewhere. make_join_rel() internally
  constructs a RelOptInfo for the supplied pair of relations, so the
  relevant RelOptInfo shall be properly constructed.
- make_join_rel() also calls add_paths_to_joinrel() for all the
  join logic, so it makes it easier to support pushing down other join
  logic, including nested-loop or custom-join.
- It may be an idea to add an extra argument to make_join_rel() to
  pass the expressions to be applied for tuple filtering on
  construction of the inner hash table.

* Why only SeqScan is supported

I think it is the role of the Hash node to filter out inner tuples obviously
unrelated to the join (if the CHECK constraint of the outer relation gives the
information), because this join-pushdown may be able to support multi-stacked
pushdown.

For example, what if the planner considers a path to join this Append-path
with another relation, and the join clause contains a reference to X?

Append
-> HashJoin
-> SeqScan on tbl_child_0
-> Hash ... Filter: hash_func(X) % 4 = 0
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_1
-> Hash ... Filter: hash_func(X) % 4 = 1
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_2
-> Hash ... Filter: hash_func(X) % 4 = 2
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_3
-> Hash ... Filter: hash_func(X) % 4 = 3
-> SeqScan on other_table

It may be a good challenge to consider additional join pushdown, even if the
subpaths of the Append are HashJoins, not SeqScans, like:

Append
-> HashJoin
-> HashJoin
-> SeqScan on tbl_child_0
-> Hash ... Filter: hash_func(X) % 4 = 0
-> SeqScan on other_table
-> Hash ... Filter: hash_func(X) % 4 = 0
-> SeqScan on another_table
-> HashJoin
-> HashJoin
-> SeqScan on tbl_child_1
-> Hash ... Filter: hash_func(X) % 4 = 1
-> SeqScan on other_table
-> Hash ... Filter: hash_func(X) % 4 = 1
-> SeqScan on another_table
-> HashJoin
-> HashJoin
-> SeqScan on tbl_child_2
-> Hash ... Filter: hash_func(X) % 4 = 2
-> SeqScan on other_table
-> Hash ... Filter: hash_func(X) % 4 = 2
-> SeqScan on another_table
-> HashJoin
-> HashJoin
-> SeqScan on tbl_child_3
-> Hash ... Filter: hash_func(X) % 4 = 3
-> SeqScan on other_table
-> Hash ... Filter: hash_func(X) % 4 = 3
-> SeqScan on another_table

In this case, the underlying nodes are not always SeqScan. So, only the
Hash node can have filter clauses.

* Way to handle expression nodes

All this patch supports is CHECK() constraints with an equality operation on
the INT4 data type. You can learn about various useful pieces of PostgreSQL
infrastructure. For example, ...
- expression_tree_mutator() is useful to make a copy of expression
node with small modification
- pull_varnos() is useful to check which relations are referenced
by the expression node.
- RestrictInfo->can_join is useful to check whether the clause is
binary operator, or not.

Anyway, reuse of existing infrastructure is the best way to build a reliable
feature and to keep the implementation simple.

Thanks,
--
NEC Business Creation Division / PG-Strom Project KaiGai Kohei
<kaigai@ak.jp.nec.com>

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
Sent: Thursday, August 13, 2015 6:30 PM
To: pgsql-hackers@postgresql.org
Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎)
Subject: [HACKERS] [Proposal] Table partition + join pushdown

Hi all,

I saw the email about the idea from KaiGai-san[1], and I worked to
implement this idea.

Now, I have implemented a part of this idea, so I want to propose this
feature.

Patch attached just shows my concept of this feature.
It works fine for EXPLAIN, but it returns wrong results for other operations,
sadly.

Table partition + join pushdown
===============================

Motivation
----------
To make join logic working more effectively, it is important to make
the size of relations smaller.

Especially in Hash-join, it is meaningful to make the inner relation
smaller, because smaller inner relation can be stored within smaller hash table.
This means that memory usage can be reduced when joining with big tables.

Design
------
It was mentioned by the email from KaiGai-san.
So I quote below here...

---- begin quotation ---
Let's assume a table which is partitioned to four portions, and
individual child relations have constraint by hash-value of its ID
field.

tbl_parent
+ tbl_child_0 ... CHECK(hash_func(id) % 4 = 0)
+ tbl_child_1 ... CHECK(hash_func(id) % 4 = 1)
+ tbl_child_2 ... CHECK(hash_func(id) % 4 = 2)
+ tbl_child_3 ... CHECK(hash_func(id) % 4 = 3)

If someone tried to join another relation with tbl_parent using
equivalence condition, like X = tbl_parent.ID, we know inner tuples
that does not satisfies the condition
hash_func(X) % 4 = 0
shall be never joined to the tuples in tbl_child_0.
So, we can omit to load these tuples to inner hash table preliminary,
then it potentially allows to split the inner hash-table.

Current typical plan structure is below:

HashJoin
-> Append
-> SeqScan on tbl_child_0
-> SeqScan on tbl_child_1
-> SeqScan on tbl_child_2
-> SeqScan on tbl_child_3
-> Hash
-> SeqScan on other_table

It may be rewritable to:

Append
-> HashJoin
-> SeqScan on tbl_child_0
-> Hash ... Filter: hash_func(X) % 4 = 0
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_1
-> Hash ... Filter: hash_func(X) % 4 = 1
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_2
-> Hash ... Filter: hash_func(X) % 4 = 2
-> SeqScan on other_table
-> HashJoin
-> SeqScan on tbl_child_3
-> Hash ... Filter: hash_func(X) % 4 = 3
-> SeqScan on other_table
---- end quotation ---

In the quotation above, the filter is set at the Hash node.
But I implemented it so that the filter is set at the SeqScan node under the
Hash node. In my opinion, filtering tuples is the Scanner's job.

Append
-> HashJoin
-> SeqScan on tbl_child_0
-> Hash
-> SeqScan on other_table ... Filter: hash_func(X) % 4 = 0
-> HashJoin
-> SeqScan on tbl_child_1
-> Hash
-> SeqScan on other_table ... Filter: hash_func(X) % 4 = 1
-> HashJoin
-> SeqScan on tbl_child_2
-> Hash
-> SeqScan on other_table ... Filter: hash_func(X) % 4 = 2
-> HashJoin
-> SeqScan on tbl_child_3
-> Hash
-> SeqScan on other_table ... Filter: hash_func(X) % 4 = 3

API
---
There are 3 new internal (static) functions implementing this feature.
try_hashjoin_pushdown(), the main function of this feature, is
called from try_hashjoin_path(), and tries to push the HashPath down under
the AppendPath.

To do so, this function performs the following operations.

1. Check if this Hash-join can be pushed down under the AppendPath.
2. To avoid influencing other path-making operations,
   copy the inner path's RelOptInfo and make a new SeqScan path from it.
   Here, get the CHECK() constraints from the OUTER path, and convert their
   Var nodes according to the join condition. Also convert the Var nodes
   in the join condition itself.
3. Create new HashPath nodes between each sub-path of the AppendPath and
   the inner path made above.
4. When operations 1 to 3 are done for each sub-path,
   create a new AppendPath whose sub-paths are the HashPath nodes made above.

get_replaced_clause_constr() is called from try_hashjoin_pushdown(),
and get_var_nodes_recurse() is called from get_replaced_clause_constr().
These 2 functions help with the above operations.
(I may revise this part to use expression_tree_walker() and
expression_tree_mutator().)

The attached patch has the following limitations.
o It only works for hash-join operations.
  (I want to support not only hash-join but also other join logic.)
o Join conditions must be the "=" operator with int4 variables.
o The inner path must be a SeqScan.
  (I want to support other path nodes.)
o For now, the planner may not choose this plan,
  because its estimated costs are usually larger than those of the original
  (non-pushdown) plan.

Also, 1 internal (static) function, get_relation_constraints()
defined in plancat.c, is changed to global. This function is
called from get_replaced_clause_constr() to get the CHECK() constraints.

Usage
-----
To use this feature, create partition tables and small table to join,
and run select operation with joining these tables.

For your convenience, I attach DDL and DML script.
And I also attach the result of EXPLAIN.

Any comments are welcome. But, first of all, I need your advice to
correct this patch's behavior.

At least, I think it has to expand the array of RangeTblEntry and the other
arrays defined in PlannerInfo to register new RelOptInfos for the new Path
nodes mentioned above.

Or is it a better choice to modify the query parser to implement this
feature further?

Remarks :
[1]
/messages/by-id/9A28C8860F777E439AA12E8AEA7694F8010F672B@BPXM15GP.gisp.nec.co.jp

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.



#6Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Taiki Kondo (#5)
Re: [Proposal] Table partition + join pushdown

* Suppose we focus on only HashJoin in the first version?
This patch also adds support for NestLoop and MergeJoin; however, NestLoop
has no valuable scenario without parallel execution capability, and the
most valuable scenario for MergeJoin is reduction of rows prior to Sort.
Once input rows get sorted, it is less attractive to filter out rows.

I agree that handling NestLoop doesn't make sense at this time.
But I think that handling MergeJoin still makes sense at this time.

In my v1 patch, I implemented that the additional filter is used for
qualification at same place as join filter, same as NestLoop.
It is not useful indeed. I agree with you at this point.

I think, as you also mentioned, that a large factor in the cost estimation
for MergeJoin is the Sort under it, so I believe additional filtering at the
Sort is a good choice for this situation, in the same way as at the Hash
under a HashJoin.

Furthermore, I think it is better that the additional filtering be
"added" to the Scan node under each child (pushed-down) Join node, because
then we don't have to implement additional qualification at the Join nodes;
we only have to concatenate the original and additional
RestrictInfos for filtering.

As a mere idea for realizing this, I think we have to implement copyObject()
for Scan path nodes and use ppi_clauses for this purpose.

What is your opinion?

You are referring to this part of create_scan_plan(), aren't you?

/*
 * If this is a parameterized scan, we also need to enforce all the join
 * clauses available from the outer relation(s).
 *
 * For paranoia's sake, don't modify the stored baserestrictinfo list.
 */
if (best_path->param_info)
    scan_clauses = list_concat(list_copy(scan_clauses),
                               best_path->param_info->ppi_clauses);

If inner-scan of the join under the append node has param_info, its qualifier
shall be implicitly attached to the scan node. So, if it is legal, I'd like to
have this approach because it is less invasive than enhancement of Hash node.

You mention copyObject() to make a duplicate of the underlying scan-path.
Actually, copyObject() support is not a minimum requirement, because all you
need to do here is a flat copy of the original path-node, then put param_info on it.
(Be careful to check that the original path is not parameterized.)

ParamPathInfo is declared as below:

typedef struct ParamPathInfo
{
    NodeTag     type;

    Relids      ppi_req_outer;  /* rels supplying parameters used by path */
    double      ppi_rows;       /* estimated number of result tuples */
    List       *ppi_clauses;    /* join clauses available from outer rels */
} ParamPathInfo;

You may need to set the additional filter on ppi_clauses, the number of rows
after filtering on ppi_rows, and NULL on ppi_req_outer.
However, I'm not 100% certain whether NULL is a legal value for ppi_req_outer.

If somebody can comment on this, it would be helpful.

* MultiExecHash() once puts the slot on outer_slot then moves it to inner_slot
This patch adds set_hash_references() to replace varno in the expression
of Hash->filterqual with OUTER_VAR. Why not INNER_VAR?
If the Var nodes were initialized to reference inner_slot, you wouldn't need
to re-assign the slot.

The node under the Hash node is connected as the OUTER node. This implementation
may come from the implementation of set_dummy_tlist_references(), commonly used
by Material, Sort, Unique, SetOp, and Hash.

And I faced a problem when I was implementing EXPLAIN for the additional filter.
I implemented it the same way as you mentioned above, and then an error occurred
when running EXPLAIN.
I think EXPLAIN expects an expression's varno to match the position that the
underlying node is connected to; i.e. if it is connected to OUTER, varno must
be OUTER_VAR.

Ah, OK. It is a trade-off matter, indeed.

It seems to me it is not a fair estimation, because inner_path_rows means
the number of rows remaining after filtering, but filter_qual shall be
applied to all the inner input rows.

OK. I'll fix it.

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
Sent: Tuesday, September 29, 2015 11:46 AM
To: Taiki Kondo
Cc: Akio Iwaasa; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
Sent: Thursday, September 24, 2015 8:06 PM
To: Kaigai Kouhei(海外 浩平)
Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

Hello, KaiGai-san.

Thank you for your comment, and sorry for the late response.

The attached patch is completely rewritten from the previous patch[1], per your
suggestion[2].
Please find it attached.

Thanks for your work, and let me introduce the purpose of the work briefly,
because the last submission was in August.

His work intends (1) to save resource consumption on table joins at this
moment, and (2) to provide infrastructure for one parallel join scenario
once the Funnel node becomes capable of it.

Even if we construct partition tables, we cannot utilize them to
filter out candidate rows of the join. As a result, the size of the Hash table
may grow more than necessary, and it causes unnecessary nBatch increases.

Below is the scenario this project tries to tackle. When a join takes a
partitioned table on one side, usually we need to scan the entire
partitioned table unless we can drop particular child tables.

XXXXJoin cond (x = y)
 -> Append
    -> SeqScan on tbl_child_0 ... CHECK (hash_func(x) % 4 = 0)
    -> SeqScan on tbl_child_1 ... CHECK (hash_func(x) % 4 = 1)
    -> SeqScan on tbl_child_2 ... CHECK (hash_func(x) % 4 = 2)
    -> SeqScan on tbl_child_3 ... CHECK (hash_func(x) % 4 = 3)
 -> Hash
    -> SeqScan on other_table

However, the CHECK() constraints assigned to the child tables give us a hint
about which rows on the other side can never be related to this join.
For example, all the rows in other_table to be joined with tbl_child_0
must satisfy hash_func(y) % 4 = 0. We may be able to omit
unrelated rows from the hash-table in this case, which eventually allows
us to reduce the size of the hash table.

In case of INNER_JOIN, we can rewrite the query execution plan as below.

Append
 -> HashJoin cond (x = y)
    -> SeqScan on tbl_child_0
    -> Hash ... Filter: hash_func(y) % 4 = 0
       -> SeqScan on other_table
 -> HashJoin cond (x = y)
    -> SeqScan on tbl_child_1
    -> Hash ... Filter: hash_func(y) % 4 = 1
       -> SeqScan on other_table
 -> HashJoin cond (x = y)
    -> SeqScan on tbl_child_2
    -> Hash ... Filter: hash_func(y) % 4 = 2
       -> SeqScan on other_table
 -> HashJoin cond (x = y)
    -> SeqScan on tbl_child_3
    -> Hash ... Filter: hash_func(y) % 4 = 3
       -> SeqScan on other_table

Unrelated rows are filtered out of the Hash table preliminarily, which allows
us to avoid hash table splits when its size reaches the work_mem limitation.

This join-pushdown is valuable for hash-join, and for merge-join if the MJ takes
an unsorted relation and the number of rows to be sorted is the performance factor.
Also, once Funnel becomes capable of running Append on background workers, it
will also be helpful for running NestLoop in parallel.

How about opinions from third parties? I'm a bit biased, of course.

OK, below are my brief comments on the patch.

* Suppose we focus on only HashJoin in the first version?
This patch also adds support for NestLoop and MergeJoin; however, NestLoop
has no valuable scenario without parallel execution capability, and the
most valuable scenario for MergeJoin is the reduction of rows prior to the Sort.
Once the input rows get sorted, it is less attractive to filter out rows.

* MultiExecHash() once puts the slot on outer_slot then moves it to inner_slot
This patch adds set_hash_references() to replace varno in the expression
of Hash->filterqual with OUTER_VAR. Why not INNER_VAR?
If the Var nodes were initialized to reference inner_slot, you wouldn't need
to re-assign the slot.

I'll try to have a deeper investigation later.

This patch contains the following implementation, but I can't determine whether
it is correct or wrong.

1. Cost estimation
In this patch, an additional row filter is implemented for Hash join, Merge join
and Nested Loop.
I implemented the cost estimation for this filter by looking at other parts
that handle filters, but I am not sure this implementation is correct.

@@ -2835,6 +2864,8 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
      * not all of the quals may get evaluated at each tuple.)
      */
     startup_cost += qp_qual_cost.startup;
+    startup_cost += filter_qual_cost.startup +
+        filter_qual_cost.per_tuple * inner_path_rows;
     cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
     run_cost += cpu_per_tuple * hashjointuples;

It seems to me it is not a fair estimation, because inner_path_rows means
the number of rows remaining after filtering, but filter_qual shall be
applied to all the inner input rows.

2. Workaround for a failing assertion in allpaths.c
In standard_join_search(), we expect to have a single rel at the final level.
But this expectation is disappointed by the join pushdown feature, because it
will search combinations not searched by the original standard_join_search()
at the final level. Therefore, once join pushdown succeeds, the assertion
in allpaths.c fails.

So I implemented a workaround by temporarily setting root->join_rel_level to
NULL while trying join pushdown, but I think this implementation may be wrong.

It was my off-list suggestion. standard_join_search expects the root of
the partition tables to appear, but child tables are out of scope.
Once we try to push the join down under the Append, we need to consider
joins between the inner table and every outer child table; however,
this should not be visible to the standard_join_search context.
From the standpoint of standard_join_search, it gets an AppendPath that
represents a join of tables A and B, even if A contains 100 children and
the join was pushed down on behalf of the AppendPath.
So, setting NULL on root->join_rel_level is a reasonable way to avoid
unexpected RelOptInfo additions by build_join_rel().
"To avoid the assertion" is one fact; however, the intention of the code is
to avoid polluting the global data structure. ;-)

3. Searching pathkeys for Merge Join
When the join pushdown feature chooses merge join for the pushed-down join
operation, the planner fails to create the merge join node because it is unable
to find pathkeys for this merge join. I found this is caused by skipping child
tables when finding pathkeys.

I expect that this is for making the planner faster, so I implemented it such
that the planner doesn't skip child tables when finding pathkeys for merge join.
But I am not sure this expectation is correct.

I'd like to recommend omitting MergeJoin support in the first version.

Thanks,

Any comments/suggestions are welcome.

Remarks :
[1] /messages/by-id/12A9442FBAE80D4E8953883E0B84E0885C01FD@BPXM01GP.gisp.nec.co.jp
[2] /messages/by-id/9A28C8860F777E439AA12E8AEA7694F8011345B6@BPXM15GP.gisp.nec.co.jp

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

-----Original Message-----
From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com]
Sent: Tuesday, August 18, 2015 5:47 PM
To: Kondo Taiki(近藤 太樹); pgsql-hackers@postgresql.org
Cc: Iwaasa Akio(岩浅 晃郎)
Subject: RE: [Proposal] Table partition + join pushdown

Hello Kondo-san,

I briefly checked your patch. Let me put some comments about its design and
implementation, even though I have no objections to its concept. :-)

* Construction of RelOptInfo

In your patch, try_hashjoin_pushdown(), called by try_hashjoin_path(),
constructs the RelOptInfo of the join-rel between the inner-rel and a subpath
of the Append node. It is an entirely wrong implementation.

I can understand that we (may) have no RelOptInfo for the joinrel between
tbl_child_0 and other_table when the planner investigates a joinpath to join
the Append path with other_table.

HashJoin
 -> Append
    -> SeqScan on tbl_child_0
    -> SeqScan on tbl_child_1
    -> SeqScan on tbl_child_2
    -> SeqScan on tbl_child_3
 -> Hash
    -> SeqScan on other_table

How about these alternatives?

- Call make_join_rel() on the pair of tbl_child_X and other_table
  from try_hashjoin_pushdown() or somewhere. make_join_rel() internally
  constructs a RelOptInfo for the supplied pair of relations, so the
  relevant RelOptInfo shall be properly constructed.
- make_join_rel() also calls add_paths_to_joinrel() for all the
  join logics, so it makes it easier to support pushing down other join
  logic, including nested-loop or custom-join.
- It may be an idea to add an extra argument to make_join_rel() to
  inform it of expressions to be applied for tuple filtering during
  construction of the inner hash table.

* Why is only SeqScan supported?

I think it is the role of the Hash node to filter out inner tuples obviously
unrelated to the join (if the CHECK constraints of the outer relation give that
information), because this join-pushdown may be able to support multi-stacked
pushdown.

For example, what if the planner considers a path to join this Append-path with
another relation, and the join clause contains a reference to X?

Append
 -> HashJoin
    -> SeqScan on tbl_child_0
    -> Hash ... Filter: hash_func(X) % 4 = 0
       -> SeqScan on other_table
 -> HashJoin
    -> SeqScan on tbl_child_1
    -> Hash ... Filter: hash_func(X) % 4 = 1
       -> SeqScan on other_table
 -> HashJoin
    -> SeqScan on tbl_child_2
    -> Hash ... Filter: hash_func(X) % 4 = 2
       -> SeqScan on other_table
 -> HashJoin
    -> SeqScan on tbl_child_3
    -> Hash ... Filter: hash_func(X) % 4 = 3
       -> SeqScan on other_table

It may be a good challenge to consider additional join pushdown, even if subpaths
of Append are HashJoin, not SeqScan, like:

Append
 -> HashJoin
    -> HashJoin
       -> SeqScan on tbl_child_0
       -> Hash ... Filter: hash_func(X) % 4 = 0
          -> SeqScan on other_table
    -> Hash ... Filter: hash_func(X) % 4 = 0
       -> SeqScan on another_table
 -> HashJoin
    -> HashJoin
       -> SeqScan on tbl_child_1
       -> Hash ... Filter: hash_func(X) % 4 = 1
          -> SeqScan on other_table
    -> Hash ... Filter: hash_func(X) % 4 = 1
       -> SeqScan on another_table
 -> HashJoin
    -> HashJoin
       -> SeqScan on tbl_child_2
       -> Hash ... Filter: hash_func(X) % 4 = 2
          -> SeqScan on other_table
    -> Hash ... Filter: hash_func(X) % 4 = 2
       -> SeqScan on another_table
 -> HashJoin
    -> HashJoin
       -> SeqScan on tbl_child_3
       -> Hash ... Filter: hash_func(X) % 4 = 3
          -> SeqScan on other_table
    -> Hash ... Filter: hash_func(X) % 4 = 3
       -> SeqScan on another_table

In this case, the underlying nodes are not always SeqScan. So, only the
Hash node can have the filter clauses.

* Way to handle expression nodes

All this patch supports is CHECK() constraints with an equality operation on
the INT4 data type. You can learn about various useful pieces of PostgreSQL
infrastructure. For example, ...
- expression_tree_mutator() is useful to make a copy of expression
node with small modification
- pull_varnos() is useful to check which relations are referenced
by the expression node.
- RestrictInfo->can_join is useful to check whether the clause is
binary operator, or not.

Anyway, reuse of the existing infrastructure is the best way to build a reliable
feature and to keep the implementation simple.

Thanks,
--
NEC Business Creation Division / PG-Strom Project KaiGai Kohei
<kaigai@ak.jp.nec.com>

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
Sent: Thursday, August 13, 2015 6:30 PM
To: pgsql-hackers@postgresql.org
Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎)
Subject: [HACKERS] [Proposal] Table partition + join pushdown



--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#7Taiki Kondo
tai-kondo@yk.jp.nec.com
In reply to: Kouhei Kaigai (#6)
Re: [Proposal] Table partition + join pushdown

Hello, KaiGai-san.

Thank you for your comment, and sorry for the late response.

If inner-scan of the join under the append node has param_info, its qualifier shall be implicitly attached to the scan node. So, if it is legal, I'd like to have this approach because it is less invasive than enhancement of Hash node.

You mention about copyObject() to make a duplication of underlying scan-path.
Actually, copyObject() support is not minimum requirement, because all you need to do here is flat copy of the original path-node, then put param_info.
(Be careful to check that the original path is not parameterized.)

OK. I'll try an implementation using the method you mentioned.

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

-----Original Message-----
From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com]
Sent: Wednesday, September 30, 2015 11:19 PM
To: Kondo Taiki(近藤 太樹)
Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] [Proposal] Table partition + join pushdown

* Suppose we focus on only HashJoin in the first version?
This patch also add support on NestLoop and MergeJoin, however,
NestLoop has no valuable scenario without parallel execution
capability, and the most valuable scenario on MergeJoin is reduction of rows prior to Sort.
Once input rows gets sorted, it is less attractive to filter out rows.

I agree that handling for NestLoop doesn't make sense in this timing.
But I think that handling for MergeJoin still makes sense in this timing.

In my v1 patch, I implemented that the additional filter is used for
qualification at same place as join filter, same as NestLoop.
It is not useful indeed. I agree with you at this point.

I think, and you also mentioned, large factor of cost estimation for
MergeJoin is Sort under MergeJoin, so I believe additional filtering
at Sort is a good choice for this situation, as same way at Hash under
HashJoin.

Furthermore, I think it is better way that the additional filtering
shall be "added" to Scan node under each child (pushed-down) Join
nodes, because we don't have to implement additional qualification at
Join nodes and we only have to implement simply concatenating original
and additional RestrictInfos for filtering.

As a mere idea, for realizing this way, I think we have to implement
copyObject() for Scan path nodes and use ppi_clauses for this usage.

What is your opinion?

You are saying this part at create_scan_plan(), aren't you.

/*
* If this is a parameterized scan, we also need to enforce all the join
* clauses available from the outer relation(s).
*
* For paranoia's sake, don't modify the stored baserestrictinfo list.
*/
if (best_path->param_info)
scan_clauses = list_concat(list_copy(scan_clauses),
best_path->param_info->ppi_clauses);

If inner-scan of the join under the append node has param_info, its qualifier shall be implicitly attached to the scan node. So, if it is legal, I'd like to have this approach because it is less invasive than enhancement of Hash node.

You mention about copyObject() to make a duplication of underlying scan-path.
Actually, copyObject() support is not minimum requirement, because all you need to do here is flat copy of the original path-node, then put param_info.
(Be careful to check whether the original path is not parametalized.)

ParamPathInfo is declared as below:

typedef struct ParamPathInfo
{
NodeTag type;

Relids ppi_req_outer; /* rels supplying parameters used by path */
double ppi_rows; /* estimated number of result tuples */
List *ppi_clauses; /* join clauses available from outer rels */
} ParamPathInfo;

You may need to set the additional filter on ppi_clauses, number of rows after the filtering on ppi_rows and NULL on ppi_req_outer.
However, I'm not 100% certain whether NULL is legal value on ppi_req_outer.

If somebody can comment on, it is helpful.

* MultiExecHash() once put slot on outer_slot then move it to
inner_slot This patch add set_hash_references() to replace varno in
the expression of Hash->filterqual to OUTER_VAR. Why not INNER_VAR?
If Var nodes would be initialized t oreference inner_slot, you don't
need to re-assign slot.

The node under Hash node is connected as the OUTER node. This
implementation may be from implementation of
set_dummy_tlist_references() commonly used by Material, Sort, Unique,
SetOp, and Hash.

And I was faced a problem when I was implementing EXPLAIN for the additional filter.
I implemented same as you mentioned above, then error occurred in running EXPLAIN.
I think EXPLAIN expects expression's varno is same as the position
that the under node is connected to; i.e. if it is connected to OUTER,
varno must be OUTER_VAR.

Ah, OK. It is a trade-off matter, indeed.

It seems to me it is not a fair estimation because inner_path_rows
means number of rows already filtered out, but filter_qual shall be
applied to all the inner input rows.

OK. I'll fix it.

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
Sent: Tuesday, September 29, 2015 11:46 AM
To: Taiki Kondo
Cc: Akio Iwaasa; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
Sent: Thursday, September 24, 2015 8:06 PM
To: Kaigai Kouhei(海外 浩平)
Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

Hello, KaiGai-san.

Thank you for your comment, and sorry for late response.

The attached patch is completely rewritten from previous patch[1],
at your suggestion[2].
Please find attached.

Thanks for your work, and let me introduce purpose of the work
briefly, because the last submission was August.

His work intends (1) to save resource consumption on tables join at
this moment, and (2) to provide an infrastructure of one parallel join
scenario once Funnel node gets capable.

Even if we construct partition tables, it is unavailable to utilize to
filter out candidate rows of join. In the result, size of Hash table
may grow more than necessity and it causes unnecessary nBatch increase.

Below is the scenario this project tries to tackle. In case when
tables join takes partitioned table on one side, usually, we once need
to run entire partitioned table unless we cannot drop particular child tables.

XXXXJoin cond (x = y)
-> Append
-> SeqScan on tbl_child_0 ... CHECK (hash_func(x) % 4 = 0)
-> SeqScan on tbl_child_1 ... CHECK (hash_func(x) % 4 = 1)
-> SeqScan on tbl_child_2 ... CHECK (hash_func(x) % 4 = 2)
-> SeqScan on tbl_child_3 ... CHECK (hash_func(x) % 4 = 3)
-> Hash
-> SeqScan on other_table

However, CHECK() constraint assigned on child tables give us hint
which rows in other side are never related to this join.
For example, all the rows in other_table to be joined with tbl_child_0
should have multiple number of 4 on hash_func(y). We may be able to
omit unrelated rows from the hash-table in this case, then it
eventually allows to reduce the size of hash table.

In case of INNER_JOIN, we can rewrite the query execution plan as below.

Append
-> HashJoin cond (x = y)
-> SeqScan on tbl_child_0
-> Hash ... Filter: hash_func(y) % 4 = 0
-> SeqScan on other_table
-> HashJoin cond (x = y)
-> SeqScan on tbl_child_1
-> Hash ... Filter: hash_func(y) % 4 = 1
-> SeqScan on other_table
-> HashJoin cond (x = y)
-> SeqScan on tbl_child_2
-> Hash ... Filter: hash_func(y) % 4 = 2
-> SeqScan on other_table
-> HashJoin cond (x = y)
-> SeqScan on tbl_child_3
-> Hash ... Filter: hash_func(y) % 4 = 3
-> SeqScan on other_table

Unrelated rows of Hash table is preliminarily, it allows to avoid hash
table split when its size reaches to work_mem limitation.

This join-pushdown is valuable on hash-join and merge-join if MJ takes
unsorted relation and number of rows to be sorted is performance factor.
Also, once Funnel gets capable to run Append on background worker, it
is also helpful to run NestLoop in parallel.

How about the opinion from third parties? I'm a bit biased, of course.

OK, below is the brief comment to patch.

* Suppose we focus on only HashJoin in the first version?
This patch also add support on NestLoop and MergeJoin, however,
NestLoop has no valuable scenario without parallel execution
capability, and the most valuable scenario on MergeJoin is reduction of rows prior to Sort.
Once input rows gets sorted, it is less attractive to filter out rows.

* MultiExecHash() once put slot on outer_slot then move it to
inner_slot This patch add set_hash_references() to replace varno in
the expression of Hash->filterqual to OUTER_VAR. Why not INNER_VAR?
If Var nodes would be initialized t oreference inner_slot, you don't
need to re-assign slot.

I'll try to have deeper investigation, later.

This patch contains following implementation, but I can't determine
this is correct or wrong.

1. Cost estimation
In this patch, additional row filter is implemented for Hash, Merge
join and

Nested

Loop.
I implemented cost estimation feature for this filter by watching
other parts of filters, but I am not sure this implementation is
correct.

@@ -2835,6 +2864,8 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
* not all of the quals may get evaluated at each tuple.)
*/
startup_cost += qp_qual_cost.startup;
+   startup_cost += filter_qual_cost.startup +
+           filter_qual_cost.per_tuple * inner_path_rows;
cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
run_cost += cpu_per_tuple * hashjointuples;

It seems to me it is not a fair estimation because inner_path_rows
means number of rows already filtered out, but filter_qual shall be
applied to all the inner input rows.

2. Workaround for failing assertion at allpaths.c In
standard_join_search(), we expect to have a single rel at the final level.
But this expectation is disappointed by join pushdown feature,
because this

will

search for the combinations not searched by original
standard_join_serch() at the final level. Therefore, once trying
join pushdown is succeeded, failing assertion occurs in allpaths.c.

So I implemented workaround by temporary set NULL to
root->join_rel_level while trying join pushdown, but I think this implementation may be wrong.

This was my off-list suggestion. standard_join_search() expects the root of the partitioned tables to appear, but child tables are out of its scope. Once we try to push the join down under the Append, we need to consider joins between the inner table and every outer child table; however, these should not be visible to the standard_join_search() context.
From the standpoint of standard_join_search(), it gets an AppendPath that represents a join of tables A and B, even if A contains 100 children and the join was pushed down on behalf of the AppendPath.
So setting root->join_rel_level to NULL is a reasonable way to avoid unexpected RelOptInfo additions by build_join_rel().
"To avoid the assertion" is one fact; however, the intention of the code is to avoid polluting the global data structure. ;-)

3. Searching pathkeys for Merge Join

When the join pushdown feature chooses a merge join for the pushed-down join, the planner fails to create the merge join node because it is unable to find pathkeys for it. I found this is caused by skipping child tables when finding pathkeys.

I assume that skipping is done to make the planner faster, so I changed the planner not to skip child tables when finding pathkeys for a merge join. But I am not sure this assumption is correct.

I would recommend omitting MergeJoin support in the first version.

Thanks,

Any comments/suggestions are welcome.

Remarks :
[1] /messages/by-id/12A9442FBAE80D4E8953883E0B84E0885C01FD@BPXM01GP.gisp.nec.co.jp
[2] /messages/by-id/9A28C8860F777E439AA12E8AEA7694F8011345B6@BPXM15GP.gisp.nec.co.jp

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

-----Original Message-----
From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com]
Sent: Tuesday, August 18, 2015 5:47 PM
To: Kondo Taiki(近藤 太樹); pgsql-hackers@postgresql.org
Cc: Iwaasa Akio(岩浅 晃郎)
Subject: RE: [Proposal] Table partition + join pushdown

Hello Kondo-san,

I briefly checked your patch. Let me put some comments about its
design and implementation, even though I have no arguments towards
its concept. :-)

* Construction of RelOptInfo

In your patch, try_hashjoin_pushdown(), called by try_hashjoin_path(), constructs a RelOptInfo for the join-rel between the inner-rel and a subpath of the Append node. This is an entirely wrong implementation.

I can understand we (may) have no RelOptInfo for the joinrel between tbl_child_0 and other_table when the planner investigates a joinpath to join the Append path with other_table.

HashJoin
  -> Append
    -> SeqScan on tbl_child_0
    -> SeqScan on tbl_child_1
    -> SeqScan on tbl_child_2
    -> SeqScan on tbl_child_3
  -> Hash
    -> SeqScan on other_table

How about these alternatives?

- Call make_join_rel() on the pair of tbl_child_X and other_table from
  try_hashjoin_pushdown() or somewhere nearby. make_join_rel() internally
  constructs a RelOptInfo for the supplied pair of relations, so the
  relevant RelOptInfo is properly constructed.
- make_join_rel() also calls add_paths_to_joinrel() for all the join
  logics, which makes it easier to push down other join logic as well,
  including nested-loop or custom joins.
- It may be an idea to add an extra argument to make_join_rel() to pass
  the expressions to be applied for tuple filtering during construction
  of the inner hash table.

* Why only SeqScan is supported

I think it is the role of the Hash node to filter out inner tuples obviously unrelated to the join (when the CHECK constraint of the outer relation gives that information), because this join pushdown may be able to support multi-stacked pushdown.

For example, what if the planner considers a path to join this Append path with yet another relation, and the join clause contains a reference to X?

Append
  -> HashJoin
    -> SeqScan on tbl_child_0
    -> Hash ... Filter: hash_func(X) % 4 = 0
      -> SeqScan on other_table
  -> HashJoin
    -> SeqScan on tbl_child_1
    -> Hash ... Filter: hash_func(X) % 4 = 1
      -> SeqScan on other_table
  -> HashJoin
    -> SeqScan on tbl_child_2
    -> Hash ... Filter: hash_func(X) % 4 = 2
      -> SeqScan on other_table
  -> HashJoin
    -> SeqScan on tbl_child_3
    -> Hash ... Filter: hash_func(X) % 4 = 3
      -> SeqScan on other_table

It may be a good challenge to consider further join pushdown even when the subpaths of the Append are HashJoins rather than SeqScans, like:

Append
  -> HashJoin
    -> HashJoin
      -> SeqScan on tbl_child_0
      -> Hash ... Filter: hash_func(X) % 4 = 0
        -> SeqScan on other_table
    -> Hash ... Filter: hash_func(X) % 4 = 0
      -> SeqScan on another_table
  -> HashJoin
    -> HashJoin
      -> SeqScan on tbl_child_1
      -> Hash ... Filter: hash_func(X) % 4 = 1
        -> SeqScan on other_table
    -> Hash ... Filter: hash_func(X) % 4 = 1
      -> SeqScan on another_table
  -> HashJoin
    -> HashJoin
      -> SeqScan on tbl_child_2
      -> Hash ... Filter: hash_func(X) % 4 = 2
        -> SeqScan on other_table
    -> Hash ... Filter: hash_func(X) % 4 = 2
      -> SeqScan on another_table
  -> HashJoin
    -> HashJoin
      -> SeqScan on tbl_child_3
      -> Hash ... Filter: hash_func(X) % 4 = 3
        -> SeqScan on other_table
    -> Hash ... Filter: hash_func(X) % 4 = 3
      -> SeqScan on another_table

In this case, the underlying nodes are not always SeqScans, so only the Hash node can carry the filter clauses.
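The losslessness of these Hash-side filters can be sketched in a few lines of standalone C (hash_func() below is our own stand-in, not the actual function from the CHECK constraints): every inner tuple satisfies exactly one of the four partition predicates, so distributing other_table over four filtered hash tables never drops a join candidate.

```c
#include <assert.h>

/* Stand-in for the hash_func() appearing in the CHECK constraints. */
unsigned int hash_func(int x)
{
    return (unsigned int) x * 2654435761u;  /* Knuth multiplicative hash */
}

/* Filter applied while loading the inner hash table paired with
 * tbl_child_k: a tuple whose join key X fails this predicate can never
 * match a row of tbl_child_k, whose CHECK constraint fixes
 * hash_func(id) % 4 == k. */
int passes_partition_filter(int x, int k)
{
    return hash_func(x) % 4 == (unsigned int) k;
}

/* Number of per-child hash tables that accept a given join key. */
int partitions_accepting(int x)
{
    int k, n = 0;

    for (k = 0; k < 4; k++)
        n += passes_partition_filter(x, k);
    return n;
}
```

Since partitions_accepting() is always 1, each inner tuple is loaded into exactly one of the four hash tables, and the union of the filtered joins equals the original join.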

* Way to handle expression nodes

All this patch supports is CHECK() constraints with an equality operation on the INT4 data type. You can learn about various useful pieces of PostgreSQL infrastructure. For example, ...
- expression_tree_mutator() is useful to make a copy of an expression
  node with small modifications
- pull_varnos() is useful to check which relations are referenced by an
  expression node.
- RestrictInfo->can_join is useful to check whether the clause is a
  binary operator or not.

Anyway, reusing existing infrastructure is the best way to build a reliable feature and to keep the implementation simple.

Thanks,
--
NEC Business Creation Division / PG-Strom Project KaiGai Kohei
<kaigai@ak.jp.nec.com>

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki
Kondo
Sent: Thursday, August 13, 2015 6:30 PM
To: pgsql-hackers@postgresql.org
Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎)
Subject: [HACKERS] [Proposal] Table partition + join pushdown

Hi all,

I saw the email about the idea from KaiGai-san[1], and I worked to
implement this idea.

Now, I have implemented a part of this idea, so I want to propose
this feature.

The attached patch just shows my concept of this feature.
It works fine for EXPLAIN, but it returns wrong results for other operations, sadly.

Table partition + join pushdown
===============================

Motivation
----------
To make the join logic work more effectively, it is important to make the size of the relations smaller.

Especially in Hash join, it is meaningful to make the inner relation smaller, because a smaller inner relation can be stored within a smaller hash table. This means that memory usage can be reduced when joining with big tables.

Design
------
It was mentioned by the email from KaiGai-san.
So I quote below here...

---- begin quotation ---
Let's assume a table which is partitioned into four portions, where the individual child relations have a constraint on the hash value of their ID field.

tbl_parent
  + tbl_child_0 ... CHECK(hash_func(id) % 4 = 0)
  + tbl_child_1 ... CHECK(hash_func(id) % 4 = 1)
  + tbl_child_2 ... CHECK(hash_func(id) % 4 = 2)
  + tbl_child_3 ... CHECK(hash_func(id) % 4 = 3)

If someone tries to join another relation with tbl_parent using an equivalence condition, like X = tbl_parent.ID, we know that inner tuples which do not satisfy the condition
hash_func(X) % 4 = 0
can never be joined to the tuples in tbl_child_0.
So, we can omit loading these tuples into the inner hash table beforehand, which potentially allows splitting the inner hash table.

Current typical plan structure is below:

HashJoin
  -> Append
    -> SeqScan on tbl_child_0
    -> SeqScan on tbl_child_1
    -> SeqScan on tbl_child_2
    -> SeqScan on tbl_child_3
  -> Hash
    -> SeqScan on other_table

It may be rewritable to:

Append
  -> HashJoin
    -> SeqScan on tbl_child_0
    -> Hash ... Filter: hash_func(X) % 4 = 0
      -> SeqScan on other_table
  -> HashJoin
    -> SeqScan on tbl_child_1
    -> Hash ... Filter: hash_func(X) % 4 = 1
      -> SeqScan on other_table
  -> HashJoin
    -> SeqScan on tbl_child_2
    -> Hash ... Filter: hash_func(X) % 4 = 2
      -> SeqScan on other_table
  -> HashJoin
    -> SeqScan on tbl_child_3
    -> Hash ... Filter: hash_func(X) % 4 = 3
      -> SeqScan on other_table
---- end quotation ---

In the quotation above, the filter was placed at the Hash node, but I implemented it so that the filter is set at the SeqScan node under the Hash node. In my opinion, filtering tuples is the scanner's job.

Append
  -> HashJoin
    -> SeqScan on tbl_child_0
    -> Hash
      -> SeqScan on other_table ... Filter: hash_func(X) % 4 = 0
  -> HashJoin
    -> SeqScan on tbl_child_1
    -> Hash
      -> SeqScan on other_table ... Filter: hash_func(X) % 4 = 1
  -> HashJoin
    -> SeqScan on tbl_child_2
    -> Hash
      -> SeqScan on other_table ... Filter: hash_func(X) % 4 = 2
  -> HashJoin
    -> SeqScan on tbl_child_3
    -> Hash
      -> SeqScan on other_table ... Filter: hash_func(X) % 4 = 3

API
---
There are 3 new internal (static) functions implementing this feature.
try_hashjoin_pushdown(), the main function of this feature, is called from try_hashjoin_path() and tries to push the HashPath down under the AppendPath.

To do so, this function performs the following operations.

1. Check whether this Hash join can be pushed down under the AppendPath.
2. To avoid influencing other path-making operations, copy the inner
   path's RelOptInfo and make a new SeqScan path from it. Here, get the
   CHECK() constraints from the OUTER path and convert their Var nodes
   according to the join condition. Also convert the Var nodes in the
   join condition itself.
3. Create new HashPath nodes between each subpath of the AppendPath and
   the inner path made above.
4. When operations 1 to 3 are done for each subpath, create a new
   AppendPath whose subpaths are the HashPath nodes made above.

get_replaced_clause_constr() is called from try_hashjoin_pushdown(), and get_var_nodes_recurse() is called from get_replaced_clause_constr().
These 2 functions help with the operations above.
(I may revise this part to use expression_tree_walker() and
expression_tree_mutator().)

The attached patch has the following limitations.
o It only works for hash-join operations.
  (I want to support not only hash join but also other join logic.)
o Join conditions must be the "=" operator on int4 variables.
o The inner path must be a SeqScan.
  (I want to support other path nodes.)
o For now, the planner may not choose this plan, because the estimated
  costs are usually larger than those of the original (non-pushdown) plan.

Also, 1 internal (static) function, get_relation_constraints() defined in plancat.c, is changed to global. This function will be called from get_replaced_clause_constr() to get the CHECK() constraints.

Usage
-----
To use this feature, create partitioned tables and a small table to join, and run a SELECT that joins these tables.

For your convenience, I attach DDL and DML script.
And I also attach the result of EXPLAIN.

Any comments are welcome. But, first of all, I need your advice on correcting this patch's behavior.

At the least, I think it has to expand the array of RangeTblEntry and other arrays defined in PlannerInfo to register the new RelOptInfos for the new Path nodes mentioned above.

Or is it a better choice to modify the query parser to implement this feature further?

Remarks :
[1] /messages/by-id/9A28C8860F777E439AA12E8AEA7694F8010F672B@BPXM15GP.gisp.nec.co.jp

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To
make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


--
NEC Business Creation Division / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com>


#8Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Taiki Kondo (#7)
Re: [Proposal] Table partition + join pushdown

Hello.

I tried to read this and had some random comments on this.

-- general

I got some warnings on compilation about unused variables and a wrong argument type.

I could not come up with a query that this patch works on. Could you give me some specific example for this patch?

This patch needs more comments. Please put in comments about not only what it does but also the reasons and other background for it.

-- about namings

Names for functions and variables need to be more appropriate; in other words, they should properly convey what they are. The following are examples of such names.

"added_restrictlist", widely distributed as many function arguments and in JoinPathExtraData, makes me feel dizzy. create_mergejoin_path takes it as "filtering_clauses", which looks far better.

try_join_pushdown() is also a name with a much wider meaning. This patch tries to move hashjoins on an inheritance parent to beneath the append paths. That could generically be called 'pushdown', but it would better be called something like 'transform appended hashjoin' or 'hashjoin distribution'. The latter would be better.
(The function name would be try_distribute_hashjoin in that case.)

The name make_restrictinfos_from_check_constr() also tells me the wrong thing. For example, extract_constraints_for_hashjoin_distribution() would inform me about what it returns.

-- about what make_restrictinfos_from_check_constr() does

In make_restrictinfos_from_check_constr(), the function returns modified constraint predicates corresponding to vars under hashjoinable join clauses. I don't think expression_tree_mutator() is suitable for that, since it could allow unwanted results when the constraint predicates or join clauses are not simple OpExprs.

Could you try a simpler and more straightforward way to do that?
Otherwise, could you give me a clear explanation of what it does?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#9Taiki Kondo
tai-kondo@yk.jp.nec.com
In reply to: Kyotaro HORIGUCHI (#8)
1 attachment(s)
Re: [Proposal] Table partition + join pushdown

Hello, Horiguchi-san.

Thank you for your comment.

I got some warning on compilation on unused variables and wrong
arguemtn type.

OK, I'll fix it.

I failed to have a query that this patch works on. Could you let
me have some specific example for this patch?

Please find attached.
And also make sure that setting of work_mem is '64kB' (not 64MB).

If work_mem is large enough to create the hash table for the relation after appending, its cost may be better than the pushed-down plan's cost, and then the planner will not choose the pushed-down plan this patch makes.
So, to see this patch working, work_mem must be smaller than the hash table size for the relation after appending.

This patch needs more comments. Please put comment about not only
what it does but also the reason and other things for it.

OK, I'll put more comments in the code.
But it will take a long time, maybe...

-- about namings

Names for functions and variables are needed to be more
appropriate, in other words, needed to be what properly informs
what they are. The followings are the examples of such names.

Thank you for your suggestion.

I also think these names are not good enough.
I'll try to make the names better, but it may take a long time...
Of course, I will use your suggestions as a reference.

"added_restrictlist"'s widely distributed as many function
arguemnts and JoinPathExtraData makes me feel
dizzy..

"added_restrictinfo" will be deleted from almost all functions other than try_join_pushdown() in the next (v2) patch, because the place where filtering uses this info will be moved from the Join node to the Scan node, so it no longer has to be passed anywhere other than try_join_pushdown().

In make_restrictinfos_from_check_constr, the function returns
modified constraint predicates correspond to vars under
hashjoinable join clauses. I don't think expression_tree_mutator
is suitable to do that since it could allow unwanted result when
constraint predicates or join clauses are not simple OpExpr's.

Do you have any example of this situation?
I am trying to find the unwanted results you mentioned, but I have not found any so far. I have a hunch that it will allow unwanted results, because I have only thought about very simple situations for this function.

Otherwise could you give me clear explanation on what it does?

This function transfers a CHECK() constraint into a filter expression by the following procedure.
(1) Get the outer table's CHECK() constraints by using get_relation_constraints().
(2) Walk through the expression tree obtained in (1) by using expression_tree_mutator()
    with check_constraint_mutator(), and change only the outer's Var nodes into the
    inner's ones according to the join clause.

For example, when the CHECK() constraint of table A is "num % 4 = 0" and
the join clause between tables A and B is "A.num = B.data",
then we get "B.data % 4 = 0" for filtering purposes.

This also accepts a more complex join clause like "A.num = B.data * 2",
in which case we get "(B.data * 2) % 4 = 0".

In procedure (2), to decide whether each join clause should be used for changing Var nodes, I implemented check_constraint_mutator() to judge whether the join clause is hash-joinable or not.

Actually, I want to judge whether the OpExpr at the top of the join clause's expression tree means "=" or not, but I cannot find out how to do it.

If you know how, please let me know.
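A toy, self-contained model of this Var substitution (our own miniature expression type, far simpler than PostgreSQL's real node trees, and not the patch's actual code): applying the join clause A.num = B.data to CHECK (num % 4 = 0) yields the filter B.data % 4 = 0.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal expression tree: enough for "num % 4 = 0". */
typedef enum { T_VAR, T_CONST, T_OP } NodeTag;

typedef struct Expr
{
    NodeTag      tag;
    char         op;      /* '%' or '=' for T_OP */
    int          varno;   /* which table a T_VAR refers to */
    int          value;   /* for T_CONST */
    struct Expr *l, *r;   /* operands for T_OP */
} Expr;

static Expr *
mk(NodeTag tag, char op, int varno, int value, Expr *l, Expr *r)
{
    Expr *e = malloc(sizeof(Expr));

    e->tag = tag; e->op = op; e->varno = varno; e->value = value;
    e->l = l; e->r = r;
    return e;
}

/* In the spirit of check_constraint_mutator(): copy the constraint
 * tree, replacing references to the outer var (A.num) with the inner
 * side of the join clause (here simply the Var B.data). */
static Expr *
replace_outer_var(Expr *node, int outer_varno, Expr *inner)
{
    if (node == NULL)
        return NULL;
    if (node->tag == T_VAR && node->varno == outer_varno)
        return inner;
    if (node->tag == T_OP)
        return mk(T_OP, node->op, 0, 0,
                  replace_outer_var(node->l, outer_varno, inner),
                  replace_outer_var(node->r, outer_varno, inner));
    return node;
}

static int
eval(Expr *e, int inner_value)
{
    if (e->tag == T_VAR)
        return inner_value;
    if (e->tag == T_CONST)
        return e->value;
    if (e->op == '%')
        return eval(e->l, inner_value) % eval(e->r, inner_value);
    if (e->op == '=')
        return eval(e->l, inner_value) == eval(e->r, inner_value);
    return 0;
}

/* Build CHECK (A.num % 4 = 0), substitute B.data for A.num via the
 * join clause A.num = B.data, and evaluate the resulting filter for
 * one B.data value.  (Leaks memory; it is only a sketch.) */
int
check_after_substitution(int b_data)
{
    Expr *check = mk(T_OP, '=', 0, 0,
                     mk(T_OP, '%', 0, 0,
                        mk(T_VAR, 0, 1, 0, NULL, NULL),   /* A.num  */
                        mk(T_CONST, 0, 0, 4, NULL, NULL)),
                     mk(T_CONST, 0, 0, 0, NULL, NULL));
    Expr *b_data_var = mk(T_VAR, 0, 2, 0, NULL, NULL);    /* B.data */
    Expr *filter = replace_outer_var(check, 1, b_data_var);

    return eval(filter, b_data);
}
```

A clause like "A.num = B.data * 2" works the same way by substituting the whole inner expression; but note that when the outer side of the join clause is not a bare Var, this simple substitution no longer applies.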

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.


Attachments:

pushdown_test.v1.sql (application/octet-stream)
#10Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Taiki Kondo (#9)
Re: [Proposal] Table partition + join pushdown

Hello, thank you for the example.

I could see this patch working with it.

In make_restrictinfos_from_check_constr, the function returns
modified constraint predicates correspond to vars under
hashjoinable join clauses. I don't think expression_tree_mutator
is suitable to do that since it could allow unwanted result when
constraint predicates or join clauses are not simple OpExpr's.

Do you have any example of this situation?

As a rather simple case on the test environment made by the provided script, the following query

explain analyze
select data_x, data_y, num from check_test_div join inner_t on check_test_div.id + 1 = inner_t.id;

makes the mutation fail and then results in an assertion failure.

| TRAP: FailedAssertion("!(list_length(check_constr) == list_length(result))", File: "joinpath.c", Line: 1608)

This is because neither 'check_test_div.id + 1' nor inner_t.id matches the var side of the constraints.

I don't see clearly what to do about this situation for now, but this is one of the most important functions for this feature and should be cleanly designed.

At Thu, 8 Oct 2015 08:28:04 +0000, Taiki Kondo <tai-kondo@yk.jp.nec.com> wrote in <12A9442FBAE80D4E8953883E0B84E0885F9913@BPXM01GP.gisp.nec.co.jp>

Hello, Horiguchi-san.

Thank you for your comment.

I got some warning on compilation on unused variables and wrong
arguemtn type.

OK, I'll fix it.

I failed to have a query that this patch works on. Could you let
me have some specific example for this patch?

Please find attached.
And also make sure that setting of work_mem is '64kB' (not 64MB).

If there is the large work_mem enough to create hash table for
relation after appending, its cost may be better than pushed-down
plan's cost, then planner will not choose pushed-down plan this patch makes.
So, to make this patch working fine, work_mem size must be smaller than
the hash table size for relation after appending.

This patch needs more comments. Please put comment about not only
what it does but also the reason and other things for it.

OK, I'll put more comments in the code.
But it will take a long time, maybe...

Sure.

-- about namings

Names for functions and variables are needed to be more
appropriate, in other words, needed to be what properly informs
what they are. The followings are the examples of such names.

Thank you for your suggestion.

I also think these names are not good much.
I'll try to make names better , but it maybe take a long time...
Of course, I will use your suggestion as reference.

Thanks.

"added_restrictlist"'s widely distributed as many function
arguemnts and JoinPathExtraData makes me feel
dizzy..

"added_restrictinfo" will be deleted from almost functions
other than try_join_pushdown() in next (v2) patch because
the place of filtering using this info will be changed
from Join node to Scan node and not have to place it
into other than try_join_pushdown().

I'm looking forward to seeing it.

In make_restrictinfos_from_check_constr, the function returns
modified constraint predicates correspond to vars under
hashjoinable join clauses. I don't think expression_tree_mutator
is suitable to do that since it could allow unwanted result when
constraint predicates or join clauses are not simple OpExpr's.

Do you have any example of this situation?
I am trying to find unwanted results you mentioned, but I don't have
any results in this timing. I have a hunch that it will allow unwanted
results because I have thought only about very simple situation for
this function.

As mentioned above.

Otherwise could you give me clear explanation on what it does?

This function transfers CHECK() constraint to filter expression by following
procedures.
(1) Get outer table's CHECK() constraint by using get_relation_constraints().
(2) Walk through expression tree got in (1) by using expression_tree_mutator()
with check_constraint_mutator() and change only outer's Var node to
inner's one according to join clause.

For example, when CHECK() constraint of table A is "num % 4 = 0" and
join clause between table A and B is "A.num = B.data",
then we can get "B.data % 4 = 0" for filtering purpose.

This also accepts more complex join clause like "A.num = B.data * 2",
then we can get "(B.data * 2) % 4 = 0".

In procedure (2), to decide whether to use each join clause for changing
Var node or not, I implement check_constraint_mutator() to judge whether
join clause is hash-joinable or not.

Thanks for the explanation. I think the function has been written considering only rather plain cases. We should put more thought into making the logic clearer, so that we can define the desired/possible capabilities and limitations clearly.

Actually, I want to judge whether OpExpr as top expression tree of
join clause means "=" or not, but I can't find how to do it.

If you know how to do it, please let me know.

I can't see how for now, either :p

But we should at least put more consideration into the mechanism for obtaining the expressions.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#11Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Taiki Kondo (#9)
Re: [Proposal] Table partition + join pushdown

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
Sent: Thursday, October 08, 2015 5:28 PM
To: Kyotaro HORIGUCHI
Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎);
pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

Hello, Horiguchi-san.

Thank you for your comment.

I got some warning on compilation on unused variables and wrong
arguemtn type.

OK, I'll fix it.

I failed to have a query that this patch works on. Could you let
me have some specific example for this patch?

Please find attached.
And also make sure that setting of work_mem is '64kB' (not 64MB).

If there is the large work_mem enough to create hash table for
relation after appending, its cost may be better than pushed-down
plan's cost, then planner will not choose pushed-down plan this patch makes.
So, to make this patch working fine, work_mem size must be smaller than
the hash table size for relation after appending.

This patch needs more comments. Please put comment about not only
what it does but also the reason and other things for it.

OK, I'll put more comments in the code.
But it will take a long time, maybe...

People (including me) can help. Even if your English is not perfect, it is valuable to write down the intention of the code.

-- about namings

Names for functions and variables are needed to be more
appropriate, in other words, needed to be what properly informs
what they are. The followings are the examples of such names.

Thank you for your suggestion.

I also think these names are not good much.
I'll try to make names better , but it maybe take a long time...
Of course, I will use your suggestion as reference.

"added_restrictlist"'s widely distributed as many function
arguemnts and JoinPathExtraData makes me feel
dizzy..

"added_restrictinfo" will be deleted from almost functions
other than try_join_pushdown() in next (v2) patch because
the place of filtering using this info will be changed
from Join node to Scan node and not have to place it
into other than try_join_pushdown().

This restrictinfo is intended to filter out rows obviously unrelated to this join, based on the check constraint on the other side of the join.
So a correct but redundant name would be:
restrictlist_to_drop_unrelated_rows_because_of_check_constraint

How about 'restrictlist_by_constraint' instead?

In make_restrictinfos_from_check_constr, the function returns
modified constraint predicates correspond to vars under
hashjoinable join clauses. I don't think expression_tree_mutator
is suitable to do that since it could allow unwanted result when
constraint predicates or join clauses are not simple OpExpr's.

Do you have any example of this situation?
I am trying to find unwanted results you mentioned, but I don't have
any results in this timing. I have a hunch that it will allow unwanted
results because I have thought only about very simple situation for
this function.

check_constraint_mutator() makes the modified restrictlist by replacing Var nodes only when the join clause is hash-joinable.
That implies an <expr> = <expr> form, thus we can safely replace the expression with the other side.

Of course, we still have cases where we cannot replace expressions simply:
- if a function (or a function called by an operator) has the volatile attribute
  (though who would use a volatile function in the CHECK constraint of a partition?)
- if it is uncertain whether the expression always returns the same result
  (is it possible for the constraint to contain a SubLink?)

I'd like to suggest a white-list approach in this mutator routine.
It means that only immutable expression nodes are allowed into the modified restrictlist.

What to do is:

check_constraint_mutator(...)
{
    if (node == NULL)
        return NULL;
    if (IsA(node, Var))
    {
        :
    }
    else if (node is not obviously immutable)
    {
        /* prohibit mutation if the expression contains an uncertain node */
        context->is_mutated = false;
    }
    return expression_tree_mutator(...)
}

Otherwise could you give me clear explanation on what it does?

This function transfers CHECK() constraint to filter expression by following
procedures.
(1) Get outer table's CHECK() constraint by using get_relation_constraints().
(2) Walk through expression tree got in (1) by using expression_tree_mutator()
with check_constraint_mutator() and change only outer's Var node to
inner's one according to join clause.

For example, when CHECK() constraint of table A is "num % 4 = 0" and
join clause between table A and B is "A.num = B.data",
then we can get "B.data % 4 = 0" for filtering purpose.

This also accepts more complex join clause like "A.num = B.data * 2",
then we can get "(B.data * 2) % 4 = 0".

In procedure (2), to decide whether to use each join clause for changing
Var node or not, I implement check_constraint_mutator() to judge whether
join clause is hash-joinable or not.

Actually, I want to judge whether OpExpr as top expression tree of
join clause means "=" or not, but I can't find how to do it.

If you know how to do it, please let me know.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Taiki Kondo
tai-kondo@yk.jp.nec.com
In reply to: Kyotaro HORIGUCHI (#10)
Re: [Proposal] Table partition + join pushdown

Hello, Horiguchi-san.

Sorry for the late reply.

explain analyze
select data_x, data_y, num from check_test_div join inner_t on check_test_div.id + 1 = inner_t.id;

This makes the mutation fail and then results in an assertion failure.

| TRAP: FailedAssertion("!(list_length(check_constr) ==
| list_length(result))", File: "joinpath.c", Line: 1608)

This is because both 'check_test_div.id + 1' and inner_t.id don't
match the var-side of the constraints.

Thank you for finding a failure example.
This is indeed a bug; I'll fix it.

I don't see clearly what to do about this situation for now, but this
is one of the most important functions for this feature and
should be cleanly designed.

Yes, this function is one of the important features of this patch.

This function makes new filtering conditions from CHECK() constraints.
This reduces the number of rows, making the hash table smaller (or
sorting faster for MergeJoin) so that it fits smaller work_mem environments.

Maybe I must collect realistic examples of CHECK() constraints
used for table partitioning, in order to design this more cleanly.

Best regards,

--
Taiki Kondo

NEC Solution Innovators, Ltd.

-----Original Message-----
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
Sent: Thursday, October 08, 2015 7:04 PM
To: tai-kondo@yk.jp.nec.com
Cc: kaigai@ak.jp.nec.com; aki-iwaasa@vt.jp.nec.com; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

Hello, thank you for the example.

I could see this patch working with it.

In make_restrictinfos_from_check_constr, the function returns
modified constraint predicates correspond to vars under hashjoinable
join clauses. I don't think expression_tree_mutator is suitable to
do that since it could allow unwanted result when constraint
predicates or join clauses are not simple OpExpr's.

Do you have any example of this situation?

As a rather simple case in the test environment created by the provided script, the following query,

explain analyze
select data_x, data_y, num from check_test_div join inner_t on check_test_div.id + 1 = inner_t.id;

makes the mutation fail and then results in an assertion failure.

| TRAP: FailedAssertion("!(list_length(check_constr) ==
| list_length(result))", File: "joinpath.c", Line: 1608)

This is because both 'check_test_div.id + 1' and inner_t.id don't match the var-side of the constraints.

I don't see clearly what to do about this situation for now, but this is one of the most important functions for this feature and should be cleanly designed.

At Thu, 8 Oct 2015 08:28:04 +0000, Taiki Kondo <tai-kondo@yk.jp.nec.com> wrote in <12A9442FBAE80D4E8953883E0B84E0885F9913@BPXM01GP.gisp.nec.co.jp>

Hello, Horiguchi-san.

Thank you for your comment.

I got some compilation warnings about unused variables and a wrong
argument type.

OK, I'll fix it.

I failed to come up with a query that this patch works on. Could you
give me a specific example for this patch?

Please find it attached.
Also make sure that work_mem is set to '64kB' (not 64MB).

If work_mem is large enough to hold the hash table for the appended
relation, the normal plan's cost may beat the pushed-down plan's cost,
and the planner will not choose the pushed-down plan this patch makes.
So, to see this patch working, work_mem must be smaller than the hash
table size for the appended relation.

This patch needs more comments. Please add comments describing not only
what the code does but also why it does so.

OK, I'll put more comments in the code.
But it will take a long time, maybe...

Sure.

-- about namings

Function and variable names need to be more appropriate; in other words,
they should properly convey what they stand for.
The following are examples of such names.

Thank you for your suggestion.

I also think these names are not very good.
I'll try to come up with better names, but it may take a long time...
Of course, I will use your suggestions as a reference.

Thanks.

"added_restrictlist"'s widely distributed as many function arguemnts
and JoinPathExtraData makes me feel dizzy..

"added_restrictinfo" will be deleted from almost functions other than
try_join_pushdown() in next (v2) patch because the place of filtering
using this info will be changed from Join node to Scan node and not
have to place it into other than try_join_pushdown().

I'm looking forward to seeing it.

In make_restrictinfos_from_check_constr, the function returns
modified constraint predicates corresponding to the vars under hashjoinable
join clauses. I don't think expression_tree_mutator is suitable for
that, since it can produce unwanted results when the constraint
predicates or join clauses are not simple OpExprs.

Do you have any example of this situation?
I am trying to find the unwanted results you mentioned, but I haven't
found any at this point. I have a hunch that unwanted results are
possible, because I have only thought about very simple situations for
this function.

As mentioned above.

Otherwise, could you give me a clear explanation of what it does?

This function transforms a CHECK() constraint into a filter expression by
the following procedure.
(1) Get the outer table's CHECK() constraint by using get_relation_constraints().
(2) Walk through the expression tree obtained in (1) by using expression_tree_mutator()
with check_constraint_mutator(), and change only the outer side's Var nodes to
the inner side's according to the join clause.

For example, when the CHECK() constraint of table A is "num % 4 = 0" and
the join clause between tables A and B is "A.num = B.data", then we can get
"B.data % 4 = 0" for filtering purposes.

This also accepts a more complex join clause like "A.num = B.data * 2",
from which we get "(B.data * 2) % 4 = 0".

In procedure (2), to decide whether each join clause can be used for
changing a Var node, check_constraint_mutator() judges whether the join
clause is hash-joinable or not.

Thanks for the explanation. I think the function has been written considering only rather plain cases. We should put more thought into making the logic clearer, so that we can define the desired/possible capability and the limitations clearly.

Actually, I want to judge whether the OpExpr at the top of the join
clause's expression tree means "=" or not, but I can't find out how to do it.

If you know how, please let me know.

I don't see how for now, either :p

But we should at least put more consideration into the mechanism for obtaining the expressions.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13Taiki Kondo
tai-kondo@yk.jp.nec.com
In reply to: Kouhei Kaigai (#11)
3 attachment(s)
Re: [Proposal] Table partition + join pushdown

Hello, KaiGai-san and Horiguchi-san.

I created the v2 patch. Please find it attached.
I believe this patch fixes most of the issues mentioned by
Horiguchi-san, except the naming.

In this v2 patch, the scan node that is originally the inner relation of
the Join node must be a SeqScan (or SampleScan). This limitation comes
from the implementation of try_join_pushdown(), which copies Path nodes
to attach the new filtering conditions converted from CHECK() constraints.

It uses copyObject() for this purpose, so to lift the limitation I must
implement copy functions for other scan Path nodes such as IndexPath,
BitmapHeapPath, TidPath and so on.

By the way, let me show the performance of this feature.
Here are the results I measured in my environment.
They were obtained by running "pushdown_test.v1.large.sql"
with "work_mem" set to "1536kB".
(This file is also attached to this mail.)

[Normal]
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=1851.02..14638.11 rows=300004 width=20) (actual time=88.188..453.926 rows=299992 loops=1)
Hash Cond: (check_test_div.id = inner_t.id)
-> Append (cost=0.00..4911.03 rows=300004 width=20) (actual time=0.089..133.456 rows=300003 loops=1)
-> Seq Scan on check_test_div (cost=0.00..0.00 rows=1 width=20) (actual time=0.003..0.003 rows=0 loops=1)
-> Seq Scan on check_test_div_0 (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.085..40.741 rows=100001 loops=1)
-> Seq Scan on check_test_div_1 (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.023..29.213 rows=100001 loops=1)
-> Seq Scan on check_test_div_2 (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.021..28.592 rows=100001 loops=1)
-> Hash (cost=866.01..866.01 rows=60001 width=8) (actual time=87.970..87.970 rows=60001 loops=1)
Buckets: 32768 Batches: 2 Memory Usage: 1446kB
-> Seq Scan on inner_t (cost=0.00..866.01 rows=60001 width=8) (actual time=0.030..39.133 rows=60001 loops=1)
Planning time: 0.867 ms
Execution time: 470.269 ms
(12 rows)

[With this feature]
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Append (cost=0.01..10651.37 rows=300004 width=20) (actual time=55.548..377.615 rows=299992 loops=1)
-> Hash Join (cost=0.01..1091.04 rows=1 width=20) (actual time=0.017..0.017 rows=0 loops=1)
Hash Cond: (inner_t.id = check_test_div.id)
-> Seq Scan on inner_t (cost=0.00..866.01 rows=60001 width=8) (never executed)
-> Hash (cost=0.00..0.00 rows=1 width=20) (actual time=0.003..0.003 rows=0 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 8kB
-> Seq Scan on check_test_div (cost=0.00..0.00 rows=1 width=20) (actual time=0.002..0.002 rows=0 loops=1)
-> Hash Join (cost=1169.76..3186.78 rows=100001 width=20) (actual time=55.530..149.205 rows=100001 loops=1)
Hash Cond: (check_test_div_0.id = inner_t.id)
-> Seq Scan on check_test_div_0 (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.058..34.268 rows=100001 loops=1)
-> Hash (cost=1166.01..1166.01 rows=300 width=8) (actual time=55.453..55.453 rows=20001 loops=1)
Buckets: 32768 (originally 1024) Batches: 1 (originally 1) Memory Usage: 1038kB
-> Seq Scan on inner_t (cost=0.00..1166.01 rows=300 width=8) (actual time=0.031..43.590 rows=20001 loops=1)
Filter: ((id % 3) = 0)
Rows Removed by Filter: 40000
-> Hash Join (cost=1169.76..3186.78 rows=100001 width=20) (actual time=27.942..97.582 rows=99996 loops=1)
Hash Cond: (check_test_div_1.id = inner_t.id)
-> Seq Scan on check_test_div_1 (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.030..25.514 rows=100001 loops=1)
-> Hash (cost=1166.01..1166.01 rows=300 width=8) (actual time=27.890..27.890 rows=20000 loops=1)
Buckets: 32768 (originally 1024) Batches: 1 (originally 1) Memory Usage: 1038kB
-> Seq Scan on inner_t (cost=0.00..1166.01 rows=300 width=8) (actual time=0.014..21.688 rows=20000 loops=1)
Filter: ((id % 3) = 1)
Rows Removed by Filter: 40001
-> Hash Join (cost=1169.76..3186.78 rows=100001 width=20) (actual time=27.651..97.755 rows=99995 loops=1)
Hash Cond: (check_test_div_2.id = inner_t.id)
-> Seq Scan on check_test_div_2 (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.026..25.620 rows=100001 loops=1)
-> Hash (cost=1166.01..1166.01 rows=300 width=8) (actual time=27.599..27.599 rows=20000 loops=1)
Buckets: 32768 (originally 1024) Batches: 1 (originally 1) Memory Usage: 1038kB
-> Seq Scan on inner_t (cost=0.00..1166.01 rows=300 width=8) (actual time=0.017..21.307 rows=20000 loops=1)
Filter: ((id % 3) = 2)
Rows Removed by Filter: 40001
Planning time: 1.876 ms
Execution time: 394.007 ms
(33 rows)

The value of "Batches" is 2 on the Hash node in the normal plan,
but it is 1 on all Hash nodes with this feature.

This means that the hash table is not split when this feature is used.

Therefore, PostgreSQL with this feature is faster than the normal one in this case
(470.269 ms normal vs. 394.007 ms with this feature).

I think this is a large benefit of this feature.

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

-----Original Message-----
From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com]
Sent: Thursday, October 15, 2015 10:21 AM
To: Kondo Taiki(近藤 太樹); Kyotaro HORIGUCHI
Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] [Proposal] Table partition + join pushdown

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
Sent: Thursday, October 08, 2015 5:28 PM
To: Kyotaro HORIGUCHI
Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎);
pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

Hello, Horiguchi-san.

Thank you for your comment.

I got some compilation warnings about unused variables and a wrong
argument type.

OK, I'll fix it.

I failed to come up with a query that this patch works on. Could you
give me a specific example for this patch?

Please find it attached.
Also make sure that work_mem is set to '64kB' (not 64MB).

If work_mem is large enough to hold the hash table for the appended
relation, the normal plan's cost may beat the pushed-down plan's cost,
and the planner will not choose the pushed-down plan this patch makes.
So, to see this patch working, work_mem must be smaller than the hash
table size for the appended relation.

This patch needs more comments. Please add comments describing not only
what the code does but also why it does so.

OK, I'll put more comments in the code.
But it will take a long time, maybe...

People (including me) can help. Even if your English is not perfect, it is important to write down the intention of the code.

-- about namings

Function and variable names need to be more appropriate; in other words,
they should properly convey what they stand for.
The following are examples of such names.

Thank you for your suggestion.

I also think these names are not very good.
I'll try to come up with better names, but it may take a long time...
Of course, I will use your suggestions as a reference.

"added_restrictlist"'s widely distributed as many function arguemnts
and JoinPathExtraData makes me feel dizzy..

"added_restrictinfo" will be deleted from almost functions other than
try_join_pushdown() in next (v2) patch because the place of filtering
using this info will be changed from Join node to Scan node and not
have to place it into other than try_join_pushdown().

This restrictinfo is intended to filter out obviously unrelated rows in this join, based on the check constraint on the other side of the join.
So, a correct but redundant name would be:
restrictlist_to_drop_unrelated_rows_because_of_check_constraint

How about 'restrictlist_by_constraint' instead?

In make_restrictinfos_from_check_constr, the function returns
modified constraint predicates corresponding to the vars under hashjoinable
join clauses. I don't think expression_tree_mutator is suitable for
that, since it can produce unwanted results when the constraint
predicates or join clauses are not simple OpExprs.

Do you have any example of this situation?
I am trying to find the unwanted results you mentioned, but I haven't
found any at this point. I have a hunch that unwanted results are
possible, because I have only thought about very simple situations for
this function.

check_constraint_mutator makes the modified restrictlist by replacing Var nodes only when the join clause is hash-joinable.
That implies the <expr> = <expr> form, so we can safely replace the expression with the other side.

Of course, we still have cases where we cannot replace expressions simply:
- if a function (or a function called by an operator) is volatile
(who would use a volatile function in a CHECK constraint for partitioning?);
- if it is uncertain whether the expression always returns the same result
(is it possible for the constraint to contain a SubLink?).

I'd like to suggest a white-list approach in this mutator routine.
It means that only immutable expression nodes are allowed into the modified restrictlist.

Roughly, the change would look like this:

check_constraint_mutator(...)
{
    if (node == NULL)
        return NULL;
    if (IsA(node, Var))
    {
        :
    }
    else if (node is not obviously immutable)
    {
        /* prohibit building the expression if it contains an uncertain node */
        context->is_mutated = false;
    }
    return expression_tree_mutator(...)
}

Otherwise, could you give me a clear explanation of what it does?

This function transforms a CHECK() constraint into a filter expression by
the following procedure.
(1) Get the outer table's CHECK() constraint by using get_relation_constraints().
(2) Walk through the expression tree obtained in (1) by using expression_tree_mutator()
with check_constraint_mutator(), and change only the outer side's Var nodes to
the inner side's according to the join clause.

For example, when the CHECK() constraint of table A is "num % 4 = 0" and
the join clause between tables A and B is "A.num = B.data", then we can get
"B.data % 4 = 0" for filtering purposes.

This also accepts a more complex join clause like "A.num = B.data * 2",
from which we get "(B.data * 2) % 4 = 0".

In procedure (2), to decide whether each join clause can be used for
changing a Var node, check_constraint_mutator() judges whether the join
clause is hash-joinable or not.

Actually, I want to judge whether the OpExpr at the top of the join
clause's expression tree means "=" or not, but I can't find out how to do it.

If you know how, please let me know.

Thanks,
--
NEC Business Creation Division / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com>

Attachments:

pushdown_test.v1.large.sqlapplication/octet-stream; name=pushdown_test.v1.large.sqlDownload
pushdown_test.v1.sqlapplication/octet-stream; name=pushdown_test.v1.sqlDownload
join_pushdown.v2.patchapplication/octet-stream; name=join_pushdown.v2.patchDownload
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c176ff9..63402cd 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1963,12 +1963,78 @@ _copyOnConflictExpr(const OnConflictExpr *from)
 /* ****************************************************************
  *						relation.h copy functions
  *
- * We don't support copying RelOptInfo, IndexOptInfo, or Path nodes.
+ * We don't support copying RelOptInfo or IndexOptInfo nodes, and Path
  * There are some subsidiary structs that are useful to copy, though.
  * ****************************************************************
  */
 
 /*
+ * CopyPathFields
+ */
+static void
+CopyPathFields(const Path *from, Path *newnode)
+{
+	COPY_SCALAR_FIELD(pathtype);
+
+	/*
+	 * We use COPY_SCALAR_FIELD() for parent instead of COPY_NODE_FIELD()
+	 * because RelOptInfo contains the Path it is made from, so copying
+	 * it as a node would fall into an infinite loop.
+	 */
+	COPY_SCALAR_FIELD(parent);
+
+	COPY_SCALAR_FIELD(param_info);
+
+	COPY_SCALAR_FIELD(rows);
+	COPY_SCALAR_FIELD(startup_cost);
+	COPY_SCALAR_FIELD(total_cost);
+
+	COPY_NODE_FIELD(pathkeys);
+}
+
+/*
+ * _copyPath
+ */
+static Path *
+_copyPath(const Path *from)
+{
+	Path *newnode = makeNode(Path);
+
+	CopyPathFields(from, newnode);
+
+	return newnode;
+}
+
+/*
+ * _copyIndexPath
+ * XXX Need to make copy function for IndexOptInfo, etc.
+ */
+static IndexPath *
+_copyIndexPath(const IndexPath *from)
+{
+	IndexPath *newnode = makeNode(IndexPath);
+
+	CopyPathFields(&from->path, &newnode->path);
+
+	COPY_NODE_FIELD(indexinfo);
+	COPY_NODE_FIELD(indexclauses);
+	COPY_NODE_FIELD(indexquals);
+	COPY_NODE_FIELD(indexqualcols);
+	COPY_NODE_FIELD(indexorderbys);
+	COPY_NODE_FIELD(indexorderbycols);
+	COPY_SCALAR_FIELD(indexscandir);
+	COPY_SCALAR_FIELD(indextotalcost);
+	COPY_SCALAR_FIELD(indexselectivity);
+
+	return newnode;
+}
+
+/*
+ * XXX Need to make copy function for BitmapHeapPath
+ * and TidPath.
+ */
+
+/*
  * _copyPathKey
  */
 static PathKey *
@@ -4506,6 +4572,12 @@ copyObject(const void *from)
 			/*
 			 * RELATION NODES
 			 */
+		case T_Path:
+			retval = _copyPath(from);
+			break;
+		case T_IndexPath:
+			retval = _copyIndexPath(from);
+			break;
 		case T_PathKey:
 			retval = _copyPathKey(from);
 			break;
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index a35c881..dd5d38c 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -18,9 +18,22 @@
 
 #include "executor/executor.h"
 #include "foreign/fdwapi.h"
+#include "nodes/nodeFuncs.h"
+#include "nodes/nodes.h"
+#include "optimizer/clauses.h"
 #include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "optimizer/plancat.h"
+#include "optimizer/restrictinfo.h"
+#include "rewrite/rewriteManip.h"
+#include "utils/lsyscache.h"
+
+typedef struct
+{
+	List	*joininfo;
+	bool	 is_substituted;
+} substitution_node_context;
 
 /* Hook for plugins to get control in add_paths_to_joinrel() */
 set_join_pathlist_hook_type set_join_pathlist_hook = NULL;
@@ -45,6 +58,11 @@ static List *select_mergejoin_clauses(PlannerInfo *root,
 						 JoinType jointype,
 						 bool *mergejoin_allowed);
 
+static void try_join_pushdown(PlannerInfo *root,
+						  RelOptInfo *joinrel, RelOptInfo *outer_rel,
+						  RelOptInfo *inner_rel,
+						  List *restrictlist);
+
 
 /*
  * add_paths_to_joinrel
@@ -82,6 +100,14 @@ add_paths_to_joinrel(PlannerInfo *root,
 	bool		mergejoin_allowed = true;
 	ListCell   *lc;
 
+	/*
+	 * Try to push Join down under Append
+	 */
+	if (!IS_OUTER_JOIN(jointype))
+	{
+		try_join_pushdown(root, joinrel, outerrel, innerrel, restrictlist);
+	}
+
 	extra.restrictlist = restrictlist;
 	extra.mergeclause_list = NIL;
 	extra.sjinfo = sjinfo;
@@ -1474,3 +1500,468 @@ select_mergejoin_clauses(PlannerInfo *root,
 
 	return result_list;
 }
+
+/*
+ * Try to substitute Var node according to join conditions.
+ * This process consists of the following steps.
+ *
+ * 1. Check whether the Var node matches the left/right side of
+ *    one of the join conditions.
+ * 2. If it does, replace the Var node with the opposite expression
+ *    node of the join condition.
+ *
+ * For example, let's assume that we have the following expression and
+ * join condition.
+ * Expression       : A.num % 4 = 1
+ * Join condition   : A.num = B.data + 2
+ * In this case, we can get the following expression.
+ *    (B.data + 2) % 4 = 1
+ */
+static Node *
+substitute_node_with_join_cond(Node *node, substitution_node_context *context)
+{
+	/* Failed to substitute. Abort. */
+	if (!context->is_substituted)
+		return (Node *) copyObject(node);
+
+	if (node == NULL)
+		return NULL;
+
+	if (IsA(node, Var))
+	{
+		List		*join_cond = context->joininfo;
+		ListCell	*lc;
+
+		Assert(list_length(join_cond) > 0);
+
+		foreach (lc, join_cond)
+		{
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+			Expr *expr = rinfo->clause;
+
+			/*
+			 * Check whether the OpExpr of the join clause means "=".
+			 */
+			if (!rinfo->can_join ||
+				!IsA(expr, OpExpr) ||
+				!op_hashjoinable(((OpExpr *) expr)->opno,
+								exprType(get_leftop(expr))))
+				continue;
+
+			if (equal(get_leftop(expr), node))
+			{
+				/*
+				 * This node is equal to LEFT node of join condition,
+				 * thus will be replaced with RIGHT clause.
+				 */
+				return (Node *) copyObject(get_rightop(expr));
+			}
+			else
+			if (equal(get_rightop(expr), node))
+			{
+				/*
+				 * This node is equal to RIGHT node of join condition,
+				 * thus will be replaced with LEFT clause.
+				 */
+				return (Node *) copyObject(get_leftop(expr));
+			}
+		}
+
+		/* Unfortunately, substitution failed. */
+		context->is_substituted = false;
+		return (Node *) copyObject(node);
+	}
+
+	return expression_tree_mutator(node, substitute_node_with_join_cond, context);
+}
+
+/*
+ * Create RestrictInfo_List from CHECK() constraints.
+ *
+ * This function creates a list of RestrictInfo nodes from CHECK()
+ * constraints according to the expressions of the join clauses.
+ *
+ * For example, let's assume that we have the following CHECK() constraint
+ * for table A and join clause between tables A and B.
+ * CHECK of table A      : 0 <= num AND num <= 100
+ * JOIN CLAUSE           : A.num = B.data
+ * Under these conditions, we can get the following by substitution.
+ *    0 <= B.data AND B.data <= 100
+ *
+ * We can use these restrictions to reduce the number of result rows.
+ * This means we can make the Sort faster in MergeJoin by reducing rows,
+ * and also make the hash table smaller in HashJoin so that it fits
+ * smaller work_mem environments.
+ */
+static List *
+create_rinfo_from_check_constr(PlannerInfo *root, List *joininfo,
+									 RelOptInfo *outer_rel, bool *succeed)
+{
+	List			*result = NIL;
+	RangeTblEntry	*childRTE = root->simple_rte_array[outer_rel->relid];
+	List			*check_constr =
+						get_relation_constraints(root, childRTE->relid,
+													outer_rel, false);
+	ListCell		*lc;
+	substitution_node_context	context;
+
+	if (list_length(check_constr) <= 0)
+	{
+		*succeed = true;
+		return NIL;
+	}
+
+	context.joininfo = joininfo;
+	context.is_substituted = true;
+
+	/*
+	 * Try to convert CHECK() constraints to filter expressions.
+	 */
+	foreach(lc, check_constr)
+	{
+		Node *substituted =
+				expression_tree_mutator((Node *) lfirst(lc),
+										substitute_node_with_join_cond,
+										(void *) &context);
+
+		if (!context.is_substituted)
+		{
+			*succeed = false;
+			list_free_deep(check_constr);
+			return NIL;
+		}
+		result = lappend(result, substituted);
+	}
+
+	Assert(list_length(check_constr) == list_length(result));
+	list_free_deep(check_constr);
+
+	return make_restrictinfos_from_actual_clauses(root, result);
+}
+
+/*
+ * Convert parent's join clauses to child's.
+ */
+static List *
+convert_parent_joinclauses_to_child(PlannerInfo *root, List *join_clauses,
+									RelOptInfo *outer_rel)
+{
+	Index		parent_relid =
+					find_childrel_appendrelinfo(root, outer_rel)->parent_relid;
+	List		*clauses_parent = get_actual_clauses(join_clauses);
+	List		*clauses_child = NIL;
+	ListCell	*lc;
+
+	foreach(lc, clauses_parent)
+	{
+		Node	*one_clause_child = (Node *) copyObject(lfirst(lc));
+
+		ChangeVarNodes(one_clause_child, parent_relid, outer_rel->relid, 0);
+		clauses_child = lappend(clauses_child, one_clause_child);
+	}
+
+	return make_restrictinfos_from_actual_clauses(root, clauses_child);
+}
+
+static inline List *
+extract_join_clauses(List *restrictlist, RelOptInfo *outer_prel,
+						RelOptInfo *inner_rel)
+{
+	List		*result = NIL;
+	ListCell	*lc;
+
+	foreach (lc, restrictlist)
+	{
+		RestrictInfo	*rinfo = (RestrictInfo *) lfirst(lc);
+
+		if (clause_sides_match_join(rinfo, outer_prel, inner_rel))
+			result = lappend(result, rinfo);
+	}
+
+	return result;
+}
+
+/*
+ * try_join_pushdown
+ *
+ * When outer-path of JOIN is AppendPath, we can rewrite path-tree with
+ * relocation of JoinPath across AppendPath, to generate equivalent
+ * results, like a diagram below.
+ * This adjustment gives us a few performance benefits when the relations
+ * scanned by the sub-plans of the Append node have CHECK() constraints -
+ * typically, when they are configured as a partitioned table.
+ *
+ * In case of an INNER JOIN with an equivalence join condition, like A = B,
+ * we can exclude some inner rows that are obviously unreferenced, if the
+ * outer side has CHECK() constraints that contain the join keys.
+ * The CHECK() constraints ensure that all rows within the outer relation
+ * satisfy the condition; in other words, any inner row that does not
+ * satisfy the condition (with adjustment using equivalence of join keys)
+ * can never match any outer row.
+ *
+ * Once we can reduce the number of inner rows, there are two benefits:
+ * 1. HashJoin may avoid splitting the hash table even if preloading the
+ *    entire inner relation would exceed work_mem.
+ * 2. MergeJoin may get away with a smaller Sort, because quick-sort is an
+ *    O(N log N) problem; reducing the rows to be sorted on both sides
+ *    cuts CPU cost more than linearly.
+ *
+ * [BEFORE]
+ * JoinPath ... (parent.X = inner.Y)
+ *  -> AppendPath on parent
+ *    -> ScanPath on child_1 ... CHECK(hash(X) % 3 = 0)
+ *    -> ScanPath on child_2 ... CHECK(hash(X) % 3 = 1)
+ *    -> ScanPath on child_3 ... CHECK(hash(X) % 3 = 2)
+ *  -> ScanPath on inner
+ *
+ * [AFTER]
+ * AppendPath
+ *  -> JoinPath ... (child_1.X = inner.Y)
+ *    -> ScanPath on child_1 ... CHECK(hash(X) % 3 = 0)
+ *    -> ScanPath on inner ... filter (hash(Y) % 3 = 0)
+ *  -> JoinPath ... (child_2.X = inner.Y)
+ *    -> ScanPath on child_2 ... CHECK(hash(X) % 3 = 1)
+ *    -> ScanPath on inner ... filter (hash(Y) % 3 = 1)
+ *  -> JoinPath ... (child_3.X = inner.Y)
+ *    -> ScanPath on child_3 ... CHECK(hash(X) % 3 = 2)
+ *    -> ScanPath on inner ... filter (hash(Y) % 3 = 2)
+ *
+ * The point to focus on is the filter condition attached to the child
+ * relation's scan. It is the clause of the CHECK() constraint, but with X
+ * replaced by Y using the equivalence join condition.
+ */
+static void
+try_join_pushdown(PlannerInfo *root,
+				  RelOptInfo *joinrel, RelOptInfo *outer_rel,
+				  RelOptInfo *inner_rel,
+				  List *restrictlist)
+{
+	AppendPath	*outer_path;
+	ListCell	*lc;
+	List		*joinclauses_parent;
+	List		*alter_append_subpaths = NIL;
+
+	Assert(outer_rel->cheapest_total_path != NULL);
+
+	/* When specified outer path is not an AppendPath, nothing to do here. */
+	if (!IsA(outer_rel->cheapest_total_path, AppendPath))
+	{
+		elog(DEBUG1, "Outer path is not an AppendPath. Do nothing.");
+		return;
+	}
+
+	outer_path = (AppendPath *) outer_rel->cheapest_total_path;
+
+	if (outer_rel->rtekind != RTE_RELATION)
+	{
+		elog(DEBUG1, "Outer Relation is not for table scan. Give up.");
+		return;
+	}
+
+	switch (inner_rel->cheapest_total_path->pathtype)
+	{
+	case T_SeqScan :
+	case T_SampleScan :
+	case T_IndexScan :
+	case T_IndexOnlyScan :
+	case T_BitmapHeapScan :
+	case T_TidScan :
+		/* Do nothing. No-op */
+		break;
+	default :
+		{
+			elog(DEBUG1, "Type of Inner path is not supported yet. Give up.");
+			return;
+		}
+	}
+
+	/*
+	 * Extract the join clauses used to convert CHECK() constraints.
+	 * Converting the constraints does not clobber this list, so we only
+	 * need to build it once.
+	 */
+	joinclauses_parent = extract_join_clauses(restrictlist, outer_rel, inner_rel);
+	if (list_length(joinclauses_parent) <= 0)
+	{
+		elog(DEBUG1, "No join clauses specified. Give up.");
+		return;
+	}
+
+	if (list_length(inner_rel->ppilist) > 0)
+	{
+		elog(DEBUG1, "ParamPathInfo is already set in inner_rel. Can't pushdown.");
+		return;
+	}
+
+	/*
+	 * Make new joinrel between each of outer path's sub-paths and inner path.
+	 */
+	foreach(lc, outer_path->subpaths)
+	{
+		RelOptInfo	*orig_outer_rel = ((Path *) lfirst(lc))->parent;
+		RelOptInfo	*alter_outer_rel;
+		Path		*alter_path = NULL;
+		List		*joinclauses_child;
+		List		*restrictlist_by_check_constr;
+		bool		is_valid;
+		List		**join_rel_level;
+
+		Assert(!IS_DUMMY_REL(orig_outer_rel));
+
+		/*
+		 * The join clauses reference the parent's relid, so we must
+		 * translate them to reference the child's relid instead.
+		 */
+		joinclauses_child = convert_parent_joinclauses_to_child(root, joinclauses_parent,
+													orig_outer_rel);
+
+		/*
+		 * Build a RestrictInfo list from the CHECK() constraints of the outer
+		 * table. "is_valid" indicates whether building the list succeeded.
+		 */
+		restrictlist_by_check_constr =
+				create_rinfo_from_check_constr(root, joinclauses_child,
+													orig_outer_rel, &is_valid);
+
+		if (!is_valid)
+		{
+			elog(DEBUG1, "Join clause doesn't match with CHECK() constraint. "
+					"Can't pushdown.");
+			list_free_deep(alter_append_subpaths);
+			list_free(joinclauses_parent);
+			return;
+		}
+
+		if (list_length(restrictlist_by_check_constr) > 0)
+		{
+			/* Prepare ParamPathInfo for RestrictInfos by CHECK constraints. */
+			ParamPathInfo *newppi = makeNode(ParamPathInfo);
+
+			newppi->ppi_req_outer = NULL;
+			newppi->ppi_rows =
+					get_parameterized_baserel_size(root,
+													inner_rel,
+													restrictlist_by_check_constr);
+			newppi->ppi_clauses = restrictlist_by_check_constr;
+
+			/* Copy Path of inner relation, and specify newppi to it. */
+			alter_path = copyObject(inner_rel->cheapest_total_path);
+			alter_path->param_info = newppi;
+
+			/* Re-calculate costs of alter_path */
+			switch (alter_path->pathtype)
+			{
+			case T_SeqScan :
+				cost_seqscan(alter_path, root, inner_rel, newppi);
+				break;
+			case T_SampleScan :
+				cost_samplescan(alter_path, root, inner_rel, newppi);
+				break;
+			case T_IndexScan :
+			case T_IndexOnlyScan :
+				{
+					IndexPath *ipath = (IndexPath *) alter_path;
+
+					cost_index(ipath, root, 1.0);
+				}
+				break;
+			case T_BitmapHeapScan :
+				{
+					BitmapHeapPath *bpath = (BitmapHeapPath *) alter_path;
+
+					cost_bitmap_heap_scan(&bpath->path, root, inner_rel,
+							newppi, bpath->bitmapqual, 1.0);
+				}
+				break;
+			case T_TidScan :
+				{
+					TidPath *tpath = (TidPath *) alter_path;
+
+					cost_tidscan(&tpath->path, root, inner_rel,
+							tpath->tidquals, newppi);
+				}
+				break;
+			default:
+				break;
+			}
+
+		/*
+		 * Append this path to the pathlist temporarily.
+		 * It will be removed after make_join_rel() returns.
+		 */
+			inner_rel->pathlist = lappend(inner_rel->pathlist, alter_path);
+			set_cheapest(inner_rel);
+		}
+
+		/*
+		 * NOTE: root->join_rel_level is used to track the candidate join
+		 * relations for each level; these relations are later consolidated
+		 * into one relation.
+		 * (See the comment in standard_join_search.)
+		 *
+		 * Even though we construct RelOptInfo nodes for the child relations
+		 * of the Append node, these relations must not appear as join
+		 * candidates at a later stage. So we temporarily clear the list
+		 * while make_join_rel() runs for the child relations.
+		 */
+		join_rel_level = root->join_rel_level;
+		root->join_rel_level = NULL;
+
+		/*
+		 * Create new joinrel (as a sub-path of Append).
+		 */
+		alter_outer_rel =
+				make_join_rel(root, orig_outer_rel, inner_rel);
+
+		/* restore the join_rel_level */
+		root->join_rel_level = join_rel_level;
+
+		Assert(alter_outer_rel != NULL);
+
+		if (alter_path)
+		{
+			/*
+			 * Remove the (temporarily added) alter_path from the pathlist.
+			 * alter_path may be the inner/outer path of a JoinPath made by
+			 * make_join_rel() above, so we must not free alter_path itself.
+			 */
+			inner_rel->pathlist = list_delete_ptr(inner_rel->pathlist, alter_path);
+			set_cheapest(inner_rel);
+		}
+
+		if (IS_DUMMY_REL(alter_outer_rel))
+		{
+			pfree(alter_outer_rel);
+			continue;
+		}
+
+		/*
+		 * We must check that alter_outer_rel has at least one path, because
+		 * add_path() sometimes refuses to add a new path to the RelOptInfo.
+		 */
+		if (list_length(alter_outer_rel->pathlist) <= 0)
+		{
+			/*
+			 * Sadly, no paths were added. This means the pushdown has
+			 * failed, so clean up here.
+			 */
+			list_free_deep(alter_append_subpaths);
+			pfree(alter_outer_rel);
+			list_free(joinclauses_parent);
+			elog(DEBUG1, "Join pushdown failed.");
+			return;
+		}
+
+		set_cheapest(alter_outer_rel);
+		Assert(alter_outer_rel->cheapest_total_path != NULL);
+		alter_append_subpaths = lappend(alter_append_subpaths,
+									alter_outer_rel->cheapest_total_path);
+	}
+
+	/* Join pushdown succeeded. Add the new path to the original joinrel. */
+	add_path(joinrel,
+			(Path *) create_append_path(joinrel, alter_append_subpaths, NULL));
+
+	list_free(joinclauses_parent);
+	elog(DEBUG1, "Join pushdown succeeded.");
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 791b64e..4c18572 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4230,8 +4230,14 @@ prepare_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
 				/*
 				 * Ignore child members unless they match the rel being
 				 * sorted.
+				 *
+				 * If this is called from make_sort_from_pathkeys(),
+				 * relids may be NULL. In this case, we must not ignore child
+				 * members, because the inner/outer plan of a pushed-down
+				 * merge join is always a child table.
 				 */
-				if (em->em_is_child &&
+				if (relids != NULL &&
+					em->em_is_child &&
 					!bms_equal(em->em_relids, relids))
 					continue;
 
@@ -4344,8 +4350,13 @@ find_ec_member_for_tle(EquivalenceClass *ec,
 
 		/*
 		 * Ignore child members unless they match the rel being sorted.
+		 *
+		 * If this is called from make_sort_from_pathkeys(), relids may be NULL.
+		 * In this case, we must not ignore child members, because the
+		 * inner/outer plan of a pushed-down merge join is always a child table.
 		 */
-		if (em->em_is_child &&
+		if (relids != NULL &&
+			em->em_is_child &&
 			!bms_equal(em->em_relids, relids))
 			continue;
 
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 9442e5f..c137b09 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -54,9 +54,6 @@ get_relation_info_hook_type get_relation_info_hook = NULL;
 static bool infer_collation_opclass_match(InferenceElem *elem, Relation idxRel,
 							  List *idxExprs);
 static int32 get_rel_data_width(Relation rel, int32 *attr_widths);
-static List *get_relation_constraints(PlannerInfo *root,
-						 Oid relationObjectId, RelOptInfo *rel,
-						 bool include_notnull);
 static List *build_index_tlist(PlannerInfo *root, IndexOptInfo *index,
 				  Relation heapRelation);
 
@@ -1022,7 +1019,7 @@ get_relation_data_width(Oid relid, int32 *attr_widths)
  * run, and in many cases it won't be invoked at all, so there seems no
  * point in caching the data in RelOptInfo.
  */
-static List *
+List *
 get_relation_constraints(PlannerInfo *root,
 						 Oid relationObjectId, RelOptInfo *rel,
 						 bool include_notnull)
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 68a93a1..f60ef98 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -496,19 +496,24 @@ build_joinrel_tlist(PlannerInfo *root, RelOptInfo *joinrel,
 {
 	Relids		relids = joinrel->relids;
 	ListCell   *vars;
+	int			nth = 0;
 
 	foreach(vars, input_rel->reltargetlist)
 	{
 		Var		   *var = (Var *) lfirst(vars);
 		RelOptInfo *baserel;
 		int			ndx;
+		bool		is_needed = false;
 
 		/*
 		 * Ignore PlaceHolderVars in the input tlists; we'll make our own
 		 * decisions about whether to copy them.
 		 */
 		if (IsA(var, PlaceHolderVar))
+		{
+			nth++;
 			continue;
+		}
 
 		/*
 		 * Otherwise, anything in a baserel or joinrel targetlist ought to be
@@ -521,15 +526,84 @@ build_joinrel_tlist(PlannerInfo *root, RelOptInfo *joinrel,
 
 		/* Get the Var's original base rel */
 		baserel = find_base_rel(root, var->varno);
+		ndx = var->varattno - baserel->min_attr;
+
+		/*
+		 * We must handle the case of join pushdown.
+		 */
+		if (input_rel->reloptkind == RELOPT_OTHER_MEMBER_REL)
+		{
+			/* Get the Var's PARENT base rel */
+			Index	parent_relid =
+						find_childrel_appendrelinfo(root, input_rel)->parent_relid;
+			RelOptInfo *parent_rel = find_base_rel(root, parent_relid);
+			Var		*parent_var =
+						(Var *) list_nth(parent_rel->reltargetlist, nth);
+			int		parent_ndx = parent_var->varattno - parent_rel->min_attr;
+			/* Relids have included parent_rel's instead of input_rel's. */
+			Relids	relids_tmp =
+					bms_del_members(bms_copy(relids), input_rel->relids);
+
+			relids_tmp = bms_union(relids_tmp, parent_rel->relids);
+
+			Assert(ndx == parent_ndx);
+
+			is_needed =
+					(bms_nonempty_difference(
+							parent_rel->attr_needed[parent_ndx],
+							relids_tmp));
+
+			bms_free(relids_tmp);
+		}
+		else
+		{
+			Relids	relids_tmp =
+					bms_del_members(bms_copy(relids), input_rel->relids);
+			int		another_relid = -1;
+
+			/* Try to detect Inner relation of pushed-down join. */
+			if (bms_get_singleton_member(relids_tmp, &another_relid))
+			{
+				RelOptInfo	*another_rel =
+						find_base_rel(root, another_relid);
+
+				if (another_rel->reloptkind == RELOPT_OTHER_MEMBER_REL)
+				{
+					/* This may be inner relation of pushed-down join. */
+					Index	parent_relid =
+								find_childrel_appendrelinfo(root, another_rel)->parent_relid;
+					RelOptInfo *parent_rel = find_base_rel(root, parent_relid);
+
+					bms_free(relids_tmp);
+					relids_tmp =
+							bms_union(input_rel->relids, parent_rel->relids);
+				}
+			}
+
+			if (!bms_is_subset(input_rel->relids, relids_tmp))
+			{
+				/* Can't detect inner relation of pushed-down join */
+				bms_free(relids_tmp);
+				relids_tmp = bms_copy(relids);
+			}
+
+			is_needed =
+					(bms_nonempty_difference(
+							baserel->attr_needed[ndx],
+							relids_tmp));
+
+			bms_free(relids_tmp);
+		}
 
 		/* Is it still needed above this joinrel? */
-		ndx = var->varattno - baserel->min_attr;
-		if (bms_nonempty_difference(baserel->attr_needed[ndx], relids))
+		if (is_needed)
 		{
 			/* Yup, add it to the output */
 			joinrel->reltargetlist = lappend(joinrel->reltargetlist, var);
 			joinrel->width += baserel->attr_widths[ndx];
 		}
+
+		nth++;
 	}
 }
 
diff --git a/src/include/optimizer/plancat.h b/src/include/optimizer/plancat.h
index 11e7d4d..f799a5b 100644
--- a/src/include/optimizer/plancat.h
+++ b/src/include/optimizer/plancat.h
@@ -28,6 +28,10 @@ extern PGDLLIMPORT get_relation_info_hook_type get_relation_info_hook;
 extern void get_relation_info(PlannerInfo *root, Oid relationObjectId,
 				  bool inhparent, RelOptInfo *rel);
 
+extern List *get_relation_constraints(PlannerInfo *root,
+						 Oid relationObjectId, RelOptInfo *rel,
+						 bool include_notnull);
+
 extern List *infer_arbiter_indexes(PlannerInfo *root);
 
 extern void estimate_rel_size(Relation rel, int32 *attr_widths,
#14Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Taiki Kondo (#13)
Re: [Proposal] Table partition + join pushdown

Hi, I put my comments towards the patch as follows.

Overall comments
----------------
* I think the enhancement in copyfuncs.c should be a separate
patch; that is the more graceful manner. At this moment, there are
fewer than 20 Path-derived type definitions, so it is much easier
work than the entire Plan node support we did recently.
(How about other folks' opinions?)

* Can you integrate the attached test cases as a regression test?
That is the more generic way, and it allows people to detect problems
if the relevant feature gets broken in future updates.

* The naming "join pushdown" is a bit misleading, because other
components also use this term for a different purpose.
I'd like to suggest try_pullup_append_across_join.
Any ideas from native English speakers?

Patch review
------------

At try_join_pushdown:
+   /* When specified outer path is not an AppendPath, nothing to do here. */
+   if (!IsA(outer_rel->cheapest_total_path, AppendPath))
+   {
+       elog(DEBUG1, "Outer path is not an AppendPath. Do nothing.");
+       return;
+   }
It checks whether the cheapest_total_path is an AppendPath at the head
of this function. This ought to be a loop that walks the pathlist of
the RelOptInfo, because multiple path nodes might still be alive
without being the cheapest_total_path.
+   switch (inner_rel->cheapest_total_path->pathtype)
+
Also, we can construct the new Append node if at least one of the
path nodes within inner_rel's pathlist is supported.
+   if (list_length(inner_rel->ppilist) > 0)
+   {
+       elog(DEBUG1, "ParamPathInfo is already set in inner_rel. Can't pushdown.");
+       return;
+   }
+
You may need to explain why this feature uses ParamPathInfo here.
It seems to me a good hack to attach additional qualifiers to the
underlying inner scan node, even if it is not a direct child of the
inner relation.
However, people may have a different opinion.
+static List *
+convert_parent_joinclauses_to_child(PlannerInfo *root, List *join_clauses,
+                                   RelOptInfo *outer_rel)
+{
+   Index       parent_relid =
+                   find_childrel_appendrelinfo(root, outer_rel)->parent_relid;
+   List        *clauses_parent = get_actual_clauses(join_clauses);
+   List        *clauses_child = NIL;
+   ListCell    *lc;
+
+   foreach(lc, clauses_parent)
+   {
+       Node    *one_clause_child = (Node *) copyObject(lfirst(lc));
+
+       ChangeVarNodes(one_clause_child, parent_relid, outer_rel->relid, 0);
+       clauses_child = lappend(clauses_child, one_clause_child);
+   }
+
+   return make_restrictinfos_from_actual_clauses(root, clauses_child);
+}

Is ChangeVarNodes() the right routine to replace a var-node of the parent
relation with the relevant var-node of the child relation?
It may look sufficient; however, nobody can ensure that the varattno of
a child relation is identical to the parent relation's.
For example, which attribute number shall be assigned to 'z' here?
CREATE TABLE tbl_parent(x int);
CREATE TABLE tbl_child(y int) INHERITS(tbl_parent);
ALTER TABLE tbl_parent ADD COLUMN z int;
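To make the attribute-number hazard concrete, here is a small Python sketch (not PostgreSQL code; the dictionaries below merely model pg_attribute numbering for the three DDL statements above). It shows that keeping varattno unchanged, as a pure relid swap would, makes the parent's 'z' land on the child's 'y', while a name-based translation of the kind adjust_appendrel_attrs() performs through AppendRelInfo gets it right.

```python
# Attribute numbers produced by the DDL sequence above:
#   CREATE TABLE tbl_parent(x int);
#   CREATE TABLE tbl_child(y int) INHERITS(tbl_parent);
#   ALTER TABLE tbl_parent ADD COLUMN z int;
parent_attnos = {"x": 1, "z": 2}          # parent never had 'y'
child_attnos  = {"x": 1, "y": 2, "z": 3}  # 'z' is appended after local 'y'

def naive_translate(parent_attno):
    """ChangeVarNodes() only rewrites varno (the relid), keeping varattno."""
    return parent_attno  # varattno unchanged -> wrong for 'z'

# What adjust_appendrel_attrs() effectively does: map each parent column
# to the child column of the same name (AppendRelInfo's translated_vars).
translated_vars = {parent_attnos[name]: child_attnos[name]
                   for name in parent_attnos}

def proper_translate(parent_attno):
    return translated_vars[parent_attno]

# Parent 'z' is attno 2; child 'z' is attno 3, while child attno 2 is 'y'.
assert naive_translate(parent_attnos["z"]) == child_attnos["y"]   # the bug
assert proper_translate(parent_attnos["z"]) == child_attnos["z"]  # correct
```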

--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4230,8 +4230,14 @@ prepare_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
                /*
                 * Ignore child members unless they match the rel being
                 * sorted.
+                *
+                * If this is called from make_sort_from_pathkeys(),
+                * relids may be NULL. In this case, we must not ignore child
+                * members because inner/outer plan of pushed-down merge join is
+                * always child table.
                 */
-               if (em->em_is_child &&
+               if (relids != NULL &&
+                   em->em_is_child &&
                    !bms_equal(em->em_relids, relids))
                    continue;

It is a little hard to understand why this modification is needed.
Could you add a source code comment that explains the reason?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
Sent: Wednesday, October 21, 2015 8:07 PM
To: Kaigai Kouhei(海外 浩平); Kyotaro HORIGUCHI
Cc: pgsql-hackers@postgresql.org; Yanagisawa Hiroshi(柳澤 博)
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

Hello, KaiGai-san and Horiguchi-san.

I created a v2 patch. Please find it attached.
I believe this patch fixes most of the issues mentioned by
Horiguchi-san, except the naming.

In this v2 patch, the scan node that is originally the inner relation of
the Join node must be a SeqScan (or SampleScan). This limitation is
due to the implementation of try_join_pushdown(), which copies Path nodes
to attach new filtering conditions converted from CHECK() constraints.

It uses copyObject() for this purpose, so I would have to implement copy
functions for scan Path nodes like IndexPath, BitmapHeapPath, TidPath and so on.

By the way, let me introduce the performance of this feature.
Here are the results I tested in my environment.
These results were obtained by running "pushdown_test.v1.large.sql"
in an environment with "work_mem" set to "1536kB".
(This file is also attached to this mail.)

[Normal]
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1851.02..14638.11 rows=300004 width=20) (actual time=88.188..453.926 rows=299992 loops=1)
   Hash Cond: (check_test_div.id = inner_t.id)
   ->  Append  (cost=0.00..4911.03 rows=300004 width=20) (actual time=0.089..133.456 rows=300003 loops=1)
         ->  Seq Scan on check_test_div  (cost=0.00..0.00 rows=1 width=20) (actual time=0.003..0.003 rows=0 loops=1)
         ->  Seq Scan on check_test_div_0  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.085..40.741 rows=100001 loops=1)
         ->  Seq Scan on check_test_div_1  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.023..29.213 rows=100001 loops=1)
         ->  Seq Scan on check_test_div_2  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.021..28.592 rows=100001 loops=1)
   ->  Hash  (cost=866.01..866.01 rows=60001 width=8) (actual time=87.970..87.970 rows=60001 loops=1)
         Buckets: 32768  Batches: 2  Memory Usage: 1446kB
         ->  Seq Scan on inner_t  (cost=0.00..866.01 rows=60001 width=8) (actual time=0.030..39.133 rows=60001 loops=1)
 Planning time: 0.867 ms
 Execution time: 470.269 ms
(12 rows)

[With this feature]
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.01..10651.37 rows=300004 width=20) (actual time=55.548..377.615 rows=299992 loops=1)
   ->  Hash Join  (cost=0.01..1091.04 rows=1 width=20) (actual time=0.017..0.017 rows=0 loops=1)
         Hash Cond: (inner_t.id = check_test_div.id)
         ->  Seq Scan on inner_t  (cost=0.00..866.01 rows=60001 width=8) (never executed)
         ->  Hash  (cost=0.00..0.00 rows=1 width=20) (actual time=0.003..0.003 rows=0 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 8kB
               ->  Seq Scan on check_test_div  (cost=0.00..0.00 rows=1 width=20) (actual time=0.002..0.002 rows=0 loops=1)
   ->  Hash Join  (cost=1169.76..3186.78 rows=100001 width=20) (actual time=55.530..149.205 rows=100001 loops=1)
         Hash Cond: (check_test_div_0.id = inner_t.id)
         ->  Seq Scan on check_test_div_0  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.058..34.268 rows=100001 loops=1)
         ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual time=55.453..55.453 rows=20001 loops=1)
               Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 1038kB
               ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8) (actual time=0.031..43.590 rows=20001 loops=1)
                     Filter: ((id % 3) = 0)
                     Rows Removed by Filter: 40000
   ->  Hash Join  (cost=1169.76..3186.78 rows=100001 width=20) (actual time=27.942..97.582 rows=99996 loops=1)
         Hash Cond: (check_test_div_1.id = inner_t.id)
         ->  Seq Scan on check_test_div_1  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.030..25.514 rows=100001 loops=1)
         ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual time=27.890..27.890 rows=20000 loops=1)
               Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 1038kB
               ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8) (actual time=0.014..21.688 rows=20000 loops=1)
                     Filter: ((id % 3) = 1)
                     Rows Removed by Filter: 40001
   ->  Hash Join  (cost=1169.76..3186.78 rows=100001 width=20) (actual time=27.651..97.755 rows=99995 loops=1)
         Hash Cond: (check_test_div_2.id = inner_t.id)
         ->  Seq Scan on check_test_div_2  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.026..25.620 rows=100001 loops=1)
         ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual time=27.599..27.599 rows=20000 loops=1)
               Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 1038kB
               ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8) (actual time=0.017..21.307 rows=20000 loops=1)
                     Filter: ((id % 3) = 2)
                     Rows Removed by Filter: 40001
 Planning time: 1.876 ms
 Execution time: 394.007 ms
(33 rows)

The value of "Batches" is 2 on the Hash node in the normal case,
but it is 1 on all Hash nodes with this feature.

This means that the hash table is not split when this feature is used.

Therefore, PostgreSQL with this feature is faster than the normal one in this
case.
(470.269 ms @ normal vs 394.007 ms @ this feature)

I think this point is a large benefit of this feature.
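A crude back-of-envelope model reproduces the effect. This is only a sketch: the 50-byte per-tuple footprint is an assumption for illustration, and the loop is a simplification of, not a substitute for, PostgreSQL's actual batch-sizing logic in ExecChooseHashTableSize().

```python
WORK_MEM_KB = 1536         # work_mem from the test setup above
BYTES_PER_ROW = 50         # assumed per-tuple footprint (tuple + bucket
                           # overhead); an illustration, not a PG constant

def est_batches(nrows):
    """Double the batch count until one batch's hash table fits work_mem."""
    kb = nrows * BYTES_PER_ROW / 1024.0
    nbatch = 1
    while kb / nbatch > WORK_MEM_KB:
        nbatch *= 2
    return nbatch

# Whole inner relation hashed at once: spills to 2 batches.
assert est_batches(60001) == 2
# Roughly one third of the rows per pushed-down join: a single batch.
assert est_batches(20001) == 1
```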

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

-----Original Message-----
From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com]
Sent: Thursday, October 15, 2015 10:21 AM
To: Kondo Taiki(近藤 太樹); Kyotaro HORIGUCHI
Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] [Proposal] Table partition + join pushdown

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
Sent: Thursday, October 08, 2015 5:28 PM
To: Kyotaro HORIGUCHI
Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎);
pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

Hello, Horiguchi-san.

Thank you for your comment.

I got some warnings during compilation about unused variables and a wrong
argument type.

OK, I'll fix it.

I failed to have a query that this patch works on. Could you let me
have some specific example for this patch?

Please find attached.
And also make sure that setting of work_mem is '64kB' (not 64MB).

If work_mem is large enough to create the hash table for the
relation after appending, its cost may be better than the pushed-down
plan's cost, and then the planner will not choose the pushed-down plan this patch makes.
So, to make this patch work, the work_mem size must be smaller
than the hash table size for the relation after appending.

This patch needs more comments. Please put comment about not only
what it does but also the reason and other things for it.

OK, I'll put more comments in the code.
But it will take a long time, maybe...

People (including me) can help. Even if your English is not perfect,
it is important to write down the intention of the code.

-- about namings

Function and variable names need to be more appropriate; in other
words, they need to properly convey what they are.
The following are examples of such names.

Thank you for your suggestion.

I also think these names are not very good.
I'll try to make the names better, but it may take a long time...
Of course, I will use your suggestions as a reference.

"added_restrictlist"'s widely distributed as many function arguemnts
and JoinPathExtraData makes me feel dizzy..

"added_restrictinfo" will be deleted from almost functions other than
try_join_pushdown() in next (v2) patch because the place of filtering
using this info will be changed from Join node to Scan node and not
have to place it into other than try_join_pushdown().

This restrictinfo intends to filter out obviously unrelated rows in this join,
due to the check constraint on the other side of the join.
So, a correct but redundant name would be:
restrictlist_to_drop_unrelated_rows_because_of_check_constraint

How about 'restrictlist_by_constraint' instead?

In make_restrictinfos_from_check_constr, the function returns
modified constraint predicates corresponding to vars under hash-joinable
join clauses. I don't think expression_tree_mutator is suitable to
do that, since it could allow unwanted results when the constraint
predicates or join clauses are not simple OpExpr's.

Do you have any example of this situation?
I am trying to find the unwanted results you mentioned, but I haven't
found any at this time. I have a hunch that it could allow unwanted
results, because I have only considered very simple situations for
this function.

check_constraint_mutator builds the modified restrictlist by replacing Var
nodes only when the join clause is hash-joinable.
That implies an <expr> = <expr> form, so we can safely replace the
expression with the other side.

Of course, we still have cases we cannot replace expressions simply.
- If function (or function called by operators) has volatile attribute
(who use volatile function on CHECK constraint of partitioning?)
- If it is uncertain whether expression returns always same result.
(is it possible to contain SubLink in the constraint?)

I'd like to suggest using a white-list approach in this mutator routine.
It means that only immutable expression nodes are allowed in the modified
restrictlist.

Things to do is:

check_constraint_mutator(...)
{
if (node == NULL)
return NULL;
if (IsA(node, Var))
{
:
}
else if (node is not obviously immutable)
{
    /* prohibit the mutation if the expression contains an uncertain node */
    context->is_mutated = false;
}
return expression_tree_mutator(...)
}

Otherwise could you give me clear explanation on what it does?

This function transforms a CHECK() constraint into a filter expression by
the following procedure.
(1) Get the outer table's CHECK() constraint by using get_relation_constraints().
(2) Walk through the expression tree obtained in (1) by using expression_tree_mutator()
with check_constraint_mutator(), and change only the outer's Var nodes to
the inner's according to the join clause.

For example, when the CHECK() constraint of table A is "num % 4 = 0" and the
join clause between tables A and B is "A.num = B.data", then we can get
"B.data % 4 = 0" for filtering purposes.

This also accepts more complex join clauses like "A.num = B.data * 2",
in which case we get "(B.data * 2) % 4 = 0".
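The substitution described above can be sketched over a tiny expression tree. This is a hedged Python toy, not the actual expression_tree_mutator machinery: expressions are nested tuples, and the mutator replaces every occurrence of the outer Var with the other side of the equi-join clause.

```python
# Expressions as tuples: ("op", name, left, right), ("var", name), ("const", v)
def mutate(node, var_name, replacement):
    """Replace every Var named var_name with the replacement expression,
    mimicking what check_constraint_mutator does for an equi-join clause."""
    kind = node[0]
    if kind == "var":
        return replacement if node[1] == var_name else node
    if kind == "const":
        return node
    _, name, left, right = node
    return ("op", name,
            mutate(left, var_name, replacement),
            mutate(right, var_name, replacement))

# CHECK() constraint of table A: A.num % 4 = 0
check = ("op", "=",
         ("op", "%", ("var", "A.num"), ("const", 4)),
         ("const", 0))

# Join clause A.num = B.data * 2  ->  substitute (B.data * 2) for A.num
join_rhs = ("op", "*", ("var", "B.data"), ("const", 2))
filt = mutate(check, "A.num", join_rhs)

# Result is (B.data * 2) % 4 = 0, matching the example in the text.
assert filt == ("op", "=",
                ("op", "%",
                 ("op", "*", ("var", "B.data"), ("const", 2)),
                 ("const", 4)),
                ("const", 0))
```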

In procedure (2), to decide whether to use each join clause for
changing a Var node or not, I implemented check_constraint_mutator() to
judge whether the join clause is hash-joinable or not.

Actually, I want to judge whether the OpExpr at the top of the join
clause's expression tree means "=" or not, but I can't find how to do it.

If you know how to do it, please let me know.

Thanks,
--
NEC Business Creation Division / PG-Strom Project KaiGai Kohei
<kaigai@ak.jp.nec.com>

-----Original Message-----
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
Sent: Tuesday, October 06, 2015 8:35 PM
To: tai-kondo@yk.jp.nec.com
Cc: kaigai@ak.jp.nec.com; aki-iwaasa@vt.jp.nec.com;
pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

Hello.

I tried to read this and had some random comments on this.

-- general

I got some warnings during compilation about unused variables and a wrong argument type.

I failed to have a query that this patch works on. Could you let me
have some specific example for this patch?

This patch needs more comments. Please put comment about not only what
it does but also the reason and other things for it.

-- about namings

Function and variable names need to be more appropriate; in other
words, they need to properly convey what they are. The
following are examples of such names.

"added_restrictlist"'s widely distributed as many function arguemnts
and JoinPathExtraData makes me feel dizzy.. create_mergejoin_path
takes it as "filtering_clauses", which looks far better.

try_join_pushdown() is also a name with a much wider meaning. This
patch tries to move hash joins on an inheritance parent to under the append
paths. It could be generically called 'pushdown',
but it would better be called something like 'transform appended
hashjoin' or 'hashjoin distribution'. The latter would be better.
(The function name would be try_distribute_hashjoin in that
case.)

The name make_restrictinfos_from_check_constr() also tells me the wrong
thing. For example,
extract_constraints_for_hashjoin_distribution() would inform me about
what it returns.

-- about what make_restrictinfos_from_check_constr() does

In make_restrictinfos_from_check_constr, the function returns modified
constraint predicates corresponding to vars under hash-joinable join
clauses. I don't think expression_tree_mutator is suitable to do that,
since it could allow unwanted results when the constraint predicates or
join clauses are not simple OpExpr's.

Could you try more simple and straight-forward way to do that?
Otherwise could you give me clear explanation on what it does?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Taiki Kondo
tai-kondo@yk.jp.nec.com
In reply to: Kouhei Kaigai (#14)
2 attachment(s)
Re: [Proposal] Table partition + join pushdown

Hello, KaiGai-san.

Thank you for your reply, and sorry for the late response.

I created a v3 patch for this feature, and a v1 patch for regression tests.
Please find attached.

Reply for your comments is below.

Overall comments
----------------
* I think the enhancement in copyfuncs.c shall be in the separate
patch; it is more graceful manner. At this moment, here is less
than 20 Path delivered type definition. It is much easier works
than entire Plan node support as we did recently.
(How about other folk's opinion?)

I also would like to wait for other folks' opinions,
so I have not split this part out of the patch yet.

* Can you integrate the attached test cases as regression test?
It is more generic way, and allows people to detect problems
if relevant feature gets troubled in the future updates.

Ok, done. Please find attached.

* Naming of "join pushdown" is a bit misleading because other
component also uses this term, but different purpose.
I'd like to suggest try_pullup_append_across_join.
Any ideas from native English speaker?

Thank you for your suggestion.

I changed its name to "try_append_pullup_accross_join",
which matches the word order of the previous name.

However, this change is just temporary.
I also would like to wait for other folks' opinions
on the naming.

Patch review
------------

At try_join_pushdown:
+   /* When specified outer path is not an AppendPath, nothing to do here. */
+   if (!IsA(outer_rel->cheapest_total_path, AppendPath))
+   {
+       elog(DEBUG1, "Outer path is not an AppendPath. Do nothing.");
+       return;
+   }
It checks whether the cheapest_total_path is an AppendPath at the head
of this function. This ought to be a loop that walks the pathlist of
the RelOptInfo, because multiple path nodes might still be alive
without being the cheapest_total_path.

Ok, done.

+   switch (inner_rel->cheapest_total_path->pathtype)
+
Also, we can construct the new Append node if at least one of the
path nodes within inner_rel's pathlist is supported.

Done.
But this change creates a nested loop over inner_rel's pathlist
and outer_rel's pathlist, which means planning time increases further.

I think it is adequate to check only the cheapest_total_path,
because checking only the cheapest_total_path is how other
parts, like make_join_rel(), are implemented.

What is your (and also other people's) opinion?

+   if (list_length(inner_rel->ppilist) > 0)
+   {
+       elog(DEBUG1, "ParamPathInfo is already set in inner_rel. Can't pushdown.");
+       return;
+   }
+
You may need to explain why this feature uses ParamPathInfos here.
It seems to me a good hack to attach additional qualifiers to
the underlying inner scan node, even if it is not a direct child of
the inner relation.
However, people may have different opinions.

OK, I added a comment in the source.
Please see the attached patch.

+static List *
+convert_parent_joinclauses_to_child(PlannerInfo *root, List *join_clauses,
+                                   RelOptInfo *outer_rel) {
+   Index       parent_relid =
+                   find_childrel_appendrelinfo(root, outer_rel)->parent_relid;
+   List        *clauses_parent = get_actual_clauses(join_clauses);
+   List        *clauses_child = NIL;
+   ListCell    *lc;
+
+   foreach(lc, clauses_parent)
+   {
+       Node    *one_clause_child = (Node *) copyObject(lfirst(lc));
+
+       ChangeVarNodes(one_clause_child, parent_relid, outer_rel->relid, 0);
+       clauses_child = lappend(clauses_child, one_clause_child);
+   }
+
+   return make_restrictinfos_from_actual_clauses(root, clauses_child); 
+}

Is ChangeVarNodes() the right routine to replace Var nodes of the parent
relation with the corresponding Var nodes of the child relation?
It may look sufficient; however, nobody can ensure that the varattno of
the child relation is identical to the parent relation's.
For example, which attribute number shall be assigned on 'z' here?
CREATE TABLE tbl_parent(x int);
CREATE TABLE tbl_child(y int) INHERITS(tbl_parent);
ALTER TABLE tbl_parent ADD COLUMN z int;

You may well be right, so I agree with you.
I now use adjust_appendrel_attrs() instead of ChangeVarNodes()
for this purpose.

--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4230,8 +4230,14 @@ prepare_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
/*
* Ignore child members unless they match the rel being
* sorted.
+                *
+                * If this is called from make_sort_from_pathkeys(),
+                * relids may be NULL. In this case, we must not ignore child
+                * members because inner/outer plan of pushed-down merge join is
+                * always child table.
*/
-               if (em->em_is_child &&
+               if (relids != NULL &&
+                   em->em_is_child &&
!bms_equal(em->em_relids, relids))
continue;

It is a little bit hard to understand why this modification is needed.
Could you add a source code comment that focuses on the reason why?

OK, I added a comment in the source.
Please see the attached patch.

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

-----Original Message-----
From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com]
Sent: Tuesday, November 10, 2015 11:59 PM
To: Kondo Taiki(近藤 太樹); Kyotaro HORIGUCHI
Cc: pgsql-hackers@postgresql.org; Yanagisawa Hiroshi(柳澤 博)
Subject: RE: [HACKERS] [Proposal] Table partition + join pushdown

Hi, I put my comments towards the patch as follows.

Overall comments
----------------
* I think the enhancement in copyfuncs.c shall be in a separate
patch; it is a more graceful manner. At this moment, there are
fewer than 20 Path-derived type definitions, so it is much easier
work than the entire Plan node support we did recently.
(How about other folks' opinions?)

* Can you integrate the attached test cases as regression test?
It is more generic way, and allows people to detect problems
if relevant feature gets troubled in the future updates.

* Naming of "join pushdown" is a bit misleading because other
components also use this term for a different purpose.
I'd like to suggest try_pullup_append_across_join.
Any ideas from native English speaker?

Patch review
------------

At try_join_pushdown:
+   /* When specified outer path is not an AppendPath, nothing to do here. */
+   if (!IsA(outer_rel->cheapest_total_path, AppendPath))
+   {
+       elog(DEBUG1, "Outer path is not an AppendPath. Do nothing.");
+       return;
+   }
It checks whether the cheapest_total_path is an AppendPath at the head of this function. It ought to be a loop walking the pathlist of the RelOptInfo, because multiple path nodes might still be alive and yet not be the cheapest_total_path.
+   switch (inner_rel->cheapest_total_path->pathtype)
+
Also, we can construct the new Append node if at least one of the path nodes within inner_rel's pathlist is supported.
+   if (list_length(inner_rel->ppilist) > 0)
+   {
+       elog(DEBUG1, "ParamPathInfo is already set in inner_rel. Can't pushdown.");
+       return;
+   }
+
You may need to explain why this feature uses ParamPathInfos here.
It seems to me a good hack to attach additional qualifiers to the underlying inner scan node, even if it is not a direct child of the inner relation.
However, people may have different opinions.
+static List *
+convert_parent_joinclauses_to_child(PlannerInfo *root, List *join_clauses,
+                                   RelOptInfo *outer_rel) {
+   Index       parent_relid =
+                   find_childrel_appendrelinfo(root, outer_rel)->parent_relid;
+   List        *clauses_parent = get_actual_clauses(join_clauses);
+   List        *clauses_child = NIL;
+   ListCell    *lc;
+
+   foreach(lc, clauses_parent)
+   {
+       Node    *one_clause_child = (Node *) copyObject(lfirst(lc));
+
+       ChangeVarNodes(one_clause_child, parent_relid, outer_rel->relid, 0);
+       clauses_child = lappend(clauses_child, one_clause_child);
+   }
+
+   return make_restrictinfos_from_actual_clauses(root, clauses_child); 
+}

Is ChangeVarNodes() the right routine to replace Var nodes of the parent relation with the corresponding Var nodes of the child relation?
It may look sufficient; however, nobody can ensure that the varattno of the child relation is identical to the parent relation's.
For example, which attribute number shall be assigned on 'z' here?
CREATE TABLE tbl_parent(x int);
CREATE TABLE tbl_child(y int) INHERITS(tbl_parent);
ALTER TABLE tbl_parent ADD COLUMN z int;

--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4230,8 +4230,14 @@ prepare_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
                /*
                 * Ignore child members unless they match the rel being
                 * sorted.
+                *
+                * If this is called from make_sort_from_pathkeys(),
+                * relids may be NULL. In this case, we must not ignore child
+                * members because inner/outer plan of pushed-down merge join is
+                * always child table.
                 */
-               if (em->em_is_child &&
+               if (relids != NULL &&
+                   em->em_is_child &&
                    !bms_equal(em->em_relids, relids))
                    continue;

It is a little bit hard to understand why this modification is needed.
Could you add a source code comment that focuses on the reason why?

Thanks,
--
NEC Business Creation Division / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com>


-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
Sent: Wednesday, October 21, 2015 8:07 PM
To: Kaigai Kouhei(海外 浩平); Kyotaro HORIGUCHI
Cc: pgsql-hackers@postgresql.org; Yanagisawa Hiroshi(柳澤 博)
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

Hello, KaiGai-san and Horiguchi-san.

I created a v2 patch. Please find it attached.
I believe this patch fixes most of the issues mentioned by
Horiguchi-san, except the naming.

In this v2 patch, the scan node which is originally the inner relation
of the Join node must be a SeqScan (or SampleScan). This limitation is
due to the implementation of try_join_pushdown(), which copies Path
nodes to attach new filtering conditions converted from CHECK()
constraints.

It uses copyObject() for this purpose, so I must implement copy
functions for the scan Path nodes like IndexPath, BitmapHeapPath,
TidPath and so on.

By the way, let me introduce the performance of this feature.
Here are the results I got in my environment.
These results were obtained by running "pushdown_test.v1.large.sql"
in an environment with "work_mem" set to "1536kB".
(This file is also attached to this mail.)

[Normal]
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1851.02..14638.11 rows=300004 width=20) (actual time=88.188..453.926 rows=299992 loops=1)
   Hash Cond: (check_test_div.id = inner_t.id)
   ->  Append  (cost=0.00..4911.03 rows=300004 width=20) (actual time=0.089..133.456 rows=300003 loops=1)
         ->  Seq Scan on check_test_div  (cost=0.00..0.00 rows=1 width=20) (actual time=0.003..0.003 rows=0 loops=1)
         ->  Seq Scan on check_test_div_0  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.085..40.741 rows=100001 loops=1)
         ->  Seq Scan on check_test_div_1  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.023..29.213 rows=100001 loops=1)
         ->  Seq Scan on check_test_div_2  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.021..28.592 rows=100001 loops=1)
   ->  Hash  (cost=866.01..866.01 rows=60001 width=8) (actual time=87.970..87.970 rows=60001 loops=1)
         Buckets: 32768  Batches: 2  Memory Usage: 1446kB
         ->  Seq Scan on inner_t  (cost=0.00..866.01 rows=60001 width=8) (actual time=0.030..39.133 rows=60001 loops=1)
 Planning time: 0.867 ms
 Execution time: 470.269 ms
(12 rows)

[With this feature]
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.01..10651.37 rows=300004 width=20) (actual time=55.548..377.615 rows=299992 loops=1)
   ->  Hash Join  (cost=0.01..1091.04 rows=1 width=20) (actual time=0.017..0.017 rows=0 loops=1)
         Hash Cond: (inner_t.id = check_test_div.id)
         ->  Seq Scan on inner_t  (cost=0.00..866.01 rows=60001 width=8) (never executed)
         ->  Hash  (cost=0.00..0.00 rows=1 width=20) (actual time=0.003..0.003 rows=0 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 8kB
               ->  Seq Scan on check_test_div  (cost=0.00..0.00 rows=1 width=20) (actual time=0.002..0.002 rows=0 loops=1)
   ->  Hash Join  (cost=1169.76..3186.78 rows=100001 width=20) (actual time=55.530..149.205 rows=100001 loops=1)
         Hash Cond: (check_test_div_0.id = inner_t.id)
         ->  Seq Scan on check_test_div_0  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.058..34.268 rows=100001 loops=1)
         ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual time=55.453..55.453 rows=20001 loops=1)
               Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 1038kB
               ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8) (actual time=0.031..43.590 rows=20001 loops=1)
                     Filter: ((id % 3) = 0)
                     Rows Removed by Filter: 40000
   ->  Hash Join  (cost=1169.76..3186.78 rows=100001 width=20) (actual time=27.942..97.582 rows=99996 loops=1)
         Hash Cond: (check_test_div_1.id = inner_t.id)
         ->  Seq Scan on check_test_div_1  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.030..25.514 rows=100001 loops=1)
         ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual time=27.890..27.890 rows=20000 loops=1)
               Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 1038kB
               ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8) (actual time=0.014..21.688 rows=20000 loops=1)
                     Filter: ((id % 3) = 1)
                     Rows Removed by Filter: 40001
   ->  Hash Join  (cost=1169.76..3186.78 rows=100001 width=20) (actual time=27.651..97.755 rows=99995 loops=1)
         Hash Cond: (check_test_div_2.id = inner_t.id)
         ->  Seq Scan on check_test_div_2  (cost=0.00..1637.01 rows=100001 width=20) (actual time=0.026..25.620 rows=100001 loops=1)
         ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual time=27.599..27.599 rows=20000 loops=1)
               Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 1038kB
               ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8) (actual time=0.017..21.307 rows=20000 loops=1)
                     Filter: ((id % 3) = 2)
                     Rows Removed by Filter: 40001
 Planning time: 1.876 ms
 Execution time: 394.007 ms
(33 rows)

The value of "Batches" is 2 on the Hash node in the normal plan, but it
is 1 on all Hash nodes with this feature.

This means that the hash table is not split when this feature is used.

Therefore, PostgreSQL with this feature is faster than the normal one
in this case.
(470.269 ms @ normal vs 394.007 ms @ this feature)

I think this point is a large benefit of this feature.

Best regards,
--
Taiki Kondo

NEC Solution Innovators, Ltd.

-----Original Message-----
From: Kaigai Kouhei(海外 浩平) [mailto:kaigai@ak.jp.nec.com]
Sent: Thursday, October 15, 2015 10:21 AM
To: Kondo Taiki(近藤 太樹); Kyotaro HORIGUCHI
Cc: Iwaasa Akio(岩浅 晃郎); pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] [Proposal] Table partition + join pushdown

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Taiki Kondo
Sent: Thursday, October 08, 2015 5:28 PM
To: Kyotaro HORIGUCHI
Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎);
pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

Hello, Horiguchi-san.

Thank you for your comment.

I got some warnings on compilation about unused variables and a wrong
argument type.

OK, I'll fix it.

I failed to have a query that this patch works on. Could you let
me have some specific example for this patch?

Please find attached.
And also make sure that setting of work_mem is '64kB' (not 64MB).

If work_mem is large enough to hold the hash table for the relation
after appending, its cost may be lower than the pushed-down plan's
cost, and then the planner will not choose the pushed-down plan this
patch makes. So, to see this patch working, work_mem must be smaller
than the hash table size for the relation after appending.

This patch needs more comments. Please put comment about not only
what it does but also the reason and other things for it.

OK, I'll put more comments in the code.
But it will take a long time, maybe...

People (including me) can help. Even if your English capability is
not enough, it is important to record the intention of the code.

-- about namings

Names for functions and variables need to be more appropriate;
in other words, they need to properly convey what they are.
The following are examples of such names.

Thank you for your suggestion.

I also think these names are not very good.
I'll try to make the names better, but it may take a long time...
Of course, I will use your suggestion as reference.

"added_restrictlist", widely distributed as many function
arguments and in JoinPathExtraData, makes me feel dizzy...

"added_restrictinfo" will be removed from almost all functions other
than try_join_pushdown() in the next (v2) patch, because the
filtering that uses this info will move from the Join node to the
Scan node, so there is no need to pass it anywhere other than
try_join_pushdown().

This restrictinfo intends to filter out obviously unrelated rows in
this join, due to the check constraint on the other side of the join.
So, a correct but redundant name is:
restrictlist_to_drop_unrelated_rows_because_of_check_constraint

How about 'restrictlist_by_constraint' instead?

In make_restrictinfos_from_check_constr, the function returns
modified constraint predicates corresponding to Vars under
hash-joinable join clauses. I don't think expression_tree_mutator
is suitable to do that, since it could allow unwanted results when
constraint predicates or join clauses are not simple OpExprs.

Do you have any example of this situation?
I am trying to find the unwanted results you mentioned, but I have
not found any so far. I have a hunch that it could allow unwanted
results, because I have thought only about very simple situations
for this function.

check_constraint_mutator builds the modified restrictlist by replacing
Var nodes only when the join clause is hash-joinable.
That implies an <expr> = <expr> form, so we can safely replace the
expression with the other side.

Of course, we still have cases where we cannot replace expressions simply.
- If a function (or a function called by an operator) has the volatile
attribute (who uses a volatile function in a CHECK constraint for
partitioning?)
- If it is uncertain whether the expression always returns the same result.
(is it possible for the constraint to contain a SubLink?)

I'd like to suggest a white-list approach in this mutator routine.
It means that only immutable expression nodes are allowed into
the modified restrictlist.

Things to do are:

check_constraint_mutator(...)
{
    if (node == NULL)
        return NULL;
    if (IsA(node, Var))
    {
        :
    }
    else if (node is not obviously immutable)
    {
        /* prohibit mutation if the expression contains an uncertain node */
        context->is_mutated = false;
    }
    return expression_tree_mutator(...)
}

Otherwise, could you give me a clear explanation of what it does?

This function transfers CHECK() constraint to filter expression by
following procedures.
(1) Get outer table's CHECK() constraint by using get_relation_constraints().
(2) Walk through expression tree got in (1) by using expression_tree_mutator()
with check_constraint_mutator() and change only outer's Var node to
inner's one according to join clause.

For example, when CHECK() constraint of table A is "num % 4 = 0" and
join clause between table A and B is "A.num = B.data", then we can
get "B.data % 4 = 0" for filtering purpose.

This also accepts more complex join clause like "A.num = B.data *
2", then we can get "(B.data * 2) % 4 = 0".

In procedure (2), to decide whether each join clause should be used
for changing the Var node, I implemented check_constraint_mutator()
to judge whether the join clause is hash-joinable or not.

Actually, I want to judge whether the OpExpr at the top of the join
clause's expression tree means "=" or not, but I can't find out how
to do it.

If you know how to do it, please let me know.

Thanks,
--
NEC Business Creation Division / PG-Strom Project KaiGai Kohei
<kaigai@ak.jp.nec.com>

-----Original Message-----
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
Sent: Tuesday, October 06, 2015 8:35 PM
To: tai-kondo@yk.jp.nec.com
Cc: kaigai@ak.jp.nec.com; aki-iwaasa@vt.jp.nec.com;
pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [Proposal] Table partition + join pushdown

Hello.

I tried to read this and had some random comments on this.

-- general

I got some warnings on compilation about unused variables and a wrong argument type.

I failed to have a query that this patch works on. Could you let me
have some specific example for this patch?

This patch needs more comments. Please put comment about not only
what it does but also the reason and other things for it.

-- about namings

Names for functions and variables need to be more appropriate;
in other words, they need to properly convey what they are.
The following are examples of such names.

"added_restrictlist", widely distributed as many function arguments
and in JoinPathExtraData, makes me feel dizzy... create_mergejoin_path
takes it as "filtering_clauses", which looks far better.

try_join_pushdown() is also a name with a much wider meaning. This
patch tries to move hashjoins on an inheritance parent to under the
append paths. It could be generically called 'pushdown',
but it would be better to call it something like 'transform appended
hashjoin' or 'hashjoin distribution'. The latter would be better.
(The function name would be try_distribute_hashjoin in that
case.)

The name make_restrictinfos_from_check_constr() also tells me the
wrong thing. For example,
extract_constraints_for_hashjoin_distribution() would inform me
about what it returns.

-- about what make_restrictinfos_from_check_constr() does

In make_restrictinfos_from_check_constr, the function returns
modified constraint predicates corresponding to Vars under
hash-joinable join clauses. I don't think expression_tree_mutator
is suitable to do that, since it could allow unwanted results
when constraint predicates or join clauses are not simple OpExprs.

Could you try a simpler and more straightforward way to do that?
Otherwise, could you give me a clear explanation of what it does?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

append_pullup.main_v3.patch (application/octet-stream)
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 26264cb..27f7348 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1964,12 +1964,78 @@ _copyOnConflictExpr(const OnConflictExpr *from)
 /* ****************************************************************
  *						relation.h copy functions
  *
- * We don't support copying RelOptInfo, IndexOptInfo, or Path nodes.
+ * We don't support copying RelOptInfo or IndexOptInfo nodes, and only
+ * some Path nodes can be copied.
  * There are some subsidiary structs that are useful to copy, though.
  * ****************************************************************
  */
 
 /*
+ * CopyPathFields
+ */
+static void
+CopyPathFields(const Path *from, Path *newnode)
+{
+	COPY_SCALAR_FIELD(pathtype);
+
+	/*
+	 * We use COPY_SCALAR_FIELD() for parent instead of COPY_NODE_FIELD()
+	 * because the parent RelOptInfo contains the Path this node was made
+	 * from, so copying it as a node would recurse infinitely.
+	 */
+	COPY_SCALAR_FIELD(parent);
+
+	COPY_SCALAR_FIELD(param_info);
+
+	COPY_SCALAR_FIELD(rows);
+	COPY_SCALAR_FIELD(startup_cost);
+	COPY_SCALAR_FIELD(total_cost);
+
+	COPY_NODE_FIELD(pathkeys);
+}
+
+/*
+ * _copyPath
+ */
+static Path *
+_copyPath(const Path *from)
+{
+	Path *newnode = makeNode(Path);
+
+	CopyPathFields(from, newnode);
+
+	return newnode;
+}
+
+/*
+ * _copyIndexPath
+ * XXX Need to make copy function for IndexOptInfo, etc.
+ */
+static IndexPath *
+_copyIndexPath(const IndexPath *from)
+{
+	IndexPath *newnode = makeNode(IndexPath);
+
+	CopyPathFields(&from->path, &newnode->path);
+
+	COPY_NODE_FIELD(indexinfo);
+	COPY_NODE_FIELD(indexclauses);
+	COPY_NODE_FIELD(indexquals);
+	COPY_NODE_FIELD(indexqualcols);
+	COPY_NODE_FIELD(indexorderbys);
+	COPY_NODE_FIELD(indexorderbycols);
+	COPY_SCALAR_FIELD(indexscandir);
+	COPY_SCALAR_FIELD(indextotalcost);
+	COPY_SCALAR_FIELD(indexselectivity);
+
+	return newnode;
+}
+
+/*
+ * XXX Need to make copy function for BitmapHeapPath, TidPath
+ * and GatherPath.
+ */
+
+/*
  * _copyPathKey
  */
 static PathKey *
@@ -4507,6 +4573,12 @@ copyObject(const void *from)
 			/*
 			 * RELATION NODES
 			 */
+		case T_Path:
+			retval = _copyPath(from);
+			break;
+		case T_IndexPath:
+			retval = _copyIndexPath(from);
+			break;
 		case T_PathKey:
 			retval = _copyPathKey(from);
 			break;
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index a35c881..6dec33c 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -18,9 +18,22 @@
 
 #include "executor/executor.h"
 #include "foreign/fdwapi.h"
+#include "nodes/nodeFuncs.h"
+#include "nodes/nodes.h"
+#include "optimizer/clauses.h"
 #include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "optimizer/plancat.h"
+#include "optimizer/prep.h"
+#include "optimizer/restrictinfo.h"
+#include "utils/lsyscache.h"
+
+typedef struct
+{
+	List	*joininfo;
+	bool	 is_substituted;
+} substitution_node_context;
 
 /* Hook for plugins to get control in add_paths_to_joinrel() */
 set_join_pathlist_hook_type set_join_pathlist_hook = NULL;
@@ -45,6 +58,11 @@ static List *select_mergejoin_clauses(PlannerInfo *root,
 						 JoinType jointype,
 						 bool *mergejoin_allowed);
 
+static void try_append_pullup_across_join(PlannerInfo *root,
+						  RelOptInfo *joinrel, RelOptInfo *outer_rel,
+						  RelOptInfo *inner_rel,
+						  List *restrictlist);
+
 
 /*
  * add_paths_to_joinrel
@@ -82,6 +100,18 @@ add_paths_to_joinrel(PlannerInfo *root,
 	bool		mergejoin_allowed = true;
 	ListCell   *lc;
 
+	/*
+	 * Try to pull-up Append across Join
+	 */
+	if (!IS_OUTER_JOIN(jointype))
+	{
+		try_append_pullup_across_join(root,
+									  joinrel,
+									  outerrel,
+									  innerrel,
+									  restrictlist);
+	}
+
 	extra.restrictlist = restrictlist;
 	extra.mergeclause_list = NIL;
 	extra.sjinfo = sjinfo;
@@ -1474,3 +1504,616 @@ select_mergejoin_clauses(PlannerInfo *root,
 
 	return result_list;
 }
+
+/*
+ * Try to substitute Var node according to join conditions.
+ * This process is from following steps.
+ *
+ * 1. Try to find whether Var node matches to left/right Var node of
+ *    one join condition.
+ * 2. If found, replace Var node with the opposite expression node of
+ *    the join condition.
+ *
+ * For example, let's assume that we have following expression and
+ * join condition.
+ * Expression       : A.num % 4 = 1
+ * Join condition   : A.num = B.data + 2
+ * In this case, we can get following expression.
+ *    (B.data + 2) % 4 = 1
+ */
+static Node *
+substitute_node_with_join_cond(Node *node, substitution_node_context *context)
+{
+	/* A previous substitution already failed; just copy the rest. */
+	if (!context->is_substituted)
+		return (Node *) copyObject(node);
+
+	if (node == NULL)
+		return NULL;
+
+	if (IsA(node, Var))
+	{
+		List		*join_cond = context->joininfo;
+		ListCell	*lc;
+
+		Assert(list_length(join_cond) > 0);
+
+		foreach (lc, join_cond)
+		{
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+			Expr *expr = rinfo->clause;
+
+			/*
+			 * Make sure whether OpExpr of Join clause means "=".
+			 */
+			if (!rinfo->can_join ||
+				!IsA(expr, OpExpr) ||
+				!op_hashjoinable(((OpExpr *) expr)->opno,
+								exprType(get_leftop(expr))))
+				continue;
+
+			if (equal(get_leftop(expr), node))
+			{
+				/*
+				 * This node is equal to LEFT node of join condition,
+				 * thus will be replaced with RIGHT clause.
+				 */
+				return (Node *) copyObject(get_rightop(expr));
+			}
+			else
+			if (equal(get_rightop(expr), node))
+			{
+				/*
+				 * This node is equal to RIGHT node of join condition,
+				 * thus will be replaced with LEFT clause.
+				 */
+				return (Node *) copyObject(get_leftop(expr));
+			}
+		}
+
+		/* Unfortunately, substitution failed. */
+		context->is_substituted = false;
+		return (Node *) copyObject(node);
+	}
+
+	return expression_tree_mutator(node, substitute_node_with_join_cond, context);
+}
+
+/*
+ * Create RestrictInfo_List from CHECK() constraints.
+ *
+ * This function creates list of RestrictInfo from CHECK() constraints
+ * according to expression of join clause.
+ *
+ * For example, let's assume that we have following CHECK() constraints
+ * for table A and join clause between table A and B.
+ * CHECK of table A      : 0 <= num AND num <= 100
+ * JOIN CLAUSE           : A.num = B.data
+ * Under these conditions, we can derive the following by substitution.
+ *    0 <= B.data AND B.data <= 100
+ *
+ * We can use these restrictions to reduce the result rows.
+ * This means that we can make the Sort in MergeJoin faster by reducing
+ * rows, and also that we can make the hash table in HashJoin small
+ * enough to fit smaller work_mem environments.
+ */
+static List *
+create_rinfo_from_check_constr(PlannerInfo *root, List *joininfo,
+									 RelOptInfo *outer_rel, bool *succeed)
+{
+	List			*result = NIL;
+	RangeTblEntry	*childRTE = root->simple_rte_array[outer_rel->relid];
+	List			*check_constr =
+						get_relation_constraints(root, childRTE->relid,
+													outer_rel, false);
+	ListCell		*lc;
+	substitution_node_context	context;
+
+	if (list_length(check_constr) <= 0)
+	{
+		*succeed = true;
+		return NIL;
+	}
+
+	context.joininfo = joininfo;
+	context.is_substituted = true;
+
+	/*
+	 * Try to convert CHECK() constraints to filter expressions.
+	 */
+	foreach(lc, check_constr)
+	{
+		Node *substituted =
+				expression_tree_mutator((Node *) lfirst(lc),
+										substitute_node_with_join_cond,
+										(void *) &context);
+
+		if (!context.is_substituted)
+		{
+			*succeed = false;
+			list_free_deep(check_constr);
+			return NIL;
+		}
+		result = lappend(result, substituted);
+	}
+
+	Assert(list_length(check_constr) == list_length(result));
+	list_free_deep(check_constr);
+
+	return make_restrictinfos_from_actual_clauses(root, result);
+}
+
+/*
+ * Convert parent's join clauses to child's.
+ */
+static List *
+convert_parent_joinclauses_to_child(PlannerInfo *root, List *join_clauses,
+									RelOptInfo *outer_rel)
+{
+	AppendRelInfo	*appinfo = find_childrel_appendrelinfo(root, outer_rel);
+	List			*clauses_parent = get_actual_clauses(join_clauses);
+	List			*clauses_child = NIL;
+	ListCell		*lc;
+
+	foreach(lc, clauses_parent)
+	{
+		Node	*one_clause_child =
+					adjust_appendrel_attrs(root, lfirst(lc), appinfo);
+		clauses_child = lappend(clauses_child, one_clause_child);
+	}
+
+	return make_restrictinfos_from_actual_clauses(root, clauses_child);
+}
+
+static inline List *
+extract_join_clauses(List *restrictlist, RelOptInfo *outer_prel,
+						RelOptInfo *inner_rel)
+{
+	List		*result = NIL;
+	ListCell	*lc;
+
+	foreach (lc, restrictlist)
+	{
+		RestrictInfo	*rinfo = (RestrictInfo *) lfirst(lc);
+
+		if (clause_sides_match_join(rinfo, outer_prel, inner_rel))
+			result = lappend(result, rinfo);
+	}
+
+	return result;
+}
+
+/*
+ * Copy path node for try_append_pullup_across_join()
+ *
+ * This includes following steps.
+ * (a) Prepare ParamPathInfo for RestrictInfos by CHECK constraints.
+ *     (See comment below.)
+ * (b) Copy path node and specify ParamPathInfo node made at (a) to it.
+ * (c) Re-calculate costs for path node copied at (b).
+ *
+ * NOTE : "nworkers" argument is used for the (Parallel)SeqScan node under
+ * the Gather node, which calls this function recursively. Therefore,
+ * "nworkers" argument should be 0, except for recursive call for
+ * the Gather node.
+ */
+static Path *
+copy_inner_path_for_append_pullup(PlannerInfo *root,
+		Path *orig_inner_path, RelOptInfo *inner_rel,
+		List *restrictlist_by_check_constr, int nworkers)
+{
+	/*
+	 * Prepare ParamPathInfo for RestrictInfos by CHECK constraints.
+	 *
+	 * To attach additional restrictions to the inner path,
+	 * we attach ParamPathInfo not to the RelOptInfo but only to the Path.
+	 *
+	 * PPI is generally used for parameterizing a Scan node under a Join
+	 * node, for example; the purpose of that usage is to extract rows
+	 * which satisfy the join conditions with one side fixed.
+	 *
+	 * In this function, PPI is used for attaching additional restrictions
+	 * to the inner path. Join conditions and converted CHECK() constraints
+	 * differ, but both simply filter out rows, so it seems good to use
+	 * PPI for this purpose as well.
+	 *
+	 */
+	ParamPathInfo	*newppi = makeNode(ParamPathInfo);
+	Path			*alter_inner_path;
+
+	newppi->ppi_req_outer = NULL;
+	newppi->ppi_rows =
+			get_parameterized_baserel_size(root,
+											inner_rel,
+											restrictlist_by_check_constr);
+	newppi->ppi_clauses = restrictlist_by_check_constr;
+
+	/* Copy Path of inner relation, and specify newppi to it. */
+	alter_inner_path = copyObject(orig_inner_path);
+	alter_inner_path->param_info = newppi;
+
+	/* Re-calculate costs of alter_inner_path */
+	switch (orig_inner_path->pathtype)
+	{
+	case T_SeqScan :
+		cost_seqscan(alter_inner_path, root, inner_rel, newppi, nworkers);
+		break;
+	case T_SampleScan :
+		cost_samplescan(alter_inner_path, root, inner_rel, newppi);
+		break;
+	case T_IndexScan :
+	case T_IndexOnlyScan :
+		{
+			IndexPath *ipath = (IndexPath *) alter_inner_path;
+
+			cost_index(ipath, root, 1.0);
+		}
+		break;
+	case T_BitmapHeapScan :
+		{
+			BitmapHeapPath *bpath =
+					(BitmapHeapPath *) alter_inner_path;
+
+			cost_bitmap_heap_scan(&bpath->path, root, inner_rel,
+					newppi, bpath->bitmapqual, 1.0);
+		}
+		break;
+	case T_TidScan :
+		{
+			TidPath *tpath = (TidPath *) alter_inner_path;
+
+			cost_tidscan(&tpath->path, root, inner_rel,
+					tpath->tidquals, newppi);
+		}
+		break;
+	case T_Gather :
+		{
+			GatherPath	*orig_gpath = (GatherPath *) orig_inner_path;
+			GatherPath	*alter_gpath = (GatherPath *) alter_inner_path;
+
+			Path	*alter_sub_path =
+					copy_inner_path_for_append_pullup(root,
+														orig_gpath->subpath,
+														inner_rel,
+														restrictlist_by_check_constr,
+														orig_gpath->num_workers);
+
+			alter_gpath->subpath = alter_sub_path;
+
+			cost_gather(alter_gpath, root, inner_rel, newppi);
+		}
+		break;
+	default:
+		Assert(false);
+		break;
+	}
+
+	return alter_inner_path;
+}
+
+/*
+ * try_append_pullup_across_join
+ *
+ * When the outer path of a JOIN is an AppendPath, we can rewrite the
+ * path-tree by relocating the JoinPath across the AppendPath while
+ * generating equivalent results, as in the diagram below.
+ * This adjustment yields performance benefits when the relations
+ * scanned by the sub-plans of the Append node have CHECK() constraints,
+ * as is typical for a partitioned table.
+ *
+ * In case of an INNER JOIN with an equivalence join condition, like
+ * A = B, we can exclude the inner rows that are obviously unreferenced,
+ * if the outer side has CHECK() constraints that contain the join keys.
+ * The CHECK() constraint ensures that all rows within the outer relation
+ * satisfy the condition; in other words, any inner row that does not
+ * satisfy the condition (after adjustment using the equivalence of the
+ * join keys) can never match any outer row.
+ *
+ * Once we can reduce the number of inner rows, there are two beneficial
+ * scenarios:
+ * 1. HashJoin may avoid splitting the hash table even if preloading the
+ *    entire inner relation would exceed work_mem.
+ * 2. MergeJoin may get away with a smaller Sort; because quick-sort is
+ *    an O(N log N) problem, reducing the rows to be sorted on both
+ *    sides cuts CPU cost more than linearly.
+ *
+ * [BEFORE]
+ * JoinPath ... (parent.X = inner.Y)
+ *  -> AppendPath on parent
+ *    -> ScanPath on child_1 ... CHECK(hash(X) % 3 = 0)
+ *    -> ScanPath on child_2 ... CHECK(hash(X) % 3 = 1)
+ *    -> ScanPath on child_3 ... CHECK(hash(X) % 3 = 2)
+ *  -> ScanPath on inner
+ *
+ * [AFTER]
+ * AppendPath
+ *  -> JoinPath ... (child_1.X = inner.Y)
+ *    -> ScanPath on child_1 ... CHECK(hash(X) % 3 = 0)
+ *    -> ScanPath on inner ... filter (hash(Y) % 3 = 0)
+ *  -> JoinPath ... (child_2.X = inner.Y)
+ *    -> ScanPath on child_2 ... CHECK(hash(X) % 3 = 1)
+ *    -> ScanPath on inner ... filter (hash(Y) % 3 = 1)
+ *  -> JoinPath ... (child_3.X = inner.Y)
+ *    -> ScanPath on child_3 ... CHECK(hash(X) % 3 = 2)
+ *    -> ScanPath on inner ... filter (hash(Y) % 3 = 2)
+ *
+ * The point to note is the filter condition attached to each child
+ * relation's scan: it is the clause of the CHECK() constraint, with X
+ * replaced by Y using the equivalence join condition.
+ */
+static void
+try_append_pullup_across_join(PlannerInfo *root,
+				  RelOptInfo *joinrel, RelOptInfo *outer_rel,
+				  RelOptInfo *inner_rel,
+				  List *restrictlist)
+{
+	AppendPath	*outer_path;
+	ListCell	*lc_subpath;
+	ListCell	*lc_outer_path, *lc_inner_path;
+	List		*joinclauses_parent;
+	List		*alter_append_subpaths = NIL;
+	int			num_pathlist_join = list_length(joinrel->pathlist);
+
+	if (outer_rel->rtekind != RTE_RELATION)
+	{
+		elog(DEBUG1, "Outer Relation is not for table scan. Give up.");
+		return;
+	}
+
+	/*
+	 * Extract join clauses for converting CHECK() constraints.
+	 * Converting the constraints does not clobber this list,
+	 * so we need to build it only once.
+	 */
+	joinclauses_parent = extract_join_clauses(restrictlist, outer_rel, inner_rel);
+	if (list_length(joinclauses_parent) <= 0)
+	{
+		elog(DEBUG1, "No join clauses specified. Give up.");
+		return;
+	}
+
+	/*
+	 * We use a ParamPathInfo to attach the additional RestrictInfos
+	 * created from CHECK constraints to the inner relation. Therefore,
+	 * we can NOT perform append pull-up when a PPI has already been
+	 * specified for the inner relation.
+	 */
+	if (list_length(inner_rel->ppilist) > 0)
+	{
+		elog(DEBUG1, "ParamPathInfo is already set in inner_rel. Can't pull-up.");
+		return;
+	}
+
+	foreach(lc_outer_path, outer_rel->pathlist)
+	{
+		/* When the specified outer path is not an AppendPath, there is nothing to do. */
+		if (!IsA(lfirst(lc_outer_path), AppendPath))
+		{
+			elog(DEBUG1, "Outer path is not an AppendPath. Do nothing.");
+			continue;
+		}
+
+		outer_path = (AppendPath *) lfirst(lc_outer_path);
+
+		foreach(lc_inner_path, inner_rel->pathlist)
+		{
+			switch (((Path *) lfirst(lc_inner_path))->pathtype)
+			{
+			case T_SeqScan :
+			case T_SampleScan :
+			case T_IndexScan :
+			case T_IndexOnlyScan :
+			case T_BitmapHeapScan :
+			case T_TidScan :
+			case T_Gather :
+				/* These types are supported. Pass through. */
+				break;
+			default :
+				{
+					elog(DEBUG1, "Type of Inner path is not supported yet."
+								" Give up.");
+					continue;
+				}
+			}
+
+			/*
+			 * Make new joinrel between each of outer path's sub-paths and
+			 * inner path.
+			 */
+			foreach(lc_subpath, outer_path->subpaths)
+			{
+				RelOptInfo	*orig_outer_sub_rel =
+						((Path *) lfirst(lc_subpath))->parent;
+				RelOptInfo	*alter_outer_sub_rel;
+				Path		*alter_inner_path = NULL;
+				List		*joinclauses_child;
+				List		*restrictlist_by_check_constr;
+				bool		is_valid;
+				List		**join_rel_level;
+
+				ListCell	*parentvars, *childvars;
+
+				Assert(!IS_DUMMY_REL(orig_outer_sub_rel));
+
+				/*
+				 * The join clauses reference the parent's relid,
+				 * so we must convert them to the child's.
+				 */
+				joinclauses_child =
+						convert_parent_joinclauses_to_child(root,
+													joinclauses_parent,
+													orig_outer_sub_rel);
+
+				/*
+				 * Make a RestrictInfo list from the CHECK() constraints of
+				 * the outer table.  "is_valid" indicates whether building
+				 * the list succeeded.
+				 */
+				restrictlist_by_check_constr =
+						create_rinfo_from_check_constr(root, joinclauses_child,
+													orig_outer_sub_rel, &is_valid);
+
+				if (!is_valid)
+				{
+					elog(DEBUG1, "Join clause doesn't match with CHECK() constraint. "
+									"Can't pull-up.");
+					list_free_deep(alter_append_subpaths);
+					list_free(joinclauses_parent);
+					return;
+				}
+
+				if (list_length(restrictlist_by_check_constr) > 0)
+				{
+					/* Copy the inner path, attaching the derived restrictions via a new PPI. */
+					alter_inner_path = copy_inner_path_for_append_pullup(root,
+													(Path *) lfirst(lc_inner_path),
+													inner_rel,
+													restrictlist_by_check_constr,
+													0);
+
+					/*
+					 * Append this path to the pathlist temporarily.
+					 * It will be removed after returning from make_join_rel().
+					 */
+					inner_rel->pathlist =
+							lappend(inner_rel->pathlist, alter_inner_path);
+					set_cheapest(inner_rel);
+				}
+
+				/*
+				 * Add relids that are marked as needed in the parent's
+				 * attr_needed[] but not in the child's to the child's
+				 * attr_needed[].
+				 *
+				 * The attr_needed[] fields of all RelOptInfos under the
+				 * Append node are originally empty sets, so build_rel_tlist()
+				 * would produce an unintended target list for the new
+				 * joinrel: because bms_nonempty_difference() always returns
+				 * false for the outer relation, no targets would be
+				 * enumerated for it.
+				 *
+				 * We compute the really needed relids from the parent
+				 * RelOptInfo and add them to the child's attr_needed[] to
+				 * get the intended target list for the new joinrel.
+				 *
+				 * This behavior should be harmless for considering other
+				 * paths, so we don't remove these relids from the child
+				 * after processing the append pull-up.
+				 */
+				forboth(parentvars, outer_rel->reltargetlist,
+						childvars, orig_outer_sub_rel->reltargetlist)
+				{
+					Var		*parentvar = (Var *) lfirst(parentvars);
+					Var		*childvar = (Var *) lfirst(childvars);
+					int		p_ndx;
+					Relids	required_relids;
+
+					if (!IsA(parentvar, Var) || !IsA(childvar, Var))
+						continue;
+
+					Assert(find_base_rel(root, parentvar->varno) == outer_rel);
+					p_ndx = parentvar->varattno - outer_rel->min_attr;
+
+					required_relids = bms_del_members(
+							bms_copy(outer_rel->attr_needed[p_ndx]),
+							joinrel->relids);
+
+					if (!bms_is_empty(required_relids))
+					{
+						RelOptInfo	*baserel =
+								find_base_rel(root, childvar->varno);
+						int			c_ndx =
+								childvar->varattno - baserel->min_attr;
+
+						baserel->attr_needed[c_ndx] = bms_add_members(
+								baserel->attr_needed[c_ndx],
+								required_relids);
+					}
+				}
+
+				/*
+				 * NOTE: root->join_rel_level is used to track candidate join
+				 * relations for each level; these relations are later
+				 * consolidated into one relation.
+				 * (See the comment in standard_join_search.)
+				 *
+				 * Even though we construct RelOptInfos for the child
+				 * relations of the Append node, those relations must not
+				 * appear as join candidates in later stages.  So we save
+				 * the list aside while make_join_rel() runs for the child
+				 * relations.
+				 */
+				join_rel_level = root->join_rel_level;
+				root->join_rel_level = NULL;
+
+				/*
+				 * Create new joinrel (as a sub-path of Append).
+				 */
+				alter_outer_sub_rel =
+						make_join_rel(root, orig_outer_sub_rel, inner_rel);
+
+				/* restore the join_rel_level */
+				root->join_rel_level = join_rel_level;
+
+				Assert(alter_outer_sub_rel != NULL);
+
+				if (alter_inner_path)
+				{
+					/*
+					 * Remove the (temporarily added) alter_inner_path from
+					 * the pathlist.
+					 *
+					 * alter_inner_path may be the inner/outer path of a
+					 * JoinPath made by make_join_rel() above, so we MUST NOT
+					 * free alter_inner_path itself.
+					 */
+					inner_rel->pathlist =
+							list_delete_ptr(inner_rel->pathlist, alter_inner_path);
+					set_cheapest(inner_rel);
+				}
+
+				if (IS_DUMMY_REL(alter_outer_sub_rel))
+				{
+					pfree(alter_outer_sub_rel);
+					continue;
+				}
+
+				/*
+				 * We must check whether alter_outer_sub_rel has at least one
+				 * path; add_path() sometimes rejects adding a new path to
+				 * the parent RelOptInfo.
+				 */
+				if (list_length(alter_outer_sub_rel->pathlist) <= 0)
+				{
+					/*
+					 * No paths were added.  This means the pull-up failed,
+					 * so clean up here.
+					 */
+					list_free_deep(alter_append_subpaths);
+					pfree(alter_outer_sub_rel);
+					list_free(joinclauses_parent);
+					elog(DEBUG1, "Append pull-up failed.");
+					return;
+				}
+
+				set_cheapest(alter_outer_sub_rel);
+				Assert(alter_outer_sub_rel->cheapest_total_path != NULL);
+				alter_append_subpaths = lappend(alter_append_subpaths,
+											alter_outer_sub_rel->cheapest_total_path);
+			} /* End of foreach(outer_path->subpaths) */
+
+			/* Append pull-up succeeded. Add the path to the original joinrel. */
+			add_path(joinrel,
+					(Path *) create_append_path(joinrel, alter_append_subpaths, NULL));
+
+			list_free(joinclauses_parent);
+			elog(DEBUG1, "Append pull-up succeeded.");
+		} /* End of foreach(inner_path->pathlist) */
+
+		/*
+		 * Check the length of joinrel's pathlist here.
+		 * If it is not greater than before the attempt above, none of the
+		 * inner paths were suitable for append pull-up, so we stop trying.
+		 */
+		if (list_length(joinrel->pathlist) <= num_pathlist_join)
+		{
+			elog(DEBUG1, "No paths are added. Abort now.");
+			return;
+		}
+	} /* End of foreach(outer_path->pathlist) */
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 411b36c..b088ba9 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4235,8 +4235,18 @@ prepare_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
 				/*
 				 * Ignore child members unless they match the rel being
 				 * sorted.
+				 *
+				 * For append pull-up, we must not ignore child members
+				 * when this is called from make_sort_from_pathkeys(),
+				 * because the "em_is_child" fields of all "ec_members"
+				 * are true in that case, and we might then fail to find
+				 * the pathkey (and raise an error).
+				 *
+				 * In that case "relids" may be NULL, so we don't ignore
+				 * child members when "relids" is NULL.
 				 */
-				if (em->em_is_child &&
+				if (relids != NULL &&
+					em->em_is_child &&
 					!bms_equal(em->em_relids, relids))
 					continue;
 
@@ -4349,8 +4359,17 @@ find_ec_member_for_tle(EquivalenceClass *ec,
 
 		/*
 		 * Ignore child members unless they match the rel being sorted.
+		 *
+		 * For append pull-up, we must not ignore child members when this is
+		 * called from make_sort_from_pathkeys(), because the "em_is_child"
+		 * fields of all "ec_members" are true in that case and we would
+		 * then fail to find the pathkey (and raise an error).
+		 *
+		 * In that case "relids" may be NULL, so we don't ignore child
+		 * members when "relids" is NULL.
 		 */
-		if (em->em_is_child &&
+		if (relids != NULL &&
+			em->em_is_child &&
 			!bms_equal(em->em_relids, relids))
 			continue;
 
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 9442e5f..c137b09 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -54,9 +54,6 @@ get_relation_info_hook_type get_relation_info_hook = NULL;
 static bool infer_collation_opclass_match(InferenceElem *elem, Relation idxRel,
 							  List *idxExprs);
 static int32 get_rel_data_width(Relation rel, int32 *attr_widths);
-static List *get_relation_constraints(PlannerInfo *root,
-						 Oid relationObjectId, RelOptInfo *rel,
-						 bool include_notnull);
 static List *build_index_tlist(PlannerInfo *root, IndexOptInfo *index,
 				  Relation heapRelation);
 
@@ -1022,7 +1019,7 @@ get_relation_data_width(Oid relid, int32 *attr_widths)
  * run, and in many cases it won't be invoked at all, so there seems no
  * point in caching the data in RelOptInfo.
  */
-static List *
+List *
 get_relation_constraints(PlannerInfo *root,
 						 Oid relationObjectId, RelOptInfo *rel,
 						 bool include_notnull)
Attachment: append_pullup.test_v1.patch (application/octet-stream)
diff --git a/src/test/regress/expected/append_pullup.out b/src/test/regress/expected/append_pullup.out
new file mode 100644
index 0000000..614a826
--- /dev/null
+++ b/src/test/regress/expected/append_pullup.out
@@ -0,0 +1,209 @@
+--
+-- Append pull-up across Join
+--
+--
+-- Build a table for testing
+--
+-- CREATE Partition Table (Modulation is used for dividing)
+create temp table check_test_div (
+id integer,
+data_x float8,
+data_y float8
+);
+create temp table check_test_div_0 (
+check(id % 3 = 0)
+) inherits(check_test_div);
+create temp table check_test_div_1 (
+check(id % 3 = 1)
+) inherits(check_test_div);
+create temp table check_test_div_2 (
+check(id % 3 = 2)
+) inherits(check_test_div);
+-- CREATE table for inner relation
+create temp table inner_t as
+select generate_series(0,3000)::integer as id, ceil(random()*10000)::integer as num;
+begin;
+insert INTO check_test_div_0
+select (ceil(random()*1000)*3)::integer as id, random(), random() as data
+from generate_series(0,5000);
+insert INTO check_test_div_1
+select (ceil(random()*1000)*3+1)::integer as id, random(), random() as data
+from generate_series(0,5000);
+insert INTO check_test_div_2
+select (ceil(random()*1000)*3+2)::integer as id, random(), random() as data
+from generate_series(0,5000);
+commit;
+-- CREATE table for verifying
+create temp table test_appended (
+data_x float8,
+data_y float8,
+num integer
+);
+begin;
+insert into test_appended (data_x, data_y, num)
+select data_x, data_y, num from only check_test_div join inner_t on check_test_div.id = inner_t.id;
+insert into test_appended (data_x, data_y, num)
+select data_x, data_y, num from check_test_div_0 join inner_t on check_test_div_0.id = inner_t.id;
+insert into test_appended (data_x, data_y, num)
+select data_x, data_y, num from check_test_div_1 join inner_t on check_test_div_1.id = inner_t.id;
+insert into test_appended (data_x, data_y, num)
+select data_x, data_y, num from check_test_div_2 join inner_t on check_test_div_2.id = inner_t.id;
+commit;
+set enable_hashjoin to on;
+set enable_mergejoin to off;
+set enable_nestloop to off;
+--
+-- Check plan
+--
+explain (costs off)
+select data_x, data_y, num from check_test_div join inner_t on check_test_div.id = inner_t.id;
+                      QUERY PLAN                       
+-------------------------------------------------------
+ Append
+   ->  Hash Join
+         Hash Cond: (inner_t.id = check_test_div.id)
+         ->  Seq Scan on inner_t
+         ->  Hash
+               ->  Seq Scan on check_test_div
+   ->  Hash Join
+         Hash Cond: (check_test_div_0.id = inner_t.id)
+         ->  Seq Scan on check_test_div_0
+         ->  Hash
+               ->  Seq Scan on inner_t
+                     Filter: ((id % 3) = 0)
+   ->  Hash Join
+         Hash Cond: (check_test_div_1.id = inner_t.id)
+         ->  Seq Scan on check_test_div_1
+         ->  Hash
+               ->  Seq Scan on inner_t
+                     Filter: ((id % 3) = 1)
+   ->  Hash Join
+         Hash Cond: (check_test_div_2.id = inner_t.id)
+         ->  Seq Scan on check_test_div_2
+         ->  Hash
+               ->  Seq Scan on inner_t
+                     Filter: ((id % 3) = 2)
+(24 rows)
+
+--
+-- Verify its results
+--
+select data_x, data_y, num from check_test_div join inner_t on check_test_div.id = inner_t.id
+except (select * from test_appended);
+ data_x | data_y | num 
+--------+--------+-----
+(0 rows)
+
+select * from test_appended
+except (
+select data_x, data_y, num from check_test_div join inner_t on check_test_div.id = inner_t.id
+);
+ data_x | data_y | num 
+--------+--------+-----
+(0 rows)
+
+drop table check_test_div cascade;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table check_test_div_0
+drop cascades to table check_test_div_1
+drop cascades to table check_test_div_2
+drop table test_appended;
+--
+-- Build a table for testing
+--
+-- CREATE Partition Table (Simple; Greater-than/Less-than marks are used for dividing)
+create temp table check_test_div (
+id integer,
+data_x float8,
+data_y float8
+);
+create temp table check_test_div_0 (
+check(id < 1000)
+) inherits(check_test_div);
+create temp table check_test_div_1 (
+check(id between 1000 and 1999)
+) inherits(check_test_div);
+create temp table check_test_div_2 (
+check(id > 1999)
+) inherits(check_test_div);
+-- Table for inner relation is already created.
+begin;
+insert INTO check_test_div_0
+select (ceil(random()*999))::integer as id, random(), random() as data
+from generate_series(0,5000);
+insert INTO check_test_div_1
+select (ceil(random()*999)+1000)::integer as id, random(), random() as data
+from generate_series(0,5000);
+insert INTO check_test_div_2
+select (ceil(random()*999)+2000)::integer as id, random(), random() as data
+from generate_series(0,5000);
+commit;
+-- CREATE table for verifying
+create temp table test_appended (
+data_x float8,
+data_y float8,
+num integer
+);
+begin;
+insert into test_appended (data_x, data_y, num)
+select data_x, data_y, num from only check_test_div join inner_t on check_test_div.id = inner_t.id;
+insert into test_appended (data_x, data_y, num)
+select data_x, data_y, num from check_test_div_0 join inner_t on check_test_div_0.id = inner_t.id;
+insert into test_appended (data_x, data_y, num)
+select data_x, data_y, num from check_test_div_1 join inner_t on check_test_div_1.id = inner_t.id;
+insert into test_appended (data_x, data_y, num)
+select data_x, data_y, num from check_test_div_2 join inner_t on check_test_div_2.id = inner_t.id;
+commit;
+set enable_hashjoin to on;
+set enable_mergejoin to off;
+set enable_nestloop to off;
+--
+-- Check plan
+--
+explain (costs off)
+select data_x, data_y, num from check_test_div join inner_t on check_test_div.id = inner_t.id;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Hash Join
+         Hash Cond: (inner_t.id = check_test_div.id)
+         ->  Seq Scan on inner_t
+         ->  Hash
+               ->  Seq Scan on check_test_div
+   ->  Hash Join
+         Hash Cond: (check_test_div_0.id = inner_t.id)
+         ->  Seq Scan on check_test_div_0
+         ->  Hash
+               ->  Seq Scan on inner_t
+                     Filter: (id < 1000)
+   ->  Hash Join
+         Hash Cond: (check_test_div_1.id = inner_t.id)
+         ->  Seq Scan on check_test_div_1
+         ->  Hash
+               ->  Seq Scan on inner_t
+                     Filter: ((id >= 1000) AND (id <= 1999))
+   ->  Hash Join
+         Hash Cond: (check_test_div_2.id = inner_t.id)
+         ->  Seq Scan on check_test_div_2
+         ->  Hash
+               ->  Seq Scan on inner_t
+                     Filter: (id > 1999)
+(24 rows)
+
+--
+-- Verify its results
+--
+select data_x, data_y, num from check_test_div join inner_t on check_test_div.id = inner_t.id
+except (select * from test_appended);
+ data_x | data_y | num 
+--------+--------+-----
+(0 rows)
+
+select * from test_appended
+except (
+select data_x, data_y, num from check_test_div join inner_t on check_test_div.id = inner_t.id
+);
+ data_x | data_y | num 
+--------+--------+-----
+(0 rows)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 3987b4c..eb2ee84 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -60,7 +60,7 @@ test: create_index create_view
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_aggregate create_function_3 create_cast constraints triggers inherit create_table_like typed_table vacuum drop_if_exists updatable_views rolenames roleattributes
+test: create_aggregate create_function_3 create_cast constraints triggers inherit append_pullup create_table_like typed_table vacuum drop_if_exists updatable_views rolenames roleattributes
 
 # ----------
 # sanity_check does a vacuum, affecting the sort order of SELECT *
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 379f272..ec37de3 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -67,6 +67,7 @@ test: create_cast
 test: constraints
 test: triggers
 test: inherit
+test: append_pullup
 test: create_table_like
 test: typed_table
 test: vacuum
diff --git a/src/test/regress/sql/append_pullup.sql b/src/test/regress/sql/append_pullup.sql
new file mode 100644
index 0000000..51a8606
--- /dev/null
+++ b/src/test/regress/sql/append_pullup.sql
@@ -0,0 +1,172 @@
+--
+-- Append pull-up across Join
+--
+
+--
+-- Build a table for testing
+--
+-- CREATE Partition Table (Modulation is used for dividing)
+create temp table check_test_div (
+id integer,
+data_x float8,
+data_y float8
+);
+
+create temp table check_test_div_0 (
+check(id % 3 = 0)
+) inherits(check_test_div);
+
+create temp table check_test_div_1 (
+check(id % 3 = 1)
+) inherits(check_test_div);
+
+create temp table check_test_div_2 (
+check(id % 3 = 2)
+) inherits(check_test_div);
+
+-- CREATE table for inner relation
+create temp table inner_t as
+select generate_series(0,3000)::integer as id, ceil(random()*10000)::integer as num;
+
+begin;
+
+insert INTO check_test_div_0
+select (ceil(random()*1000)*3)::integer as id, random(), random() as data
+from generate_series(0,5000);
+
+insert INTO check_test_div_1
+select (ceil(random()*1000)*3+1)::integer as id, random(), random() as data
+from generate_series(0,5000);
+
+insert INTO check_test_div_2
+select (ceil(random()*1000)*3+2)::integer as id, random(), random() as data
+from generate_series(0,5000);
+
+commit;
+
+-- CREATE table for verifying
+create temp table test_appended (
+data_x float8,
+data_y float8,
+num integer
+);
+
+begin;
+insert into test_appended (data_x, data_y, num)
+select data_x, data_y, num from only check_test_div join inner_t on check_test_div.id = inner_t.id;
+
+insert into test_appended (data_x, data_y, num)
+select data_x, data_y, num from check_test_div_0 join inner_t on check_test_div_0.id = inner_t.id;
+
+insert into test_appended (data_x, data_y, num)
+select data_x, data_y, num from check_test_div_1 join inner_t on check_test_div_1.id = inner_t.id;
+
+insert into test_appended (data_x, data_y, num)
+select data_x, data_y, num from check_test_div_2 join inner_t on check_test_div_2.id = inner_t.id;
+commit;
+
+set enable_hashjoin to on;
+set enable_mergejoin to off;
+set enable_nestloop to off;
+
+--
+-- Check plan
+--
+explain (costs off)
+select data_x, data_y, num from check_test_div join inner_t on check_test_div.id = inner_t.id;
+
+--
+-- Verify its results
+--
+select data_x, data_y, num from check_test_div join inner_t on check_test_div.id = inner_t.id
+except (select * from test_appended);
+
+select * from test_appended
+except (
+select data_x, data_y, num from check_test_div join inner_t on check_test_div.id = inner_t.id
+);
+
+drop table check_test_div cascade;
+drop table test_appended;
+
+--
+-- Build a table for testing
+--
+-- CREATE Partition Table (Simple; Greater-than/Less-than marks are used for dividing)
+create temp table check_test_div (
+id integer,
+data_x float8,
+data_y float8
+);
+
+create temp table check_test_div_0 (
+check(id < 1000)
+) inherits(check_test_div);
+
+create temp table check_test_div_1 (
+check(id between 1000 and 1999)
+) inherits(check_test_div);
+
+create temp table check_test_div_2 (
+check(id > 1999)
+) inherits(check_test_div);
+
+-- Table for inner relation is already created.
+
+begin;
+
+insert INTO check_test_div_0
+select (ceil(random()*999))::integer as id, random(), random() as data
+from generate_series(0,5000);
+
+insert INTO check_test_div_1
+select (ceil(random()*999)+1000)::integer as id, random(), random() as data
+from generate_series(0,5000);
+
+insert INTO check_test_div_2
+select (ceil(random()*999)+2000)::integer as id, random(), random() as data
+from generate_series(0,5000);
+
+commit;
+
+-- CREATE table for verifying
+create temp table test_appended (
+data_x float8,
+data_y float8,
+num integer
+);
+
+begin;
+insert into test_appended (data_x, data_y, num)
+select data_x, data_y, num from only check_test_div join inner_t on check_test_div.id = inner_t.id;
+
+insert into test_appended (data_x, data_y, num)
+select data_x, data_y, num from check_test_div_0 join inner_t on check_test_div_0.id = inner_t.id;
+
+insert into test_appended (data_x, data_y, num)
+select data_x, data_y, num from check_test_div_1 join inner_t on check_test_div_1.id = inner_t.id;
+
+insert into test_appended (data_x, data_y, num)
+select data_x, data_y, num from check_test_div_2 join inner_t on check_test_div_2.id = inner_t.id;
+commit;
+
+set enable_hashjoin to on;
+set enable_mergejoin to off;
+set enable_nestloop to off;
+
+--
+-- Check plan
+--
+explain (costs off)
+select data_x, data_y, num from check_test_div join inner_t on check_test_div.id = inner_t.id;
+
+--
+-- Verify its results
+--
+select data_x, data_y, num from check_test_div join inner_t on check_test_div.id = inner_t.id
+except (select * from test_appended);
+
+select * from test_appended
+except (
+select data_x, data_y, num from check_test_div join inner_t on check_test_div.id = inner_t.id
+);
#16Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Taiki Kondo (#15)
1 attachment(s)
Re: [Proposal] Table partition + join pushdown

Hello, sorry for the late response, and thank you for the new patch.

At Fri, 20 Nov 2015 12:05:38 +0000, Taiki Kondo <tai-kondo@yk.jp.nec.com> wrote in <12A9442FBAE80D4E8953883E0B84E08863F115@BPXM01GP.gisp.nec.co.jp>

I created v3 patch for this feature, and v1 patch for regression tests.
Please find attached.

I think I understood what you intend to do by
substitute_node_with_join_cond. It replaces *all* vars in the check
constraint with corresponding expressions containing inner vars. If
that is correct, it is predictable whether the check condition can
be successfully transformed. In addition,
substitute_node_with_join_cond uses whichever side of the join clauses
matches the target var, which is somewhat different from what
exactly should be done, even if it works correctly.

For those conditions, substitute_node_with_join_cond and
create_rinfo_from_check_constr could be simpler and clearer, as
follows. This refactored code also makes clear what the
function does, I believe.

====
create_rinfo_from_check_constr(...)
{
pull_varattnos(check_constr, outer_rel->relid, &chkattnos);
replacements =
extract_replacements(joininfo, outer_rel->relid, &joinattnos);

/*
* exit if the join clauses cannot replace all vars in the check
* constraint
*/
if (!bms_is_subset(chkattnos, joinattnos))
return NULL;

foreach(lc, check_constr)
{
result = lappend(result, expression_tree_mutator(...);
}
====

The attached patch does this.
What do you think about this refactoring?

Reply for your comments is below.

Overall comments
----------------
* I think the enhancement in copyfuncs.c shall be in the separate
patch; it is more graceful manner. At this moment, here is less
than 20 Path delivered type definition. It is much easier works
than entire Plan node support as we did recently.
(How about other folk's opinion?)

I also would like to wait for other fork's opinion.
So I don't divide this part from this patch yet.

Other fork? It's Me?

_copyPathFields is independent from all other parts of this patch
and it looks to be a generic function. I prefer that such
independent features be separate patches, too.

At this moment, here is less
than 20 Path delivered type definition. It is much easier works
than entire Plan node support as we did recently.
(How about other folk's opinion?)

It should be doable, but I don't think we should provide every
possible _copyPath*. Currently it looks sufficient to provide
the function for at least the Paths listed in
try_append_pullup_across_join, as shown below; others should
not be added if they won't be used for now.

T_SeqScan, T_SampleScan, T_IndexScan, T_IndexOnlyScan,
T_BitmapHeapScan, T_TidScan, T_Gather

I doubt that tid partitioning is used, but there's no reason to
refuse to support it. By the way, would you add regression tests for
these other types of path?

* Can you integrate the attached test cases as regression test?
It is more generic way, and allows people to detect problems
if relevant feature gets troubled in the future updates.

Ok, done. Please find attached.

* Naming of "join pushdown" is a bit misleading because other
component also uses this term, but different purpose.
I'd like to suggest try_pullup_append_across_join.
Any ideas from native English speaker?

Thank you for your suggestion.

I changed its name to "try_append_pullup_across_join",
which matches the word order of the previous name.

However, this change is just temporary.
I would also like to wait for other folks' opinions
on the naming.

Patch review
------------

At try_join_pushdown:
+   /* When specified outer path is not an AppendPath, nothing to do here. */
+   if (!IsA(outer_rel->cheapest_total_path, AppendPath))
+   {
+       elog(DEBUG1, "Outer path is not an AppendPath. Do nothing.");
+       return;
+   }
It checks whether the cheapest_total_path is an AppendPath at the head
of this function. It ought to be a loop that walks the pathlist of the
RelOptInfo, because multiple path-nodes might still be alive
without being the cheapest_total_path.

Ok, done.

+   switch (inner_rel->cheapest_total_path->pathtype)
+
Also, we can construct the new Append node if one of the path-node
within pathlist of inner_rel are at least supported.

Done.
But this change creates a nested loop over inner_rel's pathlist
and outer_rel's pathlist, which increases planning time further.

I think it is adequate to check only cheapest_total_path,
because checking only cheapest_total_path is what other
parts, like make_join_rel(), do.

What is your (and other people's) opinion?

+   if (list_length(inner_rel->ppilist) > 0)
+   {
+       elog(DEBUG1, "ParamPathInfo is already set in inner_rel. Can't pushdown.");
+       return;
+   }
+
You may need to explain why this feature uses ParamPathInfos here.
It seems to me a good hack to attach additional qualifiers to
the underlying inner scan node, even if it is not a direct child of
the inner relation.
However, people may have different opinions.

Ok, I added a comment in the source.
Please see the attached patch.

I suppose that the term 'parameter' used here is strictly defined
as the conditions and information about 'parameterized' paths, which
relate to restrictions involving another relation. In contrast, the
PPI added here contains something totally different from the
parameters so defined, since it refers only to the relation itself;
in other words, it is not a join condition. I think such kinds of
restrictions should be added to baserestrictinfo in RelOptInfo. The
conditions derived from constraints can simply be added to it, and
doing so needs no additional explanation.

Any opinions?

+static List *
+convert_parent_joinclauses_to_child(PlannerInfo *root, List *join_clauses,
+                                   RelOptInfo *outer_rel) {
+   Index       parent_relid =
+                   find_childrel_appendrelinfo(root, outer_rel)->parent_relid;
+   List        *clauses_parent = get_actual_clauses(join_clauses);
+   List        *clauses_child = NIL;
+   ListCell    *lc;
+
+   foreach(lc, clauses_parent)
+   {
+       Node    *one_clause_child = (Node *) copyObject(lfirst(lc));
+
+       ChangeVarNodes(one_clause_child, parent_relid, outer_rel->relid, 0);
+       clauses_child = lappend(clauses_child, one_clause_child);
+   }
+
+   return make_restrictinfos_from_actual_clauses(root, clauses_child); 
+}

Is ChangeVarNodes() the right routine to replace var-nodes of the
parent relation with the relevant var-nodes of the child relation?
It may look sufficient; however, nobody can ensure that the varattno
of a child relation is identical to the parent relation's.
For example, which attribute number shall be assigned on 'z' here?
CREATE TABLE tbl_parent(x int);
CREATE TABLE tbl_child(y int) INHERITS(tbl_parent);
ALTER TABLE tbl_parent ADD COLUMN z int;

You may be right, so I agree with you.
I now use adjust_appendrel_attrs() instead of ChangeVarNodes()
for this purpose.

--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4230,8 +4230,14 @@ prepare_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
/*
* Ignore child members unless they match the rel being
* sorted.
+                *
+                * If this is called from make_sort_from_pathkeys(),
+                * relids may be NULL. In this case, we must not ignore child
+                * members because inner/outer plan of pushed-down merge join is
+                * always child table.
*/
-               if (em->em_is_child &&
+               if (relids != NULL &&
+                   em->em_is_child &&
!bms_equal(em->em_relids, relids))
continue;

It is a little bit hard to understand why this modification is needed.
Could you add a source code comment that focuses on the reason why?

Ok, I added a comment in the source.
Please see the attached patch.

Could you show me an example that reaches this code with relids == NULL?

I don't know why prepare_sort_from_pathkeys is called under such a
condition with this patch, but the modification apparently changes
the behavior of this function. I think we should find another way
to get the same result, or a more general explanation of why it is
valid to do so.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Refactor-create_rinfo_from_check_constr-so-that-it-r.patch (text/x-patch; charset=us-ascii)
From f6de7eabdeaebd8272126b210c2a0c3d93285004 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 26 Nov 2015 10:02:54 +0900
Subject: [PATCH] Refactor create_rinfo_from_check_constr so that it reflects
 what should be done exactly.

---
 src/backend/optimizer/path/joinpath.c | 161 ++++++++++++++++++----------------
 src/include/optimizer/plancat.h       |   4 +
 2 files changed, 91 insertions(+), 74 deletions(-)

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index 6dec33c..8764e98 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -16,6 +16,7 @@
 
 #include <math.h>
 
+#include "access/sysattr.h"
 #include "executor/executor.h"
 #include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
@@ -27,14 +28,9 @@
 #include "optimizer/plancat.h"
 #include "optimizer/prep.h"
 #include "optimizer/restrictinfo.h"
+#include "optimizer/var.h"
 #include "utils/lsyscache.h"
 
-typedef struct
-{
-	List	*joininfo;
-	bool	 is_substituted;
-} substitution_node_context;
-
 /* Hook for plugins to get control in add_paths_to_joinrel() */
 set_join_pathlist_hook_type set_join_pathlist_hook = NULL;
 
@@ -1506,77 +1502,89 @@ select_mergejoin_clauses(PlannerInfo *root,
 }
 
 /*
- * Try to substitute Var node according to join conditions.
- * This process is from following steps.
- *
- * 1. Try to find whether Var node matches to left/right Var node of
- *    one join condition.
- * 2. If found, replace Var node with the opposite expression node of
- *    the join condition.
- *
- * For example, let's assume that we have following expression and
- * join condition.
- * Expression       : A.num % 4 = 1
- * Join condition   : A.num = B.data + 2
- * In this case, we can get following expression.
- *    (B.data + 2) % 4 = 1
+ * Substitute vars with corresponding nodes.
  */
 static Node *
-substitute_node_with_join_cond(Node *node, substitution_node_context *context)
+substitute_nodes(Node *node, List *replacements)
 {
-	/* Failed to substitute. Abort. */
-	if (!context->is_substituted)
-		return (Node *) copyObject(node);
-
 	if (node == NULL)
 		return NULL;
 
 	if (IsA(node, Var))
 	{
-		List		*join_cond = context->joininfo;
 		ListCell	*lc;
+		Node		*replacement = NULL;
 
-		Assert(list_length(join_cond) > 0);
+		Assert(list_length(replacements) > 0);
 
-		foreach (lc, join_cond)
+		foreach (lc, replacements)
 		{
-			RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
-			Expr *expr = rinfo->clause;
+			List *ent = (List *) lfirst(lc);
+			Var *target = (Var *) linitial(ent);
 
-			/*
-			 * Make sure whether OpExpr of Join clause means "=".
-			 */
-			if (!rinfo->can_join ||
-				!IsA(expr, OpExpr) ||
-				!op_hashjoinable(((OpExpr *) expr)->opno,
-								exprType(get_leftop(expr))))
+			if (!equal(target, node))
 				continue;
 
-			if (equal(get_leftop(expr), node))
-			{
-				/*
-				 * This node is equal to LEFT node of join condition,
-				 * thus will be replaced with RIGHT clause.
-				 */
-				return (Node *) copyObject(get_rightop(expr));
-			}
-			else
-			if (equal(get_rightop(expr), node))
-			{
-				/*
-				 * This node is equal to RIGHT node of join condition,
-				 * thus will be replaced with LEFT clause.
-				 */
-				return (Node *) copyObject(get_leftop(expr));
-			}
+			replacement = (Node *) lsecond(ent);
+			break;
 		}
 
-		/* Unfortunately, substituting is failed. */
-		context->is_substituted = false;
-		return (Node *) copyObject(node);
+		/* All vars must be replaced  */
+		Assert(replacement != NULL);
+
+		return replacement;
 	}
 
-	return expression_tree_mutator(node, substitute_node_with_join_cond, context);
+	return expression_tree_mutator(node, substitute_nodes, replacements);
+}
+
+/*
+ * Extract replacements for the relation from joininfo
+ */
+static List *
+extract_replacements(List *joininfo, Index relid, Bitmapset **attnos)
+{
+	List	 *substitutes = NIL;
+	ListCell *lc;
+
+	/*
+	 * clauses in joininfo are assumed to be in conjunction so any of them can
+	 * be a substitution independently from others.
+	 */
+	foreach (lc, joininfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+		Expr *expr = rinfo->clause;
+		Var *leftop = (Var *)get_leftop(expr);
+		Var *rightop = (Var *)get_rightop(expr);
+
+		/*
+		 * We could solve join expression for the relation, but we don't and
+		 * just find matching vars in either side for now.
+		 */
+		if (!rinfo->can_join ||
+			!IsA(expr, OpExpr) ||
+			!(IsA(leftop, Var) || IsA(rightop, Var)) ||
+			!op_hashjoinable(((OpExpr *) expr)->opno, exprType((Node *)leftop)))
+			continue;
+
+		if (IsA(leftop, Var) && leftop->varno == relid)
+		{
+			substitutes = lappend(substitutes, 
+								  list_make2(leftop, rightop));
+			*attnos = bms_add_member(*attnos,
+					 leftop->varattno - FirstLowInvalidHeapAttributeNumber);
+		}
+		else if (IsA(rightop, Var) && rightop->varno == relid)
+		{
+			substitutes = lappend(substitutes, 
+								  list_make2(rightop, leftop));
+			*attnos = bms_add_member(*attnos,
+					 rightop->varattno - FirstLowInvalidHeapAttributeNumber);
+		}
+	}
+
+	return substitutes;
 }
 
 /*
@@ -1606,8 +1614,11 @@ create_rinfo_from_check_constr(PlannerInfo *root, List *joininfo,
 	List			*check_constr =
 						get_relation_constraints(root, childRTE->relid,
 													outer_rel, false);
+	List			*replacements = NIL;
+
 	ListCell		*lc;
-	substitution_node_context	context;
+	Bitmapset 		*chkattnos = NULL;
+	Bitmapset 		*joinattnos = NULL;
 
 	if (list_length(check_constr) <= 0)
 	{
@@ -1615,26 +1626,28 @@ create_rinfo_from_check_constr(PlannerInfo *root, List *joininfo,
 		return NIL;
 	}
 
-	context.joininfo = joininfo;
-	context.is_substituted = true;
+	pull_varattnos((Node *)check_constr, outer_rel->relid, &chkattnos);
+	replacements =
+		extract_replacements(joininfo, outer_rel->relid, &joinattnos);
 
 	/*
-	 * Try to convert CHECK() constraints to filter expressions.
+	 * exit if the join clauses cannot replace all vars in the check
+	 * constraint
+	 */
+	if (!bms_is_subset(chkattnos, joinattnos))
+		return NULL;
+
+	/*
+	 * Generate filter condition for the inner relation from check constraints
+	 * on the outer relation by substituting outer vars with inner equivalents
+	 * derived from the join condition.
 	 */
 	foreach(lc, check_constr)
 	{
-		Node *substituted =
-				expression_tree_mutator((Node *) lfirst(lc),
-										substitute_node_with_join_cond,
-										(void *) &context);
-
-		if (!context.is_substituted)
-		{
-			*succeed = false;
-			list_free_deep(check_constr);
-			return NIL;
-		}
-		result = lappend(result, substituted);
+		result = lappend(result,
+						 expression_tree_mutator((Node *) lfirst(lc),
+												 substitute_nodes,
+												 (void *) replacements));
 	}
 
 	Assert(list_length(check_constr) == list_length(result));
diff --git a/src/include/optimizer/plancat.h b/src/include/optimizer/plancat.h
index 11e7d4d..07f3b8d 100644
--- a/src/include/optimizer/plancat.h
+++ b/src/include/optimizer/plancat.h
@@ -35,6 +35,10 @@ extern void estimate_rel_size(Relation rel, int32 *attr_widths,
 
 extern int32 get_relation_data_width(Oid relid, int32 *attr_widths);
 
+extern List *get_relation_constraints(PlannerInfo *root,
+									  Oid relationObjectId, RelOptInfo *rel,
+									  bool include_notnull);
+
 extern bool relation_excluded_by_constraints(PlannerInfo *root,
 								 RelOptInfo *rel, RangeTblEntry *rte);
 
-- 
1.8.3.1

#17 Michael Paquier
michael.paquier@gmail.com
In reply to: Taiki Kondo (#15)
Re: [Proposal] Table partition + join pushdown

On Fri, Nov 20, 2015 at 9:05 PM, Taiki Kondo <tai-kondo@yk.jp.nec.com> wrote:

I created v3 patch for this feature, and v1 patch for regression tests.
Please find attached.

[blah review and replies]

Please see the attached patch.

This new patch did not actually get a review, moved to next CF.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18 Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#17)
Re: [Proposal] Table partition + join pushdown

On Tue, Dec 22, 2015 at 8:36 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Nov 20, 2015 at 9:05 PM, Taiki Kondo <tai-kondo@yk.jp.nec.com> wrote:

I created v3 patch for this feature, and v1 patch for regression tests.
Please find attached.

[blah review and replies]

Please see the attached patch.

This new patch did not actually get a review, moved to next CF.

I think this patch is doomed. Suppose you join A to B on A.x = B.y.
The existence of a constraint on table A which says CHECK(P(x)) does
not imply that only rows of B where P(y) is true will join. For
example, suppose that x and y are numeric columns and P(x) is
length(x::text) == 3. Then you could have 1 in one table and 1.0 in
the other table; they join, but P(x) is true for one and false for
the other. The fundamental problem is that equality according to the
join operator need not mean equality for all purposes. 1 and 1.0, as
numerics, are equal, but not the same. Since the whole patch is based
on this idea, I believe that means this patch is dead in the water.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#19 Greg Stark
stark@mit.edu
In reply to: Robert Haas (#18)
Re: [Proposal] Table partition + join pushdown

On Mon, Jan 18, 2016 at 5:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:

For
example, suppose that x and y are numeric columns and P(x) is
length(x::text) == 3. Then you could have 1 in one table and 1.0 in
the other table; they join, but P(x) is true for one and false for the
other.

Fwiw, ages ago there was some talk about having a property on
functions "equality preserving" or something like that. If a function,
or more likely a <function,operator> tuple had this property set then
x op y => f(x) op f(y). This would be most useful for things like
substring or hash functions which would allow partial indexes or
partition exclusion to be more generally useful.

Of course then you really want <f,op1,op2> to indicate that "a op1 b
=> f(a) op2 f(b)", so you can handle things like <substring,lt,lte> so
that "a < b => substring(a,n) <= substring(b,n)"; but then you need
some way to represent the extra arguments to substring, and the whole
thing became too complex and got dropped.

But perhaps even a simpler property that only worked for equality and
single-argument functions would be useful, since it would let us mark
hash functions. Or perhaps we only need to mark the few functions that
expose properties that don't affect equality, since I think there are
actually very few of them.

--
greg


#20 Robert Haas
robertmhaas@gmail.com
In reply to: Greg Stark (#19)
Re: [Proposal] Table partition + join pushdown

On Tue, Jan 19, 2016 at 7:59 AM, Greg Stark <stark@mit.edu> wrote:

On Mon, Jan 18, 2016 at 5:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:

For
example, suppose that x and y are numeric columns and P(x) is
length(x::text) == 3. Then you could have 1 in one table and 1.0 in
the other table; they join, but P(x) is true for one and false for the
other.

Fwiw, ages ago there was some talk about having a property on
functions "equality preserving" or something like that. If a function,
or more likely a <function,operator> tuple had this property set then
x op y => f(x) op f(y). This would be most useful for things like
substring or hash functions which would allow partial indexes or
partition exclusion to be more generally useful.

Of course then you really want <f,op1,op2> to indicate that "a op1 b
=> f(a) op2 f(b)", so you can handle things like <substring,lt,lte> so
that "a < b => substring(a,n) <= substring(b,n)"; but then you need
some way to represent the extra arguments to substring, and the whole
thing became too complex and got dropped.

But perhaps even a simpler property that only worked for equality and
single-argument functions would be useful, since it would let us mark
hash functions. Or perhaps we only need to mark the few functions that
expose properties that don't affect equality, since I think there are
actually very few of them.

We could certainly mark operators that amount to testing binary
equality as such, and this optimization could be used for join
operators so marked. But I worry that would become a crutch, with
people implementing optimizations that work for such operators and
leaving numeric (for example) out in the cold. Of course, we could
worry about such problems when and if they happen, and accept the idea
of markings for now. However, I'm inclined to think that there's a
better way to optimize the case Taiki Kondo and Kouhei Kagai are
targeting.

If we get declarative partitioning, an oft-requested feature that has
been worked on by various people over the years and currently by Amit
Langote, and specifically if we get hash partitioning, then we'll
presumably use the hash function for the default operator class of the
partitioning column's datatype to partition the table. Then, if we do
a join against some other table and consider a hash join, we'll be
using the same hash function on our side, and either the same operator
or a compatible operator for some other datatype in the same opfamily
on the other side. At that point, if we push down the join, we can
add a filter on the inner side of the join requiring that the hash
value of the matching column map to the partition it's being joined
against.
And we don't get a recurrence of this problem in that case, because
we're not dealing with an arbitrary predicate - we're dealing with a
hash function whose equality semantics are defined to be compatible
with the join operator.

That approach works with any data type that has a default hash
operator class, which covers pretty much everything anybody is likely
to care about, including numeric.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
