Remove no-op PlaceHolderVars

Started by Richard Guoover 1 year ago3 messages
#1Richard Guo
guofenglinux@gmail.com
1 attachment(s)

In [1]/messages/by-id/CAMbWs4-9dYrF44pkpkFrPoLXc_YH15DL8xT8J-oHggp_WvsLLA@mail.gmail.com there was a short discussion about removing no-op
PlaceHolderVars. This thread is for continuing that discussion.

We may insert PlaceHolderVars when pulling up a subquery that is
within the nullable side of an outer join: if subquery pullup needs to
replace a subquery-referencing Var that has nonempty varnullingrels
with an expression that is not simply a Var, we may need to wrap the
replacement expression into a PlaceHolderVar. However, if the outer
join above is later reduced to an inner join, the PHVs would become
no-ops with no phnullingrels bits.

I think it's always desirable to remove these no-op PlaceHolderVars
because they can constrain optimization opportunities. The issue
reported in [1]/messages/by-id/CAMbWs4-9dYrF44pkpkFrPoLXc_YH15DL8xT8J-oHggp_WvsLLA@mail.gmail.com shows that they can block subexpression folding, which
can prevent an expression from being matched to an index. I provided
another example in that thread which shows that no-op PlaceHolderVars
can force join orders, potentially forcing us to resort to nestloop
join in some cases.

As explained in the comment of remove_nulling_relids_mutator, PHVs are
used in some cases to enforce separate identity of subexpressions. In
other cases, however, it should be safe to remove a PHV if its
phnullingrels becomes empty.

Attached is a WIP patch that marks PHVs that need to be kept because
they are serving to isolate subexpressions, and removes all other PHVs
in remove_nulling_relids_mutator if their phnullingrels bits become
empty. One problem with it is that a PHV's contained expression may
not have been fully preprocessed. For example if the PHV is used as a
qual clause, its contained expression is not processed for AND/OR
flattening, which could trigger the Assert in make_restrictinfo that
the clause shouldn't be an AND clause.

For example:

create table t (a boolean, b boolean);

select * from t t1 left join
lateral (select (t1.a and t1.b) as x, * from t t2) s on true
where s.x;

The other problem with this is that it breaks 17 test cases. We need
to modify them one by one to ensure that they still test what they are
supposed to, which is not a trivial task.

Before delving into these two problems, I'd like to know whether this
optimization is worthwhile, and whether I'm going in the right
direction.

[1]: /messages/by-id/CAMbWs4-9dYrF44pkpkFrPoLXc_YH15DL8xT8J-oHggp_WvsLLA@mail.gmail.com

Thanks
Richard

Attachments:

v1-0001-Remove-no-op-PlaceHolderVars.patchapplication/octet-stream; name=v1-0001-Remove-no-op-PlaceHolderVars.patchDownload
From 4cc4c985141c302a1579514472eb38942637464d Mon Sep 17 00:00:00 2001
From: Richard Guo <guofenglinux@gmail.com>
Date: Mon, 2 Sep 2024 17:01:20 +0900
Subject: [PATCH v1] Remove no-op PlaceHolderVars

We may insert PlaceHolderVars when pulling up a subquery that is
within the nullable side of an outer join.  However, if the outer join
is later reduced to an inner join, the PHVs would become no-ops with
no phnullingrels bits.

It's always desirable to remove these no-op PlaceHolderVars because
they can constrain optimization opportunities, such as blocking
subexpression folding, or forcing join order.  There is one case where
we need to keep a PHV even if its phnullingrels becomes empty: the PHV
is used to enforce separate identity of subexpressions.

This patch removes no-op PlaceHolderVars by marking PHVs that need to
be kept because they are serving to isolate subexpressions.
---
 src/backend/optimizer/prep/prepjointree.c | 17 +++++++++++++++++
 src/backend/optimizer/util/placeholder.c  |  7 ++++---
 src/backend/optimizer/util/restrictinfo.c |  9 +++++++--
 src/backend/rewrite/rewriteManip.c        | 15 ++++++++++++---
 src/include/nodes/pathnodes.h             |  7 +++++++
 5 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
index 34fbf8ee23..25a4e826c5 100644
--- a/src/backend/optimizer/prep/prepjointree.c
+++ b/src/backend/optimizer/prep/prepjointree.c
@@ -2442,6 +2442,14 @@ pullup_replace_vars_callback(Var *var,
 				make_placeholder_expr(rcon->root,
 									  (Expr *) newnode,
 									  bms_make_singleton(rcon->varno));
+
+			/*
+			 * If the PHV is used to isolate subexpressions, mark it as
+			 * needing to be kept.
+			 */
+			if (rcon->wrap_non_vars)
+				((PlaceHolderVar *) newnode)->needkeep = true;
+
 			/* cache it with the PHV, and with phlevelsup etc not set yet */
 			rcon->rv_cache[InvalidAttrNumber] = copyObject(newnode);
 		}
@@ -2462,6 +2470,7 @@ pullup_replace_vars_callback(Var *var,
 		if (need_phv)
 		{
 			bool		wrap;
+			bool		isolate_exprs = false;
 
 			if (newnode && IsA(newnode, Var) &&
 				((Var *) newnode)->varlevelsup == 0)
@@ -2494,6 +2503,7 @@ pullup_replace_vars_callback(Var *var,
 			{
 				/* Caller told us to wrap all non-Vars in a PlaceHolderVar */
 				wrap = true;
+				isolate_exprs = true;
 			}
 			else
 			{
@@ -2541,6 +2551,13 @@ pullup_replace_vars_callback(Var *var,
 										  (Expr *) newnode,
 										  bms_make_singleton(rcon->varno));
 
+				/*
+				 * If the PHV is used to isolate subexpressions, mark it as
+				 * needing to be kept.
+				 */
+				if (isolate_exprs)
+					((PlaceHolderVar *) newnode)->needkeep = true;
+
 				/*
 				 * Cache it if possible (ie, if the attno is in range, which
 				 * it probably always should be).
diff --git a/src/backend/optimizer/util/placeholder.c b/src/backend/optimizer/util/placeholder.c
index 81abadd6db..5c678db22d 100644
--- a/src/backend/optimizer/util/placeholder.c
+++ b/src/backend/optimizer/util/placeholder.c
@@ -44,9 +44,9 @@ static bool contain_placeholder_references_walker(Node *node,
  * phrels is the syntactic location (as a set of relids) to attribute
  * to the expression.
  *
- * The caller is responsible for adjusting phlevelsup and phnullingrels
- * as needed.  Because we do not know here which query level the PHV
- * will be associated with, it's important that this function touches
+ * The caller is responsible for adjusting phlevelsup, phnullingrels and
+ * needkeep as needed.  Because we do not know here which query level the
+ * PHV will be associated with, it's important that this function touches
  * only root->glob; messing with other parts of PlannerInfo would be
  * likely to do the wrong thing.
  */
@@ -60,6 +60,7 @@ make_placeholder_expr(PlannerInfo *root, Expr *expr, Relids phrels)
 	phv->phnullingrels = NULL;	/* caller may change this later */
 	phv->phid = ++(root->glob->lastPHId);
 	phv->phlevelsup = 0;		/* caller may change this later */
+	phv->needkeep = false;		/* caller may change this later */
 
 	return phv;
 }
diff --git a/src/backend/optimizer/util/restrictinfo.c b/src/backend/optimizer/util/restrictinfo.c
index 0b406e9334..8ec579e86e 100644
--- a/src/backend/optimizer/util/restrictinfo.c
+++ b/src/backend/optimizer/util/restrictinfo.c
@@ -87,8 +87,13 @@ make_restrictinfo(PlannerInfo *root,
 													   incompatible_relids,
 													   outer_relids);
 
-	/* Shouldn't be an AND clause, else AND/OR flattening messed up */
-	Assert(!is_andclause(clause));
+	/*
+	 * XXX Normally it shouldn't be an AND clause, else AND/OR flattening
+	 * messed up.  An exception occurs if the clause was initially wrapped in
+	 * a PlaceHolderVar and the PlaceHolderVar is removed afterward.  In this
+	 * case the clause may not have been processed for AND/OR flattening by
+	 * preprocess_expression.
+	 */
 
 	return make_restrictinfo_internal(root,
 									  clause,
diff --git a/src/backend/rewrite/rewriteManip.c b/src/backend/rewrite/rewriteManip.c
index b20625fbd2..58b3e67a49 100644
--- a/src/backend/rewrite/rewriteManip.c
+++ b/src/backend/rewrite/rewriteManip.c
@@ -1281,9 +1281,10 @@ remove_nulling_relids_mutator(Node *node,
 		{
 			/*
 			 * Note: it might seem desirable to remove the PHV altogether if
-			 * phnullingrels goes to empty.  Currently we dare not do that
-			 * because we use PHVs in some cases to enforce separate identity
-			 * of subexpressions; see wrap_non_vars usages in prepjointree.c.
+			 * phnullingrels goes to empty.  Currently we only dare to do that
+			 * if the PHV is not marked as needing to be kept, because we use
+			 * PHVs in some cases to enforce separate identity of
+			 * subexpressions; see wrap_non_vars usages in prepjointree.c.
 			 */
 			/* Copy the PlaceHolderVar and mutate what's below ... */
 			phv = (PlaceHolderVar *)
@@ -1297,6 +1298,14 @@ remove_nulling_relids_mutator(Node *node,
 			phv->phrels = bms_difference(phv->phrels,
 										 context->removable_relids);
 			Assert(!bms_is_empty(phv->phrels));
+
+			/*
+			 * Remove the PHV altogether if it is not marked as needing to be
+			 * kept and phnullingrels goes to empty.
+			 */
+			if (!phv->needkeep && bms_is_empty(phv->phnullingrels))
+				return (Node *) phv->phexpr;
+
 			return (Node *) phv;
 		}
 		/* Otherwise fall through to copy the PlaceHolderVar normally */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 540d021592..ab9758587a 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2758,6 +2758,10 @@ typedef struct MergeScanSelCache
  * level of a PlaceHolderVar might be a join rather than a base relation.
  * Likewise, phnullingrels corresponds to varnullingrels.
  *
+ * needkeep indicates whether the PHV needs to be kept when its phnullingrels
+ * becomes empty.  This is set true in cases where the PHV is used to isolate
+ * subexpressions; see wrap_non_vars usages in prepjointree.c.
+ *
  * Although the planner treats this as an expression node type, it is not
  * recognized by the parser or executor, so we declare it here rather than
  * in primnodes.h.
@@ -2796,6 +2800,9 @@ typedef struct PlaceHolderVar
 
 	/* > 0 if PHV belongs to outer query */
 	Index		phlevelsup;
+
+	/* true if PHV needs to be kept */
+	bool		needkeep;
 } PlaceHolderVar;
 
 /*
-- 
2.43.0

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Guo (#1)
Re: Remove no-op PlaceHolderVars

Richard Guo <guofenglinux@gmail.com> writes:

Attached is a WIP patch that marks PHVs that need to be kept because
they are serving to isolate subexpressions, and removes all other PHVs
in remove_nulling_relids_mutator if their phnullingrels bits become
empty. One problem with it is that a PHV's contained expression may
not have been fully preprocessed.

Yeah. I've been mulling over how we could do this, and the real
problem is that the expression containing the PHV *has* been fully
preprocessed by the time we get to outer join strength reduction
(cf. file header comment in prepjointree.c). Simply dropping the PHV
can break various invariants that expression preprocessing is supposed
to establish, such as "no RelabelType directly above another" or "no
AND clause directly above another". I haven't thought of a reliable
way to fix that short of re-running eval_const_expressions afterwards,
which seems like a mighty expensive answer. We could try to make
remove_nulling_relids_mutator preserve all these invariants, but
keeping it in sync with what eval_const_expressions does seems like
a maintenance nightmare.

The other problem with this is that it breaks 17 test cases.

I've not looked into that, but yeah, it would need some tedious
analysis to verify whether the changes are OK.

Before delving into these two problems, I'd like to know whether this
optimization is worthwhile, and whether I'm going in the right
direction.

I believe this is an area worth working on. I've been wondering
whether it'd be better to handle the expression-identity problems by
introducing an "expression wrapper" node type that is distinct from
PHV and has the sole purpose of demarcating a subexpression we don't
want to be folded into its parent. I think that doesn't really move
the football in terms of fixing the problems mentioned above, but
perhaps it's conceptually cleaner than adding a bool field to PHV.

Another thought is that grouping sets provide one of the big reasons
why we need an "expression wrapper" or equivalent functionality.
So maybe we should try to move forward on your other patch to change
the representation of those before we spend too much effort here.

regards, tom lane

#3Richard Guo
guofenglinux@gmail.com
In reply to: Tom Lane (#2)
Re: Remove no-op PlaceHolderVars

On Tue, Sep 3, 2024 at 11:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah. I've been mulling over how we could do this, and the real
problem is that the expression containing the PHV *has* been fully
preprocessed by the time we get to outer join strength reduction
(cf. file header comment in prepjointree.c). Simply dropping the PHV
can break various invariants that expression preprocessing is supposed
to establish, such as "no RelabelType directly above another" or "no
AND clause directly above another".

Yeah, this is what the problem is.

I haven't thought of a reliable
way to fix that short of re-running eval_const_expressions afterwards,
which seems like a mighty expensive answer.

This seems likely to result in a lot of duplicate work.

We could try to make
remove_nulling_relids_mutator preserve all these invariants, but
keeping it in sync with what eval_const_expressions does seems like
a maintenance nightmare.

Yeah, and it seems that we also need to know the EXPRKIND code for the
expression containing the to-be-dropped PHV in remove_nulling_relids
to know how we should preserve all these invariants, which seems to
require a lot of code changes.

Looking at the routines run in preprocess_expression, most of them
recurse into PlaceHolderVars and preprocess the contained expressions.
The two exceptions are canonicalize_qual and make_ands_implicit. I
wonder if we can modify them to also recurse into PlaceHolderVars.
Will this resolve our problem here?

I believe this is an area worth working on. I've been wondering
whether it'd be better to handle the expression-identity problems by
introducing an "expression wrapper" node type that is distinct from
PHV and has the sole purpose of demarcating a subexpression we don't
want to be folded into its parent. I think that doesn't really move
the football in terms of fixing the problems mentioned above, but
perhaps it's conceptually cleaner than adding a bool field to PHV.

Another thought is that grouping sets provide one of the big reasons
why we need an "expression wrapper" or equivalent functionality.
So maybe we should try to move forward on your other patch to change
the representation of those before we spend too much effort here.

Hmm, in the case of grouping sets, we have a similar situation where
we do not want a subexpression that is part of grouping items to be
folded into its parent, because otherwise we may fail to match these
subexpressions to lower target items.

For grouping sets, this problem is much easier to resolve, because
we've already replaced such subexpressions with Vars referencing the
GROUP RTE. As a result, during expression preprocessing, these
subexpressions will not be folded into their parents.

An ensuing effect of this approach is that a HAVING clause may contain
expressions that are not fully preprocessed if they are part of
grouping items. This is not an issue as long as the clause remains in
HAVING, because these expressions will be matched to lower target
items in setrefs.c. However, if the clause is moved or copied into
WHERE, we need to re-preprocess these expressions. This should not be
too expensive, as we only need to re-preprocess the HAVING clauses
that are moved to WHERE, not the entire tree. The v13 patch in that
thread implements this approach.

Thanks
Richard