Window Function "Run Conditions"

Started by David Rowley over 4 years ago, 33 messages

dgrowleyml@gmail.com

over 4 years ago

1 attachment(s)

It seems like a few too many years of an SQL standard without any
standardised way to LIMIT the number of records in a result set caused
various applications to adopt some strange ways to get this behaviour.
Over here in the PostgreSQL world, we just type LIMIT n; at the end of
our queries. I believe Oracle people did a few tricks with a special
column named "rownum". Another set of people needed SQL that would
work over multiple DBMSes and used something like:

SELECT * FROM (SELECT ... row_number() over (order by ...) rn) a WHERE rn <= 10;

I believe it's fairly common to do paging this way on commerce sites.

The problem with PostgreSQL here is that neither the planner nor
executor knows that once we get to row_number 11 that we may as well
stop. The number will never go back down in this partition.

I'd like to make this better for PostgreSQL 15. I've attached a WIP
patch to do so.

How this works is that I've added prosupport functions for each of
row_number(), rank() and dense_rank(). When doing qual pushdown, if
we happen to hit a windowing function, instead of rejecting the
pushdown, we see if there's a prosupport function and if there is, ask
it if this qual can be used to allow us to stop emitting tuples from
the Window node by making use of this qual. I've called these "run
conditions". Basically, keep running while this remains true. Stop
when it's not.
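To make the idea concrete, here is a minimal standalone sketch of a window node that computes row_number() and stops as soon as a run condition turns false. It is not the patch's executor code; RunCondition, run_condition_holds() and run_window() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical run condition of the form "row_number <= limit". */
typedef struct RunCondition
{
    long limit;
} RunCondition;

static bool
run_condition_holds(const RunCondition *rc, long rn)
{
    return rn <= rc->limit;
}

/*
 * Simulate a window node computing row_number() over 'nrows' input rows.
 * Returns the number of rows emitted.  '*evaluated' receives the number of
 * rows for which the window function had to be computed.  Pass rc = NULL to
 * simulate a node without any run condition.
 */
long
run_window(long nrows, const RunCondition *rc, long *evaluated)
{
    long emitted = 0;

    assert(evaluated != NULL);
    *evaluated = 0;

    for (long rn = 1; rn <= nrows; rn++)
    {
        (*evaluated)++;

        /*
         * row_number() only ever goes up, so once the condition is false it
         * can never become true again.  Stop here.
         */
        if (rc != NULL && !run_condition_holds(rc, rn))
            break;

        emitted++;
    }

    return emitted;
}
```

With a limit of 10 and a million input rows, this sketch emits 10 rows and evaluates only 11, which is the saving described above.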

We can't always use the qual directly. For example, if someone does:

SELECT * FROM (SELECT ... row_number() over (order by ...) rn) a WHERE rn = 10;

then if we use the rn = 10 qual, we'd think we could stop right away.
Instead, I've made the prosupport function handle this by generating a
rn <= 10 qual so that we can stop once we get to 11. In this case we
cannot completely push down the qual. It needs to remain in place to
filter out rn values 1-9.

Row_number(), rank() and dense_rank() are all monotonically increasing
functions. But we're not limited to just those. COUNT(*) works too
providing the frame bounds guarantee that the function is either
monotonically increasing or decreasing.

COUNT(*) OVER (ORDER BY .. ROWS BETWEEN CURRENT ROW AND UNBOUNDED
FOLLOWING) is monotonically decreasing, whereas the standard bound
options would make it monotonically increasing.
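As a rough illustration of why the frame bounds decide the direction, the sketch below computes COUNT(*) for each row under the two frames mentioned. It assumes ROWS mode with no peer rows, which is simpler than the default RANGE frame, and FrameKind, frame_count() and count_direction() are hypothetical names, not PostgreSQL's frameOptions flags.

```c
#include <assert.h>

typedef enum FrameKind
{
    FRAME_UNBOUNDED_PRECEDING_TO_CURRENT_ROW,
    FRAME_CURRENT_ROW_TO_UNBOUNDED_FOLLOWING
} FrameKind;

/* COUNT(*) seen by 1-based row 'i' out of 'n' rows in the partition. */
long
frame_count(FrameKind kind, long i, long n)
{
    assert(i >= 1 && i <= n);

    if (kind == FRAME_UNBOUNDED_PRECEDING_TO_CURRENT_ROW)
        return i;               /* rows 1..i are in the frame */
    return n - i + 1;           /* rows i..n are in the frame */
}

/*
 * Returns +1 if the count never decreases and increases at least once,
 * -1 if it never increases and decreases at least once, 0 otherwise.
 */
int
count_direction(FrameKind kind, long n)
{
    int ups = 0;
    int downs = 0;

    for (long i = 2; i <= n; i++)
    {
        long prev = frame_count(kind, i - 1, n);
        long cur = frame_count(kind, i, n);

        if (cur > prev)
            ups++;
        else if (cur < prev)
            downs++;
    }

    if (ups > 0 && downs == 0)
        return 1;
    if (downs > 0 && ups == 0)
        return -1;
    return 0;
}
```

A frame anchored at the start of the partition can support < and <= run conditions, and one anchored at the end can support > and >=.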

The same could be done for MIN() and MAX(). I just don't think that's
worth doing. It seems unlikely that would get enough use.

Anyway. I'd like to work on this more during the PG15 cycle. I
believe the attached patch makes this work ok. There are just a few
things to iron out.

1) Unsure of the API to the prosupport function. I wonder if the
prosupport function should just be able to say if the function is
either monotonically increasing or decreasing or neither then have
core code build a qual. That would make the job of building new
functions easier, but massively reduce the flexibility of the feature.
I'm just not sure it needs to do more in the future.

2) Unsure if what I've got to make EXPLAIN show the run condition is
the right way to do it. Because I don't want nodeWindow.c to have to
re-evaluate the window function to determine if the run condition is
no longer met, I've coded the qual to reference the varno in the
window node's targetlist. That qual is no good for EXPLAIN, so I had to
include another set of quals that include the WindowFunc reference. I
saw that Index Only Scans have a similar means to make EXPLAIN work,
so I just followed that.
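For point 1, the simpler alternative might look roughly like the sketch below: the support function only reports a direction, and core code decides which operator makes a valid run condition. This is purely hypothetical and is not what the attached patch does. Monotonicity, CmpOp and core_build_run_condition() are invented names, and the sketch ignores the cross-type opfamily concern by normalising the window function to the left-hand side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum Monotonicity { MONO_NONE, MONO_INCREASING, MONO_DECREASING } Monotonicity;
typedef enum CmpOp { CMP_LT, CMP_LE, CMP_EQ, CMP_GE, CMP_GT } CmpOp;

/* "const op wfunc" is equivalent to "wfunc commute(op) const". */
static CmpOp
commute(CmpOp op)
{
    switch (op)
    {
        case CMP_LT: return CMP_GT;
        case CMP_LE: return CMP_GE;
        case CMP_GE: return CMP_LE;
        case CMP_GT: return CMP_LT;
        default:     return CMP_EQ;
    }
}

/*
 * Given only the direction reported by a (hypothetical) support function,
 * decide whether "wfunc op const" yields a run condition.  On success,
 * '*run_op' is the operator to use with the window function on the left and
 * '*keep_original' says whether the original qual must remain as a filter.
 */
bool
core_build_run_condition(Monotonicity mono, CmpOp op, bool wfunc_left,
                         CmpOp *run_op, bool *keep_original)
{
    bool upper;
    bool lower;

    assert(run_op != NULL && keep_original != NULL);
    *keep_original = true;

    if (mono == MONO_NONE)
        return false;

    if (!wfunc_left)
        op = commute(op);       /* normalise: window function on the left */

    upper = (op == CMP_LT || op == CMP_LE);
    lower = (op == CMP_GT || op == CMP_GE);

    if ((mono == MONO_INCREASING && upper) ||
        (mono == MONO_DECREASING && lower))
    {
        *run_op = op;
        *keep_original = false; /* run condition does all the filtering */
        return true;
    }

    if (op == CMP_EQ)
    {
        /* Run up to and including the match; keep the equality filter. */
        *run_op = (mono == MONO_INCREASING) ? CMP_LE : CMP_GE;
        return true;
    }

    return false;
}
```

The trade-off is the one described in point 1: a new window function would only need to report a direction, but it could no longer express anything beyond what this fixed core logic understands.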

David

Attachments:

v1-0001-Allow-some-window-functions-to-finish-execution-e.patch (application/octet-stream)

From 99a0fae50ed178d88b423e765e3ef1d3c79f68a8 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Sat, 8 May 2021 23:43:25 +1200
Subject: [PATCH v1] Allow some window functions to finish execution early

Window functions such as row_number() always return a value higher than
the one previously in any given partition.  If a query were to only
request the first few row numbers, then traditionally we would continue
evaluating the WindowAgg node until all tuples are exhausted.  However, if
someone only wanted, say, all row numbers <= 10, then we could just stop
once we get to number 11.

Here we implement a means to do just that.  This is done by way of adding a
prosupport function to several of the built-in window functions and adding
supporting code to allow them to review any OpExprs that are comparing
their output to another argument.  If the prosupport function decides that
the given qual is suitable, then the qual gets added to the WindowFunc's
"runcondition".  These runconditions are accumulated for each WindowFunc
in the WindowAgg node and during execution, we just stop execution
whenever this condition is no longer true.

In many cases the input qual can just become the runcondition verbatim.
However, when the user compares, say, a row_number() result with an
equality operator, then that qual cannot be used verbatim.  We must
replace it with something that gets all results up to and including the
equality operand.  For example, let rn be the alias for a row_number()
column, if the user does rn = 10, then we must make rn <= 10 the run
condition.  That will be true right up until rn becomes 11. When that
occurs, we stop execution.

Here we add prosupport functions to allow this to work for row_number(),
rank(), and dense_rank().
---
 src/backend/commands/explain.c          |   4 +
 src/backend/executor/nodeWindowAgg.c    |  27 ++-
 src/backend/nodes/copyfuncs.c           |   4 +
 src/backend/nodes/equalfuncs.c          |   2 +
 src/backend/nodes/outfuncs.c            |   4 +
 src/backend/nodes/readfuncs.c           |   4 +
 src/backend/optimizer/path/allpaths.c   | 183 ++++++++++++++++-
 src/backend/optimizer/plan/createplan.c |   7 +-
 src/backend/optimizer/plan/setrefs.c    |   8 +
 src/backend/parser/parse_collate.c      |   3 +-
 src/backend/utils/adt/int8.c            | 121 +++++++++++
 src/backend/utils/adt/windowfuncs.c     | 220 ++++++++++++++++++++
 src/backend/utils/misc/queryjumble.c    |   2 +-
 src/include/catalog/pg_proc.dat         |  35 +++-
 src/include/nodes/execnodes.h           |   4 +
 src/include/nodes/nodes.h               |   3 +-
 src/include/nodes/parsenodes.h          |   2 +
 src/include/nodes/plannodes.h           |   3 +
 src/include/nodes/supportnodes.h        |  86 +++++++-
 src/test/regress/expected/window.out    | 255 ++++++++++++++++++++++++
 src/test/regress/sql/window.sql         | 132 ++++++++++++
 21 files changed, 1090 insertions(+), 19 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e81b990092..7926bb5743 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1977,6 +1977,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_WindowAgg:
+			show_upper_qual(((WindowAgg *) plan)->runconditionorig,
+							"Run Condition", planstate, ancestors, es);
+			break;
 		case T_Group:
 			show_group_keys(castNode(GroupState, planstate), ancestors, es);
 			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index f8ea9e96d8..278a0fad83 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -2023,6 +2023,7 @@ static TupleTableSlot *
 ExecWindowAgg(PlanState *pstate)
 {
 	WindowAggState *winstate = castNode(WindowAggState, pstate);
+	TupleTableSlot *slot;
 	ExprContext *econtext;
 	int			i;
 	int			numfuncs;
@@ -2235,7 +2236,20 @@ ExecWindowAgg(PlanState *pstate)
 	 */
 	econtext->ecxt_outertuple = winstate->ss.ss_ScanTupleSlot;
 
-	return ExecProject(winstate->ss.ps.ps_ProjInfo);
+	slot = ExecProject(winstate->ss.ps.ps_ProjInfo);
+
+	/*
+	 * Now evaluate the run condition to see if we need to continue further
+	 * with execution.
+	 */
+	econtext->ecxt_scantuple = slot;
+	if (!ExecQual(winstate->runcondition, econtext))
+	{
+		winstate->all_done = true;
+		return NULL;
+	}
+
+	return slot;
 }
 
 /* -----------------
@@ -2307,6 +2321,17 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 	Assert(node->plan.qual == NIL);
 	winstate->ss.ps.qual = NULL;
 
+	/*
+	 * Set up the run condition, if we received one from the query planner.
+	 * When set, this can allow us to finish execution early because we know
+	 * some higher-level filter exists that would just filter out any further
+	 * results that we produce.
+	 */
+	if (node->runcondition != NIL)
+		winstate->runcondition = ExecInitQual(node->runcondition, (PlanState *) winstate);
+	else
+		winstate->runcondition = NULL;
+
 	/*
 	 * initialize child nodes
 	 */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index bd87f23784..1b710203ba 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1107,6 +1107,8 @@ _copyWindowAgg(const WindowAgg *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runcondition);
+	COPY_NODE_FIELD(runconditionorig);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
@@ -2582,6 +2584,8 @@ _copyWindowClause(const WindowClause *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runcondition);
+	COPY_NODE_FIELD(runconditionorig);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index dba3e6b31e..ea0044b357 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2828,6 +2828,8 @@ _equalWindowClause(const WindowClause *a, const WindowClause *b)
 	COMPARE_SCALAR_FIELD(frameOptions);
 	COMPARE_NODE_FIELD(startOffset);
 	COMPARE_NODE_FIELD(endOffset);
+	COMPARE_NODE_FIELD(runcondition);
+	COMPARE_NODE_FIELD(runconditionorig);
 	COMPARE_SCALAR_FIELD(startInRangeFunc);
 	COMPARE_SCALAR_FIELD(endInRangeFunc);
 	COMPARE_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e32b92e299..842a519485 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -819,6 +819,8 @@ _outWindowAgg(StringInfo str, const WindowAgg *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runcondition);
+	WRITE_NODE_FIELD(runconditionorig);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
@@ -3140,6 +3142,8 @@ _outWindowClause(StringInfo str, const WindowClause *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runcondition);
+	WRITE_NODE_FIELD(runconditionorig);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index f0b34ecfac..5cedc50100 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -384,6 +384,8 @@ _readWindowClause(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runcondition);
+	READ_NODE_FIELD(runconditionorig);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
@@ -2345,6 +2347,8 @@ _readWindowAgg(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runcondition);
+	READ_NODE_FIELD(runconditionorig);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 353454b183..89a6102f1b 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/supportnodes.h"
 #ifdef OPTIMIZER_DEBUG
 #include "nodes/print.h"
 #endif
@@ -2090,6 +2091,169 @@ has_multiple_baserels(PlannerInfo *root)
 	return false;
 }
 
+/*
+ * find_window_run_conditions
+ *		Call wfunc's prosupport function to ask if 'opexpr' might help to
+ *		allow the executor to stop processing WindowAgg nodes early.
+ *
+ * For example row_number() over (order by ...) always produces a value one
+ * higher than the previous.  If someone has a window function such as that
+ * in a subquery and just wants, say all rows with a row number less than or
+ * equal to 10, then we may as well stop processing the windowagg once the
+ * row number reaches 11.  Here we look for opexprs that might help us to abort
+ * doing needless extra processing in WindowAgg nodes.
+ *
+ * To do this we make use of the window function's prosupport function. We
+ * pass along the 'opexpr' to ask if there is a suitable OpExpr that we can
+ * use to help stop WindowAggs processing early.
+ *
+ * '*keep_original' is set to true if the caller should also use 'opexpr' for
+ * its original purpose.  This is set to false if the caller can assume that
+ * the runcondition will handle all of the required filtering.
+ *
+ * Returns true if a runcondition qual was found and added to the
+ * wclause->runcondition list.  Returns false if no clause was found.
+ */
+static bool
+find_window_run_conditions(Query *subquery, RangeTblEntry *rte, Index rti,
+						   AttrNumber attno, WindowClause *wclause,
+						   WindowFunc *wfunc, OpExpr *opexpr, bool wfunc_left,
+						   bool *keep_original)
+{
+	Oid			prosupport;
+	WindowFunctionRunCondition req;
+	WindowFunctionRunCondition *res;
+
+	*keep_original = true;
+
+	prosupport = get_func_support(wfunc->winfnoid);
+
+	/* Check if there's a support function that can validate 'opexpr' */
+	if (!OidIsValid(prosupport))
+	{
+		*keep_original = true;
+		return false;
+	}
+
+	req.type = T_WindowFunctionRunCondition;
+	req.opexpr = opexpr;
+	req.window_func = wfunc;
+	req.window_clause = wclause;
+	req.wfunc_left = wfunc_left;
+	req.runopexpr = NULL;			/* default */
+	req.keep_original = true;		/* default */
+
+	res = (WindowFunctionRunCondition *)
+		DatumGetPointer(OidFunctionCall1(prosupport,
+			PointerGetDatum(&req)));
+
+	/* return the run condition, if there is one */
+	if (res && res->runopexpr != NULL)
+	{
+		Expr *origexpr;
+		OpExpr *rexpr = res->runopexpr;
+
+		wclause->runcondition = lappend(wclause->runcondition, rexpr);
+		*keep_original = res->keep_original;
+
+		/*
+		 * We must also create a version of the qual that we can display in
+		 * EXPLAIN.
+		 */
+		if (wfunc_left)
+			origexpr = make_opclause(rexpr->opno, rexpr->opresulttype,
+									 rexpr->opretset, (Expr *) wfunc,
+									 (Expr *) lsecond(rexpr->args),
+									 rexpr->opcollid, rexpr->inputcollid);
+		else
+			origexpr = make_opclause(rexpr->opno, rexpr->opresulttype,
+									 rexpr->opretset,
+									 (Expr *) lsecond(rexpr->args),
+									 (Expr *) wfunc, rexpr->opcollid,
+									 rexpr->inputcollid);
+
+		wclause->runconditionorig = lappend(wclause->runconditionorig, origexpr);
+		return true;
+	}
+
+	/* prosupport function didn't support our request */
+	return false;
+}
+
+/*
+ * check_and_push_window_quals
+ *		Check if 'rinfo' is a qual that can be pushed into a WindowFunc as a
+ *		'runcondition' qual.  These, when present, cause the window function
+ *		evaluation to stop when the condition becomes false.
+ *
+ * Returns true if the caller still must keep the original qual or false if
+ * the caller can safely ignore the original qual because the window function
+ * will use the runcondition to stop at the right time.
+ */
+static bool
+check_and_push_window_quals(Query *subquery, RangeTblEntry *rte, Index rti,
+							Node *clause)
+{
+	OpExpr	   *opexpr = (OpExpr *) clause;
+	bool		keep_original = true;
+	Var		   *var1;
+	Var		   *var2;
+
+	if (!IsA(opexpr, OpExpr))
+		return true;
+
+	if (list_length(opexpr->args) != 2)
+		return true;
+
+	/*
+	 * Check for plain Vars which reference window functions in the subquery.
+	 * If we find any, we'll ask find_window_run_conditions() if there are
+	 * any useful conditions that might allow us to stop windowagg execution
+	 * early.
+	 */
+
+	/* Check the left side of the OpExpr */
+	var1 = linitial(opexpr->args);
+	if (IsA(var1, Var) && var1->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var1->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+		if (IsA(wfunc, WindowFunc))
+		{
+			WindowClause *wclause = (WindowClause *)
+												list_nth(subquery->windowClause,
+														 wfunc->winref - 1);
+
+			if (find_window_run_conditions(subquery, rte, rti, tle->resno,
+										   wclause, wfunc, opexpr, true,
+										   &keep_original))
+				return keep_original;
+		}
+	}
+
+	/* and check the right side */
+	var2 = lsecond(opexpr->args);
+	if (IsA(var2, Var) && var2->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var2->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		if (IsA(wfunc, WindowFunc))
+		{
+			WindowClause *wclause = (WindowClause *)
+												list_nth(subquery->windowClause,
+														 wfunc->winref - 1);
+
+			if (find_window_run_conditions(subquery, rte, rti, tle->resno,
+										   wclause, wfunc, opexpr, false,
+										   &keep_original))
+				return keep_original;
+		}
+	}
+
+	return true;
+}
+
 /*
  * set_subquery_pathlist
  *		Generate SubqueryScan access paths for a subquery RTE
@@ -2178,19 +2342,30 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		foreach(l, rel->baserestrictinfo)
 		{
 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+			Node	   *clause = (Node *)rinfo->clause;
 
 			if (!rinfo->pseudoconstant &&
 				qual_is_pushdown_safe(subquery, rti, rinfo, &safetyInfo))
 			{
-				Node	   *clause = (Node *) rinfo->clause;
-
 				/* Push it down */
 				subquery_push_qual(subquery, rte, rti, clause);
 			}
 			else
 			{
-				/* Keep it in the upper query */
-				upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				/*
+				 * Since we can't push the qual down into the subquery, check
+				 * if it happens to reference a windowing function.  If so
+				 * then it might be useful to allow the window evaluation to
+				 * stop early.
+				 */
+				if (check_and_push_window_quals(subquery, rte, rti, clause))
+				{
+					/*
+					 * It's not a suitable window run condition qual or it is,
+					 * but the original must also be kept in the upper query.
+					 */
+					upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				}
 			}
 		}
 		rel->baserestrictinfo = upperrestrictlist;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 439e6b6426..e68b0d99c7 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -289,6 +289,7 @@ static WindowAgg *make_windowagg(List *tlist, Index winref,
 								 int frameOptions, Node *startOffset, Node *endOffset,
 								 Oid startInRangeFunc, Oid endInRangeFunc,
 								 Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
+								 List *runcondition, List *runconditionorig,
 								 Plan *lefttree);
 static Group *make_group(List *tlist, List *qual, int numGroupCols,
 						 AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
@@ -2622,6 +2623,8 @@ create_windowagg_plan(PlannerInfo *root, WindowAggPath *best_path)
 						  wc->inRangeColl,
 						  wc->inRangeAsc,
 						  wc->inRangeNullsFirst,
+						  wc->runcondition,
+						  wc->runconditionorig,
 						  subplan);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6480,7 +6483,7 @@ make_windowagg(List *tlist, Index winref,
 			   int frameOptions, Node *startOffset, Node *endOffset,
 			   Oid startInRangeFunc, Oid endInRangeFunc,
 			   Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
-			   Plan *lefttree)
+			   List *runcondition, List *runconditionorig, Plan *lefttree)
 {
 	WindowAgg  *node = makeNode(WindowAgg);
 	Plan	   *plan = &node->plan;
@@ -6497,6 +6500,8 @@ make_windowagg(List *tlist, Index winref,
 	node->frameOptions = frameOptions;
 	node->startOffset = startOffset;
 	node->endOffset = endOffset;
+	node->runcondition = runcondition;
+	node->runconditionorig = runconditionorig;
 	node->startInRangeFunc = startInRangeFunc;
 	node->endInRangeFunc = endInRangeFunc;
 	node->inRangeColl = inRangeColl;
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 61ccfd300b..8b6ee5a01b 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -870,6 +870,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
 				wplan->endOffset =
 					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+				wplan->runcondition = fix_scan_list(root,
+													wplan->runcondition,
+													rtoffset,
+													NUM_EXEC_TLIST(plan));
+				wplan->runconditionorig = fix_scan_list(root,
+														wplan->runconditionorig,
+														rtoffset,
+														NUM_EXEC_TLIST(plan));
 			}
 			break;
 		case T_Result:
diff --git a/src/backend/parser/parse_collate.c b/src/backend/parser/parse_collate.c
index 4133526f04..9ae33263d3 100644
--- a/src/backend/parser/parse_collate.c
+++ b/src/backend/parser/parse_collate.c
@@ -623,7 +623,8 @@ assign_collations_walker(Node *node, assign_collations_context *context)
 						{
 							/*
 							 * WindowFunc requires special processing only for
-							 * its aggfilter clause, as for aggregates.
+							 * its aggfilter clause, as for aggregates, and for
+							 * its runcondition clause.
 							 */
 							WindowFunc *wfunc = (WindowFunc *) node;
 
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index 2168080dcc..ef74c3b5cd 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -25,6 +25,7 @@
 #include "optimizer/optimizer.h"
 #include "utils/builtins.h"
 #include "utils/int8.h"
+#include "utils/lsyscache.h"
 
 
 typedef struct
@@ -877,6 +878,7 @@ int8dec(PG_FUNCTION_ARGS)
 }
 
 
+
 /*
  * These functions are exactly like int8inc/int8dec but are used for
  * aggregates that count only non-null values.  Since the functions are
@@ -904,6 +906,125 @@ int8dec_any(PG_FUNCTION_ARGS)
 	return int8dec(fcinfo);
 }
 
+/*
+ * int8inc_support
+ *		prosupport function for int8inc() and int8inc_any()
+ */
+Datum
+int8inc_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, WindowFunctionRunCondition))
+	{
+		WindowFunctionRunCondition *req = (WindowFunctionRunCondition *)rawreq;
+
+		OpExpr	   *opexpr = req->opexpr;
+		Oid			opno = opexpr->opno;
+		List	   *opfamilies;
+		ListCell   *lc;
+		int			frameOptions;
+
+		/* Early abort of execution is not possible if there's a PARTITION BY */
+		if (req->window_clause->partitionClause != NULL)
+			PG_RETURN_POINTER(NULL);
+
+		frameOptions = req->window_clause->frameOptions;
+
+		/*
+		 * With FRAMEOPTION_START_UNBOUNDED_PRECEDING we know the count can
+		 * only go up, so we can support run conditions that only want values
+		 * < or <= to a given constant.  With
+		 * FRAMEOPTION_END_UNBOUNDED_FOLLOWING, the row count could only
+		 * possibly decrease, so we can support a run condition that uses > or
+		 * >=.  If we have neither of these two, then bail.
+		 */
+		if ((frameOptions &	(FRAMEOPTION_START_UNBOUNDED_PRECEDING |
+							 FRAMEOPTION_END_UNBOUNDED_FOLLOWING)) == 0)
+			PG_RETURN_POINTER(NULL);
+
+		opfamilies = get_op_btree_interpretation(opno);
+
+		foreach(lc, opfamilies)
+		{
+			OpBtreeInterpretation *oi = (OpBtreeInterpretation *)lfirst(lc);
+
+			if (req->wfunc_left)
+			{
+				/* Handle < / <= */
+				if (oi->strategy == BTLessStrategyNumber ||
+					oi->strategy == BTLessEqualStrategyNumber)
+				{
+					/*
+					 * If the frame is bound to the top of the window then the
+					 * count cannot go down.
+					 */
+					if ((frameOptions &	(FRAMEOPTION_START_UNBOUNDED_PRECEDING)))
+					{
+						req->keep_original = false;
+						req->runopexpr = req->opexpr;
+					}
+					break;
+				}
+				/* Handle > / >= */
+				else if (oi->strategy == BTGreaterStrategyNumber ||
+						 oi->strategy == BTGreaterEqualStrategyNumber)
+				{
+					/*
+					 * If the frame is bound to the bottom of the window then
+					 * the count cannot go up.
+					 */
+					if ((frameOptions &	(FRAMEOPTION_END_UNBOUNDED_FOLLOWING)))
+					{
+						req->keep_original = false;
+						req->runopexpr = req->opexpr;
+					}
+					break;
+				}
+			}
+			else
+			{
+				/* Handle > / >= */
+				if (oi->strategy == BTGreaterStrategyNumber ||
+					oi->strategy == BTGreaterEqualStrategyNumber)
+				{
+					/*
+					 * If the frame is bound to the top of the window then the
+					 * count cannot go down.
+					 */
+					if ((frameOptions &	(FRAMEOPTION_START_UNBOUNDED_PRECEDING)))
+					{
+						req->keep_original = false;
+						req->runopexpr = req->opexpr;
+					}
+					break;
+				}
+				/* Handle < / <= */
+				else if (oi->strategy == BTLessStrategyNumber ||
+						 oi->strategy == BTLessEqualStrategyNumber)
+				{
+					/*
+					 * If the frame is bound to the bottom of the window then
+					 * the count cannot go up.
+					 */
+					if ((frameOptions &	(FRAMEOPTION_END_UNBOUNDED_FOLLOWING)))
+					{
+						req->keep_original = false;
+						req->runopexpr = req->opexpr;
+					}
+					break;
+				}
+			}
+		}
+
+		list_free(opfamilies);
+
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 
 Datum
 int8larger(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/adt/windowfuncs.c b/src/backend/utils/adt/windowfuncs.c
index 9c127617d1..17f1909413 100644
--- a/src/backend/utils/adt/windowfuncs.c
+++ b/src/backend/utils/adt/windowfuncs.c
@@ -13,7 +13,12 @@
  */
 #include "postgres.h"
 
+#include "access/stratnum.h"
+#include "catalog/pg_operator_d.h"
+#include "nodes/makefuncs.h"
+#include "nodes/supportnodes.h"
 #include "utils/builtins.h"
+#include "utils/lsyscache.h"
 #include "windowapi.h"
 
 /*
@@ -73,6 +78,162 @@ rank_up(WindowObject winobj)
 	return up;
 }
 
+/*
+ * process_row_number_rank_run_condition
+ *		Common function for prosupport WindowFunctionRunCondition for
+ *		row_number(), rank() and dense_rank().
+ */
+static void
+process_row_number_rank_run_condition(WindowFunctionRunCondition *req)
+{
+	OpExpr	   *opexpr = req->opexpr;
+	Oid			opno = opexpr->opno;
+	Expr	   *arg;
+	WindowFunc *windowfunc;
+	List	   *opfamilies;
+	ListCell   *lc;
+
+	/* Early abort of execution is not possible if there's a PARTITION BY */
+	if (req->window_clause->partitionClause != NULL)
+	{
+		req->runopexpr = NULL;
+		return;
+	}
+
+	if (!req->wfunc_left)
+	{
+		arg = linitial(opexpr->args);
+		windowfunc = lsecond(opexpr->args);
+	}
+	else
+	{
+		windowfunc = linitial(opexpr->args);
+		arg = lsecond(opexpr->args);
+	}
+
+	/*
+	 * We're only able to handle run conditions that compare the window result
+	 * to a constant.
+	 *
+	 * XXX We could probably do better than just Consts.  Exec Params should
+	 * work too.
+	 */
+	if (!IsA(arg, Const))
+	{
+		req->runopexpr = NULL;
+		return;
+	}
+
+	opfamilies = get_op_btree_interpretation(opno);
+
+	foreach(lc, opfamilies)
+	{
+		OpBtreeInterpretation *oi = (OpBtreeInterpretation *) lfirst(lc);
+
+		if (req->wfunc_left)
+		{
+			/*
+			 * When the opexpr uses < or <= with the window function on the
+			 * left, then we can use the opexpr directly.  We can also set
+			 * keep_original to false as the planner does not need to keep
+			 * this qual as a filter in the query above the subquery
+			 * containing the window function.
+			 */
+			if (oi->strategy == BTLessStrategyNumber ||
+				oi->strategy == BTLessEqualStrategyNumber)
+			{
+				req->keep_original = false;
+				req->runopexpr = req->opexpr;
+				break;
+			}
+
+			/*
+			 * For equality conditions we need all rows up until the const
+			 * being compared.  We make an OpExpr with a <= operator so that
+			 * we stop processing just after we find our equality match.
+			 */
+			else if (oi->strategy == BTEqualStrategyNumber)
+			{
+				OpExpr *newopexpr;
+				Oid leop;
+
+				leop = get_opfamily_member(oi->opfamily_id,
+											oi->oplefttype,
+											oi->oprighttype,
+											BTLessEqualStrategyNumber);
+
+				newopexpr = (OpExpr *) make_opclause(leop,
+													 opexpr->opresulttype,
+													 opexpr->opretset,
+													 (Expr *) windowfunc,
+													 arg,
+													 opexpr->opcollid,
+													 opexpr->inputcollid);
+				newopexpr->opfuncid = get_opcode(leop);
+
+				/*
+				 * We must keep the original equality condition as this <=
+				 * OpExpr won't filter out all the earlier records.
+				 */
+				req->keep_original = true;
+				req->runopexpr = newopexpr;
+				break;
+			}
+		}
+		else
+		{
+			/*
+			 * When the window function var is on the right, we look for > and
+			 * >= operators.  e.g: 10 >= row_number() ...
+			 */
+			if (oi->strategy == BTGreaterStrategyNumber ||
+				oi->strategy == BTGreaterEqualStrategyNumber)
+			{
+				req->keep_original = false;
+				req->runopexpr = req->opexpr;
+				break;
+			}
+
+			/*
+			 * For equality conditions we need all rows up until the const
+			 * being compared.  The window function is on the right here, so
+			 * we make an OpExpr with <const> >= <wfunc> so that we stop
+			 * processing just after we find our equality match. We don't
+			 * reverse the condition and use <= because we may have a cross
+			 * type opfamily.
+			 */
+			else if (oi->strategy == BTEqualStrategyNumber)
+			{
+				OpExpr *newopexpr;
+				Oid geop;
+
+				geop = get_opfamily_member(oi->opfamily_id,
+											oi->oplefttype,
+											oi->oprighttype,
+											BTGreaterEqualStrategyNumber);
+
+				newopexpr = (OpExpr *) make_opclause(geop,
+													 opexpr->opresulttype,
+													 opexpr->opretset,
+													 arg,
+													 (Expr *) windowfunc,
+													 opexpr->opcollid,
+													 opexpr->inputcollid);
+				newopexpr->opfuncid = get_opcode(geop);
+
+				/*
+				 * We must keep the original equality condition as this >=
+				 * OpExpr won't filter out all the earlier records.
+				 */
+				req->keep_original = true;
+				req->runopexpr = newopexpr;
+				break;
+			}
+		}
+	}
+
+	list_free(opfamilies);
+}
 
 /*
  * row_number
@@ -88,6 +249,25 @@ window_row_number(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(curpos + 1);
 }
 
+/*
+ * window_row_number_support
+ *		prosupport function for window_row_number()
+ */
+Datum
+window_row_number_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, WindowFunctionRunCondition))
+	{
+		WindowFunctionRunCondition *req = (WindowFunctionRunCondition *) rawreq;
+
+		process_row_number_rank_run_condition(req);
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
 
 /*
  * rank
@@ -110,6 +290,26 @@ window_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_rank_support
+ *		prosupport function for window_rank()
+ */
+Datum
+window_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, WindowFunctionRunCondition))
+	{
+		WindowFunctionRunCondition *req = (WindowFunctionRunCondition *) rawreq;
+
+		process_row_number_rank_run_condition(req);
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * dense_rank
  * Rank increases by 1 when key columns change.
@@ -130,6 +330,26 @@ window_dense_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_dense_rank_support
+ *		prosupport function for window_dense_rank()
+ */
+Datum
+window_dense_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, WindowFunctionRunCondition))
+	{
+		WindowFunctionRunCondition *req = (WindowFunctionRunCondition *) rawreq;
+
+		process_row_number_rank_run_condition(req);
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * percent_rank
  * return fraction between 0 and 1 inclusive,
diff --git a/src/backend/utils/misc/queryjumble.c b/src/backend/utils/misc/queryjumble.c
index 9f2cd1f127..c100301e5c 100644
--- a/src/backend/utils/misc/queryjumble.c
+++ b/src/backend/utils/misc/queryjumble.c
@@ -434,7 +434,7 @@ JumbleExpr(JumbleState *jstate, Node *node)
 				APP_JUMB(expr->winref);
 				JumbleExpr(jstate, (Node *) expr->args);
 				JumbleExpr(jstate, (Node *) expr->aggfilter);
-			}
+		}
 			break;
 		case T_SubscriptingRef:
 			{
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fde251fa4f..31953d6dfa 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6558,11 +6558,16 @@
 # count has two forms: count(any) and count(*)
 { oid => '2147',
   descr => 'number of input rows for which the input expression is not null',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => 'any', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => 'any',
+  prosrc => 'aggregate_dummy' },
 { oid => '2803', descr => 'number of input rows',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => '', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'aggregate_dummy' },
+{ oid => '8802', descr => 'planner support for count run condition',
+  proname => 'int8inc_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'int8inc_support' },
 
 { oid => '2718',
   descr => 'population variance of bigint input values (square of the population standard deviation)',
@@ -9996,14 +10001,26 @@
 
 # SQL-spec window functions
 { oid => '3100', descr => 'row number within partition',
-  proname => 'row_number', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_row_number' },
+  proname => 'row_number', prosupport => 'window_row_number_support',
+  prokind => 'w', proisstrict => 'f',  prorettype => 'int8',
+  proargtypes => '', prosrc => 'window_row_number' },
+{ oid => '8799', descr => 'planner support for row_number run condition',
+  proname => 'window_row_number_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_row_number_support' },
 { oid => '3101', descr => 'integer rank with gaps',
-  proname => 'rank', prokind => 'w', proisstrict => 'f', prorettype => 'int8',
+  proname => 'rank', prosupport => 'window_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8',
   proargtypes => '', prosrc => 'window_rank' },
+{ oid => '8800', descr => 'planner support for rank run condition',
+  proname => 'window_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_rank_support' },
 { oid => '3102', descr => 'integer rank without gaps',
-  proname => 'dense_rank', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_dense_rank' },
+  proname => 'dense_rank', prosupport => 'window_dense_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'window_dense_rank' },
+{ oid => '8801', descr => 'planner support for dense rank run condition',
+  proname => 'window_dense_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_dense_rank_support' },
 { oid => '3103', descr => 'fractional rank within partition',
   proname => 'percent_rank', prokind => 'w', proisstrict => 'f',
   prorettype => 'float8', proargtypes => '', prosrc => 'window_percent_rank' },
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0ec5509e7e..8f9bbf2922 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2407,6 +2407,10 @@ typedef struct WindowAggState
 	MemoryContext curaggcontext;	/* current aggregate's working data */
 	ExprContext *tmpcontext;	/* short-term evaluation context */
 
+	ExprState   *runcondition;	/* Condition which must remain true otherwise
+								 * execution of the WindowAgg will finish, or
+								 * NULL if there is no such condition. */
+
 	bool		all_first;		/* true if the scan is starting */
 	bool		all_done;		/* true if the scan is finished */
 	bool		partition_spooled;	/* true if all tuples in current partition
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index d9e417bcd7..3da1fc4cda 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -527,7 +527,8 @@ typedef enum NodeTag
 	T_SupportRequestSelectivity,	/* in nodes/supportnodes.h */
 	T_SupportRequestCost,		/* in nodes/supportnodes.h */
 	T_SupportRequestRows,		/* in nodes/supportnodes.h */
-	T_SupportRequestIndexCondition	/* in nodes/supportnodes.h */
+	T_SupportRequestIndexCondition,	/* in nodes/supportnodes.h */
+	T_WindowFunctionRunCondition /* in nodes/supportnodes.h */
 } NodeTag;
 
 /*
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index def9651b34..783dc2e97d 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1381,6 +1381,8 @@ typedef struct WindowClause
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runcondition;	/* quals that must remain true to continue */
+	List	   *runconditionorig; /* Used for EXPLAIN */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
 	Oid			inRangeColl;	/* collation for in_range tests */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index aaa3b65d04..9b304fdd07 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -890,6 +890,9 @@ typedef struct WindowAgg
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runcondition;	/* Conditions that must remain true in order
+								 * for execution to continue */
+	List	   *runconditionorig;	/* runcondition for display in EXPLAIN */
 	/* these fields are used with RANGE offset PRECEDING/FOLLOWING: */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
diff --git a/src/include/nodes/supportnodes.h b/src/include/nodes/supportnodes.h
index 85e1b8a832..cb81cc5e6e 100644
--- a/src/include/nodes/supportnodes.h
+++ b/src/include/nodes/supportnodes.h
@@ -38,7 +38,7 @@
 struct PlannerInfo;				/* avoid including pathnodes.h here */
 struct IndexOptInfo;
 struct SpecialJoinInfo;
-
+struct WindowClause;
 
 /*
  * The Simplify request allows the support function to perform plan-time
@@ -239,4 +239,88 @@ typedef struct SupportRequestIndexCondition
 								 * equivalent of the function call */
 } SupportRequestIndexCondition;
 
+/* ----------
+ * To allow more efficient execution of our monotonically increasing window
+ * functions, we support calling the window function's prosupport function
+ * passing along this struct whenever the planner sees an OpExpr qual directly
+ * reference a windowing function in a subquery.  When the planner encounters
+ * this, we populate this struct and pass it along to the windowing function's
+ * prosupport function so that it can evaluate if the OpExpr can be used to
+ * allow execution of the WindowAgg node to end early.
+ *
+ * This allows queries such as the following one to end execution once "rn"
+ * reaches 3.
+ *
+ * SELECT * FROM (
+ *		SELECT
+ *				col,
+ *				row_number() over (order by col) rn FROM tab
+ * ) t
+ * WHERE rn < 3;
+ *
+ * The prosupport function must properly determine all cases where such an
+ * optimization is possible and leave 'runopexpr' set to NULL in cases where
+ * the optimization is not possible.  It's not possible to stop execution
+ * early if the above example had a PARTITION BY clause as "rn" would drop
+ * back to 1 in each new partition.
+ *
+ * We have a few built-in windowing functions which return a monotonically
+ * increasing value. This optimization is ideal for those.
+ *
+ * It's also possible to use this for windowing functions that have a
+ * monotonically decreasing value.  An example of this would be COUNT(*) with
+ * the frame option UNBOUNDED FOLLOWING, the return value could only ever stay
+ * the same or decrease.  In this case, it would be possible to stop execution
+ * early if there was some qual that expressed the count must be above a given
+ * value.
+ *
+ * Inputs:
+ *	'opexpr' is the qual which the planner would like evaluated.
+ *
+ *	'window_func' is the pointer to the window function being called.
+ *
+ *	'window_clause' pointer to the WindowClause data.  Support functions can
+ *	use this to check frame bounds and partition by clauses, etc.
+ *
+ *	'wfunc_left' indicates which way around the OpExpr is.  True indicates
+ *	that the LHS of the opexpr is the window Var.  False indicates it's on the
+ *	RHS.
+ *
+ * Outputs:
+ *	'runopexpr' the OpExpr that the planner should evaluate during execution.
+ *	This will be evaluated at the end of the WindowAgg call during
+ *	execution.  If the expression evaluates to NULL or False then the
+ *	WindowAgg will return NULL to indicate there are no more tuples.  Support
+ *	functions may set this to the input 'opexpr' when that expression is
+ *	suitable, or they may craft their own suitable expression.  This output
+ *	argument defaults to NULL. If the 'opexpr' is not suitable then this
+ *	output must remain NULL.
+ *
+ *	'keep_original' indicates to the planner if it should also keep the
+ *	opexpr as a filter on the parent query.  This defaults to True, which is
+ *	the safest option if you're unsure.  Only set it to false if the
+ *	'runopexpr' will correctly filter all of the records that 'opexpr' would.
+ * ----------
+ */
+typedef struct WindowFunctionRunCondition
+{
+	NodeTag		type;
+
+	/* Input fields: */
+	OpExpr		   *opexpr;				/* Input OpExpr to check */
+	WindowFunc	   *window_func;	/* Pointer to the window function data */
+	struct WindowClause   *window_clause;	/* Pointer to the window clause
+											 * data */
+	bool			wfunc_left;		/* True if window func is on left of opexpr */
+
+	/* Output fields: */
+	OpExpr		   *runopexpr;		/* Output OpExpr to use or NULL if input
+									 * does not support a run condition. */
+	bool			keep_original;	/* True if the planner should still use
+									 * the original qual in the base quals of
+									 * the parent query or false if it's safe
+									 * to ignore it due to the run condition
+									 * replacing it */
+} WindowFunctionRunCondition;
+
 #endif							/* SUPPORTNODES_H */
diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out
index 19e2ac518a..9c4c8a8387 100644
--- a/src/test/regress/expected/window.out
+++ b/src/test/regress/expected/window.out
@@ -3196,6 +3196,261 @@ WHERE depname = 'sales';
                            ->  Seq Scan on empsalary
 (9 rows)
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                  QUERY PLAN                  
+----------------------------------------------
+ WindowAgg
+   Run Condition: (row_number() OVER (?) < 3)
+   ->  Sort
+         Sort Key: empsalary.empno
+         ->  Seq Scan on empsalary
+(5 rows)
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+               QUERY PLAN                
+-----------------------------------------
+ WindowAgg
+   Run Condition: (rank() OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+ empno | salary | r 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 2
+    11 |   5200 | 2
+(3 rows)
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Run Condition: (dense_rank() OVER (?) <= 1)
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+ empno | salary | dr 
+-------+--------+----
+     8 |   6000 |  1
+(1 row)
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                       QUERY PLAN                        
+---------------------------------------------------------
+ WindowAgg
+   Run Condition: (count(empsalary.empno) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) >= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.rn < 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.depname, empsalary.empno
+               ->  Seq Scan on empsalary
+(6 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.c <= 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.depname, empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(6 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                  QUERY PLAN                   
+-----------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.c <= 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(6 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+                QUERY PLAN                
+------------------------------------------
+ Subquery Scan on emp
+   Filter: (3 <= emp.c)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary
+               ->  Seq Scan on empsalary
+(6 rows)
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
diff --git a/src/test/regress/sql/window.sql b/src/test/regress/sql/window.sql
index eae5fa6017..63f660299c 100644
--- a/src/test/regress/sql/window.sql
+++ b/src/test/regress/sql/window.sql
@@ -922,6 +922,138 @@ SELECT * FROM
    FROM empsalary) emp
 WHERE depname = 'sales';
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
-- 
2.30.2

Pavel Stehule

pavel.stehule@gmail.com

over 4 years ago

In reply to: David Rowley (#1)

Re: Window Function "Run Conditions"

On Thu, 1 Jul 2021 at 11:11, David Rowley <dgrowleyml@gmail.com> wrote:

It seems like a few too many years of an SQL standard without any
standardised way to LIMIT the number of records in a result set caused
various applications to adopt some strange ways to get this behaviour.
Over here in the PostgreSQL world, we just type LIMIT n; at the end of
our queries. I believe Oracle people did a few tricks with a special
column named "rownum". Another set of people needed SQL that would
work over multiple DBMSes and used something like:

SELECT * FROM (SELECT ... row_number() over (order by ...) rn) a WHERE rn <= 10;

I believe it's fairly common to do paging this way on commerce sites.

The problem with PostgreSQL here is that neither the planner nor
executor knows that once we get to row_number 11 that we may as well
stop. The number will never go back down in this partition.

I'd like to make this better for PostgreSQL 15. I've attached a WIP
patch to do so.

How this works is that I've added prosupport functions for each of
row_number(), rank() and dense_rank(). When doing qual pushdown, if
we happen to hit a windowing function, instead of rejecting the
pushdown, we see if there's a prosupport function and if there is, ask
it if this qual can be used to allow us to stop emitting tuples from
the Window node by making use of this qual. I've called these "run
conditions". Basically, keep running while this remains true. Stop
when it's not.

We can't always use the qual directly. For example, if someone does:

SELECT * FROM (SELECT ... row_number() over (order by ...) rn) a WHERE rn = 10;

then if we use the rn = 10 qual, we'd think we could stop right away.
Instead, I've made the prosupport function handle this by generating a
rn <= 10 qual so that we can stop once we get to 11. In this case we
cannot completely push down the qual. It needs to remain in place to
filter out rn values 1-9.
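
To make the rn = 10 case concrete, here is a minimal sketch in C that simulates the idea outside of PostgreSQL. It is not code from the patch: `ScanResult` and `scan_with_run_condition` are hypothetical names, and the loop stands in for a WindowAgg computing row_number() over an already sorted input. The derived rn <= target condition decides when to stop, while the original rn = target qual still filters what was emitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters for the simulation; not part of the patch. */
typedef struct ScanResult
{
    size_t      rows_emitted;   /* rows the simulated WindowAgg handed up */
    size_t      rows_returned;  /* rows that also passed "rn = target" */
} ScanResult;

/*
 * Simulate "WHERE rn = target" over nrows input rows, where rn is
 * row_number().  The run condition is the derived "rn <= target";
 * the original equality qual remains as a filter above it.
 */
static ScanResult
scan_with_run_condition(size_t nrows, size_t target)
{
    ScanResult  res = {0, 0};

    /* row_number() starts at 1, so smaller targets make no sense here */
    assert(target >= 1);

    for (size_t rn = 1; rn <= nrows; rn++)
    {
        /* run condition: stop as soon as it is no longer true */
        if (!(rn <= target))
            break;
        res.rows_emitted++;

        /* original qual: still needed to discard rn values below target */
        if (rn == target)
            res.rows_returned++;
    }
    return res;
}
```

With 1000 input rows and a target of 10, this sketch emits 10 rows and returns 1, instead of walking all 1000. Using rn = 10 itself as the stop condition would end the loop at rn = 1 and return nothing.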

Row_number(), rank() and dense_rank() are all monotonically increasing
functions. But we're not limited to just those. COUNT(*) works too
providing the frame bounds guarantee that the function is either
monotonically increasing or decreasing.

COUNT(*) OVER (ORDER BY .. ROWS BETWEEN CURRENT ROW AND UNBOUNDED
FOLLOWING) is monotonically decreasing, whereas the standard bound
options would make it monotonically increasing.

The same could be done for MIN() and MAX(). I just don't think that's
worth doing. It seems unlikely that would get enough use.

Anyway. I'd like to work on this more during the PG15 cycle. I
believe the attached patch makes this work ok. There are just a few
things to iron out.

1) Unsure of the API to the prosupport function. I wonder if the
prosupport function should just be able to say if the function is
either monotonically increasing or decreasing or neither then have
core code build a qual. That would make the job of building new
functions easier, but massively reduce the flexibility of the feature.
I'm just not sure it needs to do more in the future.

2) Unsure if what I've got to make EXPLAIN show the run condition is
the right way to do it. Because I don't want nodeWindow.c to have to
re-evaluate the window function to determine if the run condition is
no longer met, I've coded the qual to reference the varno in the
window node's targetlist. That qual is no good for EXPLAIN so I had to
include another set of quals that include the WindowFunc reference. I
saw that Index Only Scans have a similar means to make EXPLAIN work,
so I just followed that.

This could be a very nice feature.

Pavel

David

David Rowley

dgrowleyml@gmail.com

over 4 years ago

In reply to: David Rowley (#1)

1 attachment(s)

Re: Window Function "Run Conditions"

On Thu, 1 Jul 2021 at 21:11, David Rowley <dgrowleyml@gmail.com> wrote:

1) Unsure of the API to the prosupport function. I wonder if the
prosupport function should just be able to say if the function is
either monotonically increasing or decreasing or neither then have
core code build a qual. That would make the job of building new
functions easier, but massively reduce the flexibility of the feature.
I'm just not sure it needs to do more in the future.

I looked at this patch again today and ended up changing the API that
I'd done for the prosupport functions. These now just set a new
"monotonic" field in the (also newly renamed)
SupportRequestWFuncMonotonic struct. This can be set to one of the
values from the newly added MonotonicFunction enum, namely:
MONOTONICFUNC_NONE, MONOTONICFUNC_INCREASING, MONOTONICFUNC_DECREASING
or MONOTONICFUNC_BOTH.
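
As a rough illustration of what core code could do with that answer, here is a minimal standalone sketch in C. Only the four enum names come from the mail above. The bit values, the `CmpOp` type and the `choose_run_condition` function are assumptions made up for this example, and the qual is assumed to have the form "wfunc OP constant" with the window function on the left.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Names as given in the mail; the bit layout is an assumption. */
typedef enum MonotonicFunction
{
    MONOTONICFUNC_NONE = 0,
    MONOTONICFUNC_INCREASING = 1 << 0,
    MONOTONICFUNC_DECREASING = 1 << 1,
    MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING
} MonotonicFunction;

/* Hypothetical comparison kinds for a qual "wfunc OP constant". */
typedef enum CmpOp
{
    CMP_LT, CMP_LE, CMP_EQ, CMP_GE, CMP_GT
} CmpOp;

/*
 * Decide whether "wfunc op constant" can provide a run condition.
 * On success, *runop is the operator to use for the run condition and
 * *keep_original says whether the original qual must stay as a filter.
 */
static bool
choose_run_condition(MonotonicFunction mono, CmpOp op,
                     CmpOp *runop, bool *keep_original)
{
    assert(runop != NULL && keep_original != NULL);

    *keep_original = true;      /* safest default */

    switch (op)
    {
        case CMP_LT:
        case CMP_LE:
            /* an upper bound only helps if the value never decreases */
            if (!(mono & MONOTONICFUNC_INCREASING))
                return false;
            *runop = op;
            *keep_original = false;
            return true;

        case CMP_GT:
        case CMP_GE:
            /* a lower bound only helps if the value never increases */
            if (!(mono & MONOTONICFUNC_DECREASING))
                return false;
            *runop = op;
            *keep_original = false;
            return true;

        case CMP_EQ:
            if (mono == MONOTONICFUNC_BOTH)
            {
                /* the value is constant, so equality itself is enough */
                *runop = CMP_EQ;
                *keep_original = false;
                return true;
            }
            if (mono & MONOTONICFUNC_INCREASING)
            {
                *runop = CMP_LE;    /* run until the value passes the target */
                return true;
            }
            if (mono & MONOTONICFUNC_DECREASING)
            {
                *runop = CMP_GE;
                return true;
            }
            return false;
    }
    return false;
}
```

Under these assumptions, an increasing function with "rn < 3" yields a run condition that fully replaces the filter, an increasing function with "rn = 10" yields "rn <= 10" while keeping the filter, and an increasing function with "c >= 3" yields nothing, which matches the plans in the regression tests above.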

I also added handling for a few more cases that are perhaps rare but
could be done with just a few lines of code. For example, COUNT(*)
OVER() is MONOTONICFUNC_BOTH as it can neither increase nor decrease
for a given window partition. I think technically all of the standard
set of aggregate functions could have a prosupport function to handle
that case. Min() and Max() could go a little further, but I'm not sure
if adding handling for that would be worth it, and if someone does
think that it is worth it, then I'd rather do that as a separate
patch.
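
For COUNT(*), the direction depends on the frame bounds. The sketch below shows one way a support function might classify it. It is an assumption-laden simplification, not the patch's code: `FrameInfo` and `count_monotonicity` are hypothetical, frame exclusion clauses are ignored, and COUNT(*) OVER () is modelled as a frame covering the whole partition because all rows are peers when there is no ORDER BY.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Names as given in the mail; the bit layout is an assumption. */
typedef enum MonotonicFunction
{
    MONOTONICFUNC_NONE = 0,
    MONOTONICFUNC_INCREASING = 1 << 0,
    MONOTONICFUNC_DECREASING = 1 << 1,
    MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING
} MonotonicFunction;

/* Hypothetical, simplified view of a window frame (no EXCLUDE clause). */
typedef struct FrameInfo
{
    bool        start_unbounded_preceding;  /* frame starts at partition start */
    bool        end_unbounded_following;    /* frame ends at partition end */
} FrameInfo;

/* Classify COUNT(*) over the given frame. */
static MonotonicFunction
count_monotonicity(const FrameInfo *frame)
{
    int         mono = MONOTONICFUNC_NONE;

    assert(frame != NULL);

    /* a fixed start means later rows can only see more input */
    if (frame->start_unbounded_preceding)
        mono |= MONOTONICFUNC_INCREASING;

    /* a fixed end means later rows can only see less input */
    if (frame->end_unbounded_following)
        mono |= MONOTONICFUNC_DECREASING;

    return (MonotonicFunction) mono;
}
```

In this model the default frame (start unbounded, end at the current row) is increasing, ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING is decreasing, a frame spanning the whole partition is both, and a sliding frame with neither bound fixed is neither.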

I put the MonotonicFunction enum in plannodes.h. There's nothing about
it that is specific to window functions or support functions. It could,
for example, be reused for something else such as monotonic
set-returning functions.

One thing which I'm still not sure about is where
find_window_run_conditions() should be located. Currently, it's in
allpaths.c but that does not really feel like the right place to me.
We do have planagg.c in src/backend/optimizer/plan, maybe we need
planwindow.c?

David

Attachments:

v2_teach_planner_and_executor_about_monotonic_wfuncs.patch (application/octet-stream)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 10644dfac4..690ba88e7e 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1979,6 +1979,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_WindowAgg:
+			show_upper_qual(((WindowAgg *) plan)->runconditionorig,
+							"Run Condition", planstate, ancestors, es);
+			break;
 		case T_Group:
 			show_group_keys(castNode(GroupState, planstate), ancestors, es);
 			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index f8ea9e96d8..ad5fcdd13b 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -2023,6 +2023,7 @@ static TupleTableSlot *
 ExecWindowAgg(PlanState *pstate)
 {
 	WindowAggState *winstate = castNode(WindowAggState, pstate);
+	TupleTableSlot *slot;
 	ExprContext *econtext;
 	int			i;
 	int			numfuncs;
@@ -2235,7 +2236,20 @@ ExecWindowAgg(PlanState *pstate)
 	 */
 	econtext->ecxt_outertuple = winstate->ss.ss_ScanTupleSlot;
 
-	return ExecProject(winstate->ss.ps.ps_ProjInfo);
+	slot = ExecProject(winstate->ss.ps.ps_ProjInfo);
+
+	/*
+	 * Now evaluate the run condition to see if we need to continue further
+	 * with execution.
+	 */
+	econtext->ecxt_scantuple = slot;
+	if (!ExecQual(winstate->runcondition, econtext))
+	{
+		winstate->all_done = true;
+		return NULL;
+	}
+
+	return slot;
 }
 
 /* -----------------
@@ -2307,6 +2321,18 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 	Assert(node->plan.qual == NIL);
 	winstate->ss.ps.qual = NULL;
 
+	/*
+	 * Setup the run condition, if we received one from the query planner.
+	 * When set, this can allow us to finish execution early because we know
+	 * some higher-level filter exists that would just filter out any further
+	 * results that we produce.
+	 */
+	if (node->runcondition != NIL)
+		winstate->runcondition = ExecInitQual(node->runcondition,
+											  (PlanState *) winstate);
+	else
+		winstate->runcondition = NULL;
+
 	/*
 	 * initialize child nodes
 	 */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 38251c2b8e..a1876fe54c 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1095,6 +1095,8 @@ _copyWindowAgg(const WindowAgg *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runcondition);
+	COPY_NODE_FIELD(runconditionorig);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
@@ -2571,6 +2573,8 @@ _copyWindowClause(const WindowClause *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runcondition);
+	COPY_NODE_FIELD(runconditionorig);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 8a1762000c..79136ec47d 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2834,6 +2834,8 @@ _equalWindowClause(const WindowClause *a, const WindowClause *b)
 	COMPARE_SCALAR_FIELD(frameOptions);
 	COMPARE_NODE_FIELD(startOffset);
 	COMPARE_NODE_FIELD(endOffset);
+	COMPARE_NODE_FIELD(runcondition);
+	COMPARE_NODE_FIELD(runconditionorig);
 	COMPARE_SCALAR_FIELD(startInRangeFunc);
 	COMPARE_SCALAR_FIELD(endInRangeFunc);
 	COMPARE_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 87561cbb6f..41b3d66fe3 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -819,6 +819,8 @@ _outWindowAgg(StringInfo str, const WindowAgg *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runcondition);
+	WRITE_NODE_FIELD(runconditionorig);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
@@ -3142,6 +3144,8 @@ _outWindowClause(StringInfo str, const WindowClause *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runcondition);
+	WRITE_NODE_FIELD(runconditionorig);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 0dd1ad7dfc..366d2ef43d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -384,6 +384,8 @@ _readWindowClause(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runcondition);
+	READ_NODE_FIELD(runconditionorig);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
@@ -2346,6 +2348,8 @@ _readWindowAgg(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runcondition);
+	READ_NODE_FIELD(runconditionorig);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 296dd75c1b..e3bb0f5bae 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/supportnodes.h"
 #ifdef OPTIMIZER_DEBUG
 #include "nodes/print.h"
 #endif
@@ -2090,6 +2091,377 @@ has_multiple_baserels(PlannerInfo *root)
 	return false;
 }
 
+/*
+ * find_window_run_conditions
+ *		Call wfunc's prosupport function to ask if the given window function
+ *		is monotonic and then see if 'opexpr' can be used to stop processing
+ *		WindowAgg nodes early.
+ *
+ * For example row_number() over (order by ...) always produces a value one
+ * higher than the previous.  If someone has a window function such as that
+ * in a subquery and just wants, say all rows with a row number less than or
+ * equal to 10, then we may as well stop processing the windowagg once the row
+ * number reaches 11.  Here we look for opexprs that might help us to stop
+ * doing needless extra processing in WindowAgg nodes.
+ *
+ * To do this we make use of the window function's prosupport function to
+ * determine if the given window function with the given window clause is a
+ * monotonically increasing or decreasing function.
+ *
+ * '*keep_original' is set to true if the caller should also use 'opexpr' for
+ * its original purpose.  This is set to false if the caller can assume that
+ * the run condition will handle all of the required filtering.
+ *
+ * Returns true if a run condition qual was found and added to the
+ * wclause->runcondition list and sets *keep_original accordingly.  Returns
+ * false if unable to use 'opexpr' as a run condition, in which case
+ * *keep_original is left set to true.
+ */
+static bool
+find_window_run_conditions(Query *subquery, RangeTblEntry *rte, Index rti,
+						   AttrNumber attno, WindowClause *wclause,
+						   WindowFunc *wfunc, OpExpr *opexpr, bool wfunc_left,
+						   bool *keep_original)
+{
+	Oid			prosupport;
+	Expr	   *otherexpr;
+	SupportRequestWFuncMonotonic req;
+	SupportRequestWFuncMonotonic *res;
+	List	   *opinfos;
+	OpExpr	   *runopexpr;
+	ListCell   *lc;
+
+	*keep_original = true;
+
+	prosupport = get_func_support(wfunc->winfnoid);
+
+	/* Check if there's a support function for 'wfunc' */
+	if (!OidIsValid(prosupport))
+		return false;
+
+	/*
+	 * Currently the WindowAgg node just stops when the run condition is no
+	 * longer true.  If there is a PARTITION BY clause then we cannot just
+	 * stop as other partitions still need to be processed.
+	 */
+	if (wclause->partitionClause != NIL)
+		return false;
+
+	/* get the Expr from the other side of the OpExpr */
+	if (wfunc_left)
+		otherexpr = lsecond(opexpr->args);
+	else
+		otherexpr = linitial(opexpr->args);
+
+	/*
+	 * The value being compared must not change during the evaluation of the
+	 * window partition.
+	 */
+	if (!is_pseudo_constant_clause((Node *) otherexpr))
+		return false;
+
+	req.type = T_SupportRequestWFuncMonotonic;
+	req.window_func = wfunc;
+	req.window_clause = wclause;
+
+	/* Call the support function */
+	res = (SupportRequestWFuncMonotonic *)
+		DatumGetPointer(OidFunctionCall1(prosupport,
+										 PointerGetDatum(&req)));
+
+	/*
+	 * Nothing to do if the function is not monotonically increasing or
+	 * decreasing.
+	 */
+	if (res == NULL || res->monotonic == MONOTONICFUNC_NONE)
+		return false;
+
+	runopexpr = NULL;
+	opinfos = get_op_btree_interpretation(opexpr->opno);
+
+	foreach(lc, opinfos)
+	{
+		OpBtreeInterpretation *opinfo = (OpBtreeInterpretation *) lfirst(lc);
+		int			strategy = opinfo->strategy;
+
+		if (wfunc_left)
+		{
+			/* Handle < / <= */
+			if (strategy == BTLessStrategyNumber ||
+				strategy == BTLessEqualStrategyNumber)
+			{
+				/*
+				 * If the frame is bound to the top of the window then the
+				 * result cannot decrease.
+				 */
+				if (res->monotonic & MONOTONICFUNC_INCREASING)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+				}
+				break;
+			}
+			/* Handle > / >= */
+			else if (strategy == BTGreaterStrategyNumber ||
+					 strategy == BTGreaterEqualStrategyNumber)
+			{
+				/*
+				 * If the frame is bound to the bottom of the window then the
+				 * result cannot increase.
+				 */
+				if (res->monotonic & MONOTONICFUNC_DECREASING)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+				}
+				break;
+			}
+			/* Handle = */
+			else if (strategy == BTEqualStrategyNumber)
+			{
+				OpExpr	   *newopexpr;
+				Oid			op;
+				int16		newstrategy;
+
+				/*
+				 * When both monotonically increasing and decreasing then the
+				 * return value of the window function will be the same each
+				 * time.  We can simply use 'opexpr' as the run condition
+				 * without modifying it.
+				 */
+				if ((res->monotonic & MONOTONICFUNC_BOTH) == MONOTONICFUNC_BOTH)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+					break;
+				}
+
+				/*
+				 * When monotonically increasing we make a qual with <wfunc>
+				 * <= <value> in order to filter out values which are above
+				 * the value in the equality condition.  For monotonically
+				 * decreasing we want to filter values below the value in the
+				 * equality condition.
+				 */
+				if ((res->monotonic & MONOTONICFUNC_INCREASING) == MONOTONICFUNC_INCREASING)
+					newstrategy = BTLessEqualStrategyNumber;
+				else
+					newstrategy = BTGreaterEqualStrategyNumber;
+
+				op = get_opfamily_member(opinfo->opfamily_id,
+										 opinfo->oplefttype,
+										 opinfo->oprighttype,
+										 newstrategy);
+
+				newopexpr = (OpExpr *) make_opclause(op,
+													 opexpr->opresulttype,
+													 opexpr->opretset,
+													 (Expr *) wfunc,
+													 otherexpr,
+													 opexpr->opcollid,
+													 opexpr->inputcollid);
+				newopexpr->opfuncid = get_opcode(op);
+
+				*keep_original = true;
+				runopexpr = newopexpr;
+				break;
+			}
+		}
+		else
+		{
+			/* Handle > / >= */
+			if (strategy == BTGreaterStrategyNumber ||
+				strategy == BTGreaterEqualStrategyNumber)
+			{
+				/*
+				 * If the frame is bound to the top of the window then the
+				 * result cannot decrease.
+				 */
+				if (res->monotonic & MONOTONICFUNC_INCREASING)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+				}
+				break;
+			}
+			/* Handle < / <= */
+			else if (strategy == BTLessStrategyNumber ||
+					 strategy == BTLessEqualStrategyNumber)
+			{
+				/*
+				 * If the frame is bound to the bottom of the window then the
+				 * result cannot increase.
+				 */
+				if (res->monotonic & MONOTONICFUNC_DECREASING)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+				}
+				break;
+			}
+			/* Handle = */
+			else if (strategy == BTEqualStrategyNumber)
+			{
+				OpExpr	   *newopexpr;
+				Oid			op;
+				int16		newstrategy;
+
+				/*
+				 * When both monotonically increasing and decreasing then the
+				 * return value of the window function will be the same each
+				 * time.  We can simply use 'opexpr' as the run condition
+				 * without modifying it.
+				 */
+				if ((res->monotonic & MONOTONICFUNC_BOTH) == MONOTONICFUNC_BOTH)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+					break;
+				}
+
+				/*
+				 * When monotonically increasing we make a qual with <value>
+				 * >= <wfunc> in order to filter out values which are above
+				 * the value in the equality condition.  For monotonically
+				 * decreasing we want to filter values below the value in the
+				 * equality condition.
+				 */
+				if ((res->monotonic & MONOTONICFUNC_INCREASING) == MONOTONICFUNC_INCREASING)
+					newstrategy = BTGreaterEqualStrategyNumber;
+				else
+					newstrategy = BTLessEqualStrategyNumber;
+
+				op = get_opfamily_member(opinfo->opfamily_id,
+										 opinfo->oplefttype,
+										 opinfo->oprighttype,
+										 newstrategy);
+
+				newopexpr = (OpExpr *) make_opclause(op,
+													 opexpr->opresulttype,
+													 opexpr->opretset,
+													 otherexpr,
+													 (Expr *) wfunc,
+													 opexpr->opcollid,
+													 opexpr->inputcollid);
+				newopexpr->opfuncid = get_opcode(op);
+
+				*keep_original = true;
+				runopexpr = newopexpr;
+				break;
+			}
+		}
+	}
+
+	if (runopexpr != NULL)
+	{
+		Expr	   *origexpr;
+
+		wclause->runcondition = lappend(wclause->runcondition, runopexpr);
+
+		/* also create a version of the qual that we can display in EXPLAIN */
+		if (wfunc_left)
+			origexpr = make_opclause(runopexpr->opno,
+									 runopexpr->opresulttype,
+									 runopexpr->opretset, (Expr *) wfunc,
+									 otherexpr, runopexpr->opcollid,
+									 runopexpr->inputcollid);
+		else
+			origexpr = make_opclause(runopexpr->opno,
+									 runopexpr->opresulttype,
+									 runopexpr->opretset,
+									 otherexpr, (Expr *) wfunc,
+									 runopexpr->opcollid,
+									 runopexpr->inputcollid);
+
+		wclause->runconditionorig = lappend(wclause->runconditionorig, origexpr);
+		return true;
+	}
+
+	/* unsupported OpExpr */
+	return false;
+}
+
+/*
+ * check_and_push_window_quals
+ *		Check if 'clause' is a qual that can be pushed into a WindowFunc as a
+ *		'runcondition' qual.  These, when present, cause the window function
+ *		evaluation to stop when the condition becomes false.
+ *
+ * Returns true if the caller still must keep the original qual or false if
+ * the caller can safely ignore the original qual because the window function
+ * will use the run condition to stop at the right time.
+ */
+static bool
+check_and_push_window_quals(Query *subquery, RangeTblEntry *rte, Index rti,
+							Node *clause)
+{
+	OpExpr	   *opexpr = (OpExpr *) clause;
+	bool		keep_original = true;
+	Var		   *var1;
+	Var		   *var2;
+
+	if (!IsA(opexpr, OpExpr))
+		return true;
+
+	if (list_length(opexpr->args) != 2)
+		return true;
+
+	/*
+	 * Check for plain Vars which reference window functions in the subquery.
+	 * If we find any, we'll ask find_window_run_conditions() if 'opexpr' can
+	 * be used as a run condition to allow us to stop windowagg execution
+	 * early.
+	 */
+
+	/* Check the left side of the OpExpr */
+	var1 = linitial(opexpr->args);
+	if (IsA(var1, Var) && var1->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var1->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		while (IsA(wfunc, RelabelType))
+			wfunc = (WindowFunc *) ((RelabelType *) wfunc)->arg;
+
+		if (IsA(wfunc, WindowFunc))
+		{
+			WindowClause *wclause = (WindowClause *)
+			list_nth(subquery->windowClause,
+					 wfunc->winref - 1);
+
+			if (find_window_run_conditions(subquery, rte, rti, tle->resno,
+										   wclause, wfunc, opexpr, true,
+										   &keep_original))
+				return keep_original;
+		}
+	}
+
+	/* and check the right side */
+	var2 = lsecond(opexpr->args);
+	if (IsA(var2, Var) && var2->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var2->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		while (IsA(wfunc, RelabelType))
+			wfunc = (WindowFunc *) ((RelabelType *) wfunc)->arg;
+
+		if (IsA(wfunc, WindowFunc))
+		{
+			WindowClause *wclause = (WindowClause *)
+			list_nth(subquery->windowClause,
+					 wfunc->winref - 1);
+
+			if (find_window_run_conditions(subquery, rte, rti, tle->resno,
+										   wclause, wfunc, opexpr, false,
+										   &keep_original))
+				return keep_original;
+		}
+	}
+
+	return true;
+}
+
 /*
  * set_subquery_pathlist
  *		Generate SubqueryScan access paths for a subquery RTE
@@ -2178,19 +2550,30 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		foreach(l, rel->baserestrictinfo)
 		{
 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+			Node	   *clause = (Node *) rinfo->clause;
 
 			if (!rinfo->pseudoconstant &&
 				qual_is_pushdown_safe(subquery, rti, rinfo, &safetyInfo))
 			{
-				Node	   *clause = (Node *) rinfo->clause;
-
 				/* Push it down */
 				subquery_push_qual(subquery, rte, rti, clause);
 			}
 			else
 			{
-				/* Keep it in the upper query */
-				upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				/*
+				 * Since we can't push the qual down into the subquery, check
+				 * if it happens to reference a window function.  If so then
+				 * it might be useful to allow the window evaluation to stop
+				 * early.
+				 */
+				if (check_and_push_window_quals(subquery, rte, rti, clause))
+				{
+					/*
+					 * It's not a suitable window run condition qual or it is,
+					 * but the original must also be kept in the upper query.
+					 */
+					upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				}
 			}
 		}
 		rel->baserestrictinfo = upperrestrictlist;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index a5f6d678cc..642e088f7e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -286,6 +286,7 @@ static WindowAgg *make_windowagg(List *tlist, Index winref,
 								 int frameOptions, Node *startOffset, Node *endOffset,
 								 Oid startInRangeFunc, Oid endInRangeFunc,
 								 Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
+								 List *runcondition, List *runconditionorig,
 								 Plan *lefttree);
 static Group *make_group(List *tlist, List *qual, int numGroupCols,
 						 AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
@@ -2619,6 +2620,8 @@ create_windowagg_plan(PlannerInfo *root, WindowAggPath *best_path)
 						  wc->inRangeColl,
 						  wc->inRangeAsc,
 						  wc->inRangeNullsFirst,
+						  wc->runcondition,
+						  wc->runconditionorig,
 						  subplan);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6477,7 +6480,7 @@ make_windowagg(List *tlist, Index winref,
 			   int frameOptions, Node *startOffset, Node *endOffset,
 			   Oid startInRangeFunc, Oid endInRangeFunc,
 			   Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
-			   Plan *lefttree)
+			   List *runcondition, List *runconditionorig, Plan *lefttree)
 {
 	WindowAgg  *node = makeNode(WindowAgg);
 	Plan	   *plan = &node->plan;
@@ -6494,6 +6497,8 @@ make_windowagg(List *tlist, Index winref,
 	node->frameOptions = frameOptions;
 	node->startOffset = startOffset;
 	node->endOffset = endOffset;
+	node->runcondition = runcondition;
+	node->runconditionorig = runconditionorig;
 	node->startInRangeFunc = startInRangeFunc;
 	node->endInRangeFunc = endInRangeFunc;
 	node->inRangeColl = inRangeColl;
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index e50624c465..a925aa3d83 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -870,6 +870,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
 				wplan->endOffset =
 					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+				wplan->runcondition = fix_scan_list(root,
+													wplan->runcondition,
+													rtoffset,
+													NUM_EXEC_TLIST(plan));
+				wplan->runconditionorig = fix_scan_list(root,
+														wplan->runconditionorig,
+														rtoffset,
+														NUM_EXEC_TLIST(plan));
 			}
 			break;
 		case T_Result:
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index 2168080dcc..8abd4e8598 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -25,6 +25,7 @@
 #include "optimizer/optimizer.h"
 #include "utils/builtins.h"
 #include "utils/int8.h"
+#include "utils/lsyscache.h"
 
 
 typedef struct
@@ -877,6 +878,7 @@ int8dec(PG_FUNCTION_ARGS)
 }
 
 
+
 /*
  * These functions are exactly like int8inc/int8dec but are used for
  * aggregates that count only non-null values.  Since the functions are
@@ -904,6 +906,49 @@ int8dec_any(PG_FUNCTION_ARGS)
 	return int8dec(fcinfo);
 }
 
+/*
+ * int8inc_support
+ *		prosupport function for int8inc() and int8inc_any()
+ */
+Datum
+int8inc_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+		MonotonicFunction monotonic = MONOTONICFUNC_NONE;
+		int			frameOptions = req->window_clause->frameOptions;
+
+		/* With no ORDER BY clause, all rows are peers */
+		if (req->window_clause->orderClause == NIL)
+			monotonic = MONOTONICFUNC_BOTH;
+		else
+		{
+			/*
+			 * Otherwise take into account the frame options.  When the frame
+			 * bound is the start of the window then the resulting value can
+			 * never decrease and is therefore monotonically increasing.
+			 */
+			if (frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING)
+				monotonic |= MONOTONICFUNC_INCREASING;
+
+			/*
+			 * Likewise, if the frame bound is the end of the window then the
+			 * resulting value can never increase.
+			 */
+			if (frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING)
+				monotonic |= MONOTONICFUNC_DECREASING;
+		}
+
+		req->monotonic = monotonic;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 
 Datum
 int8larger(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/adt/windowfuncs.c b/src/backend/utils/adt/windowfuncs.c
index 9c127617d1..27257b7628 100644
--- a/src/backend/utils/adt/windowfuncs.c
+++ b/src/backend/utils/adt/windowfuncs.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "nodes/supportnodes.h"
 #include "utils/builtins.h"
 #include "windowapi.h"
 
@@ -88,6 +89,26 @@ window_row_number(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(curpos + 1);
 }
 
+/*
+ * window_row_number_support
+ *		prosupport function for window_row_number()
+ */
+Datum
+window_row_number_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* row_number() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
 
 /*
  * rank
@@ -110,6 +131,27 @@ window_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_rank_support
+ *		prosupport function for window_rank()
+ */
+Datum
+window_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* rank() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * dense_rank
  * Rank increases by 1 when key columns change.
@@ -130,6 +172,27 @@ window_dense_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_dense_rank_support
+ *		prosupport function for window_dense_rank()
+ */
+Datum
+window_dense_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* dense_rank() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * percent_rank
  * return fraction between 0 and 1 inclusive,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b603700ed9..22988d4c59 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6626,11 +6626,16 @@
 # count has two forms: count(any) and count(*)
 { oid => '2147',
   descr => 'number of input rows for which the input expression is not null',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => 'any', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => 'any',
+  prosrc => 'aggregate_dummy' },
 { oid => '2803', descr => 'number of input rows',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => '', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'aggregate_dummy' },
+{ oid => '8802', descr => 'planner support for count run condition',
+  proname => 'int8inc_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'int8inc_support' },
 
 { oid => '2718',
   descr => 'population variance of bigint input values (square of the population standard deviation)',
@@ -10064,14 +10069,26 @@
 
 # SQL-spec window functions
 { oid => '3100', descr => 'row number within partition',
-  proname => 'row_number', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_row_number' },
+  proname => 'row_number', prosupport => 'window_row_number_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8',
+  proargtypes => '', prosrc => 'window_row_number' },
+{ oid => '8799', descr => 'planner support for row_number run condition',
+  proname => 'window_row_number_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_row_number_support' },
 { oid => '3101', descr => 'integer rank with gaps',
-  proname => 'rank', prokind => 'w', proisstrict => 'f', prorettype => 'int8',
+  proname => 'rank', prosupport => 'window_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8',
   proargtypes => '', prosrc => 'window_rank' },
+{ oid => '8800', descr => 'planner support for rank run condition',
+  proname => 'window_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_rank_support' },
 { oid => '3102', descr => 'integer rank without gaps',
-  proname => 'dense_rank', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_dense_rank' },
+  proname => 'dense_rank', prosupport => 'window_dense_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'window_dense_rank' },
+{ oid => '8801', descr => 'planner support for dense rank run condition',
+  proname => 'window_dense_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_dense_rank_support' },
 { oid => '3103', descr => 'fractional rank within partition',
   proname => 'percent_rank', prokind => 'w', proisstrict => 'f',
   prorettype => 'float8', proargtypes => '', prosrc => 'window_percent_rank' },
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 37cb4f3d59..1e2befd797 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2408,6 +2408,10 @@ typedef struct WindowAggState
 	MemoryContext curaggcontext;	/* current aggregate's working data */
 	ExprContext *tmpcontext;	/* short-term evaluation context */
 
+	ExprState  *runcondition;	/* Condition which must remain true otherwise
+								 * execution of the WindowAgg will finish, or
+								 * NULL if there is no such condition. */
+
 	bool		all_first;		/* true if the scan is starting */
 	bool		all_done;		/* true if the scan is finished */
 	bool		partition_spooled;	/* true if all tuples in current partition
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 6a4d82f0a8..9c4799d3e1 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -526,7 +526,8 @@ typedef enum NodeTag
 	T_SupportRequestSelectivity,	/* in nodes/supportnodes.h */
 	T_SupportRequestCost,		/* in nodes/supportnodes.h */
 	T_SupportRequestRows,		/* in nodes/supportnodes.h */
-	T_SupportRequestIndexCondition	/* in nodes/supportnodes.h */
+	T_SupportRequestIndexCondition, /* in nodes/supportnodes.h */
+	T_SupportRequestWFuncMonotonic	/* in nodes/supportnodes.h */
 } NodeTag;
 
 /*
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index e28248af32..b76e3106dd 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1381,6 +1381,8 @@ typedef struct WindowClause
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runcondition;	/* Exec WindowAgg while this is true */
+	List	   *runconditionorig;	/* EXPLAIN compatible version of above */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
 	Oid			inRangeColl;	/* collation for in_range tests */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index ec9a8b0c81..1c71e6db41 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -893,6 +893,9 @@ typedef struct WindowAgg
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runcondition;	/* Conditions that must remain true in order
+								 * for execution to continue */
+	List	   *runconditionorig;	/* runcondition for display in EXPLAIN */
 	/* these fields are used with RANGE offset PRECEDING/FOLLOWING: */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
@@ -1291,4 +1294,21 @@ typedef struct PlanInvalItem
 	uint32		hashValue;		/* hash value of object's cache lookup key */
 } PlanInvalItem;
 
+/*
+ * MonotonicFunction
+ *
+ * Allows the planner to track monotonic properties of functions.  A function
+ * is monotonically increasing if a subsequent call cannot yield a lower value
+ * than the previous call.  A monotonically decreasing function cannot yield a
+ * higher value, and a function which is both must return the same value on
+ * each call.
+ */
+typedef enum MonotonicFunction
+{
+	MONOTONICFUNC_NONE = 0,
+	MONOTONICFUNC_INCREASING = (1 << 0),
+	MONOTONICFUNC_DECREASING = (1 << 1),
+	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING
+} MonotonicFunction;
+
 #endif							/* PLANNODES_H */
diff --git a/src/include/nodes/supportnodes.h b/src/include/nodes/supportnodes.h
index 85e1b8a832..33718332c9 100644
--- a/src/include/nodes/supportnodes.h
+++ b/src/include/nodes/supportnodes.h
@@ -33,12 +33,12 @@
 #ifndef SUPPORTNODES_H
 #define SUPPORTNODES_H
 
-#include "nodes/primnodes.h"
+#include "nodes/plannodes.h"
 
 struct PlannerInfo;				/* avoid including pathnodes.h here */
 struct IndexOptInfo;
 struct SpecialJoinInfo;
-
+struct WindowClause;
 
 /*
  * The Simplify request allows the support function to perform plan-time
@@ -239,4 +239,58 @@ typedef struct SupportRequestIndexCondition
 								 * equivalent of the function call */
 } SupportRequestIndexCondition;
 
+/* ----------
+ * To allow more efficient execution of any monotonically increasing and/or
+ * monotonically decreasing window functions, we support calling the window
+ * function's prosupport function passing along this struct whenever the
+ * planner sees an OpExpr qual directly reference a window function in a
+ * subquery.  When the planner encounters this, we populate this struct and
+ * pass it along to the window function's prosupport function so that it can
+ * evaluate if the given WindowFunc is:
+ *
+ * a) monotonically increasing, or
+ * b) monotonically decreasing, or
+ * c) both monotonically increasing and decreasing, or
+ * d) none of the above.
+ *
+ * A function that is monotonically increasing can never return a value that
+ * is lower than a value returned in a "previous call".  A monotonically
+ * decreasing function can never return a value higher than a value returned
+ * in a previous call.  A function which is both must return the same value
+ * each time.
+ *
+ * We define "previous call" to mean a previous call to the same WindowFunc
+ * struct in the same window partition.
+ *
+ * row_number() is an example of a monotonically increasing function.  The
+ * return value will be reset back to 1 in each new partition.  An example of
+ * a function that is both monotonically increasing and decreasing is
+ * count(*) over ().  Since there is no ORDER BY clause in this example, all
+ * rows in the partition are peers and all rows within the partition will be
+ * within the frame bound.  Likewise for count(*) over(order by a rows between
+ * unbounded preceding and unbounded following).
+ *
+ * Inputs:
+ *	'window_func' is the pointer to the window function being called.
+ *
+ *	'window_clause' pointer to the WindowClause data.  Support functions can
+ *	use this to check frame bounds and partition by clauses, etc.
+ *
+ * Outputs:
+ *	'monotonic' the resulting MonotonicFunction value for the given input
+ *	window function and window clause.
+ * ----------
+ */
+typedef struct SupportRequestWFuncMonotonic
+{
+	NodeTag		type;
+
+	/* Input fields: */
+	WindowFunc *window_func;	/* Pointer to the window function data */
+	struct WindowClause *window_clause; /* Pointer to the window clause data */
+
+	/* Output fields: */
+	MonotonicFunction monotonic;
+} SupportRequestWFuncMonotonic;
+
 #endif							/* SUPPORTNODES_H */
diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out
index bb9ff7f07b..7c343dc3e6 100644
--- a/src/test/regress/expected/window.out
+++ b/src/test/regress/expected/window.out
@@ -3336,6 +3336,275 @@ WHERE depname = 'sales';
                            ->  Seq Scan on empsalary
 (9 rows)
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                  QUERY PLAN                  
+----------------------------------------------
+ WindowAgg
+   Run Condition: (row_number() OVER (?) < 3)
+   ->  Sort
+         Sort Key: empsalary.empno
+         ->  Seq Scan on empsalary
+(5 rows)
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+               QUERY PLAN                
+-----------------------------------------
+ WindowAgg
+   Run Condition: (rank() OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+ empno | salary | r 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 2
+    11 |   5200 | 2
+(3 rows)
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Run Condition: (dense_rank() OVER (?) <= 1)
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+ empno | salary | dr 
+-------+--------+----
+     8 |   6000 |  1
+(1 row)
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                       QUERY PLAN                        
+---------------------------------------------------------
+ WindowAgg
+   Run Condition: (count(empsalary.empno) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) >= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER () c
+   FROM empsalary) emp
+WHERE 11 <= c;
+                 QUERY PLAN                 
+--------------------------------------------
+ WindowAgg
+   Run Condition: (11 <= count(*) OVER (?))
+   ->  Seq Scan on empsalary
+(3 rows)
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.rn < 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.depname, empsalary.empno
+               ->  Seq Scan on empsalary
+(6 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.c <= 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.depname, empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(6 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                  QUERY PLAN                   
+-----------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.c <= 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(6 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+                QUERY PLAN                
+------------------------------------------
+ Subquery Scan on emp
+   Filter: (3 <= emp.c)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary
+               ->  Seq Scan on empsalary
+(6 rows)
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
diff --git a/src/test/regress/sql/window.sql b/src/test/regress/sql/window.sql
index 41a8e0d152..c746d5343a 100644
--- a/src/test/regress/sql/window.sql
+++ b/src/test/regress/sql/window.sql
@@ -988,6 +988,146 @@ SELECT * FROM
    FROM empsalary) emp
 WHERE depname = 'sales';
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER () c
+   FROM empsalary) emp
+WHERE 11 <= c;
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM

Zhihong Yu

zyu@yugabyte.com

over 4 years ago

In reply to: David Rowley (#3)

Re: Window Function "Run Conditions"

On Mon, Aug 16, 2021 at 3:28 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 1 Jul 2021 at 21:11, David Rowley <dgrowleyml@gmail.com> wrote:

1) Unsure of the API to the prosupport function. I wonder if the
prosupport function should just be able to say if the function is
either monotonically increasing or decreasing or neither then have
core code build a qual. That would make the job of building new
functions easier, but massively reduce the flexibility of the feature.
I'm just not sure it needs to do more in the future.

I looked at this patch again today and ended up changing the API that
I'd done for the prosupport functions. These just now set a new
"monotonic" field in the (also newly renamed)
SupportRequestWFuncMonotonic struct. This can be set to one of the
values from the newly added MonotonicFunction enum, namely:
MONOTONICFUNC_NONE, MONOTONICFUNC_INCREASING, MONOTONICFUNC_DECREASING
or MONOTONICFUNC_BOTH.
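
To make the division of labour concrete, here is a minimal sketch of how core code can decide, from the monotonic property alone, whether a qual of the form `<wfunc> op <constant>` is usable as a run condition. It mirrors the wfunc-on-the-left branch of find_window_run_conditions() in the attached patch but is not taken from it. The bit-flag encoding of the enum is an assumption (the patch's use of `&` suggests it), and `CmpOp` and `usable_as_run_condition()` are hypothetical names standing in for the btree strategy numbers and the real planner code.

```c
#include <stdbool.h>

/* Assumed encoding: the thread names these values but not their bit layout. */
typedef enum MonotonicFunction
{
	MONOTONICFUNC_NONE = 0,
	MONOTONICFUNC_INCREASING = (1 << 0),
	MONOTONICFUNC_DECREASING = (1 << 1),
	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING
} MonotonicFunction;

/* Hypothetical stand-in for the btree strategy numbers used by the planner. */
typedef enum CmpOp
{
	CMP_LT,
	CMP_LE,
	CMP_EQ,
	CMP_GE,
	CMP_GT
} CmpOp;

/*
 * For a qual "<wfunc> op <constant>", report whether a run condition can be
 * derived from it.  An equality qual is usable for any monotonic function
 * because it can be widened to <= (increasing) or >= (decreasing), with the
 * original qual kept as a filter.
 */
static bool
usable_as_run_condition(MonotonicFunction m, CmpOp op)
{
	switch (op)
	{
		case CMP_LT:
		case CMP_LE:
			/* once an increasing function passes the bound it never returns */
			return (m & MONOTONICFUNC_INCREASING) != 0;
		case CMP_GT:
		case CMP_GE:
			/* same argument for a decreasing function and a lower bound */
			return (m & MONOTONICFUNC_DECREASING) != 0;
		case CMP_EQ:
			return m != MONOTONICFUNC_NONE;
	}
	return false;
}
```

With this shape, a support function only has to report one of the four enum values. Operator handling, including swapping the comparison when the window function is on the right-hand side, stays in core.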

I also added handling for a few more cases that are perhaps rare but
could be done with just a few lines of code. For example, COUNT(*)
OVER() is MONOTONICFUNC_BOTH as it can neither increase nor decrease
for a given window partition. I think technically all of the standard
set of aggregate functions could have a prosupport function to handle
that case. Min() and Max() could go a little further, but I'm not sure
if adding handling for that would be worth it, and if someone does
think that it is worth it, then I'd rather do that as a separate
patch.

I put the MonotonicFunction enum in plannodes.h. There's nothing in it
that is specific to window functions or support functions. It could,
for example, be reused for something else such as monotonic
set-returning functions.

One thing which I'm still not sure about is where
find_window_run_conditions() should be located. Currently, it's in
allpaths.c but that does not really feel like the right place to me.
We do have planagg.c in src/backend/optimizer/plan, maybe we need
planwindow.c?

David

Hi,

+ if ((res->monotonic & MONOTONICFUNC_INCREASING) == MONOTONICFUNC_INCREASING)

The above can be simplified to 'if (res->monotonic & MONOTONICFUNC_INCREASING)'.
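
A small sketch of why the shorter test is equivalent here, assuming MONOTONICFUNC_INCREASING is a single-bit flag (the enum encoding below is an assumption; the helper names are hypothetical). For a single bit, "the masked value equals the flag" and "the masked value is non-zero" are the same condition. The same shortcut would not be safe for a multi-bit mask such as MONOTONICFUNC_BOTH, where the `==` form asks for all bits and the bare `&` form asks for any bit.

```c
#include <stdbool.h>

/* Assumed bit-flag encoding of the enum named in the thread. */
typedef enum MonotonicFunction
{
	MONOTONICFUNC_NONE = 0,
	MONOTONICFUNC_INCREASING = (1 << 0),
	MONOTONICFUNC_DECREASING = (1 << 1),
	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING
} MonotonicFunction;

/* The longer form of the single-bit test. */
static bool
is_increasing_verbose(int m)
{
	return (m & MONOTONICFUNC_INCREASING) == MONOTONICFUNC_INCREASING;
}

/* The suggested shorter form; identical result for a single-bit flag. */
static bool
is_increasing_short(int m)
{
	return (m & MONOTONICFUNC_INCREASING) != 0;
}

/* For the two-bit mask, "all bits set" ... */
static bool
is_both(int m)
{
	return (m & MONOTONICFUNC_BOTH) == MONOTONICFUNC_BOTH;
}

/* ... is a different question from "any bit set". */
static bool
is_either(int m)
{
	return (m & MONOTONICFUNC_BOTH) != 0;
}
```

This is presumably why a check against MONOTONICFUNC_BOTH has to keep the explicit `== MONOTONICFUNC_BOTH` comparison even after the single-flag tests are shortened.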

Cheers

David Rowley

dgrowleyml@gmail.com

over 4 years ago

In reply to: Zhihong Yu (#4)

1 attachment(s)

Re: Window Function "Run Conditions"

On Tue, 17 Aug 2021 at 03:51, Zhihong Yu <zyu@yugabyte.com> wrote:

+ if ((res->monotonic & MONOTONICFUNC_INCREASING) == MONOTONICFUNC_INCREASING)

The above can be simplified as 'if (res->monotonic & MONOTONICFUNC_INCREASING) '

True. I've attached an updated patch.

David

Attachments:

v3_teach_planner_and_executor_about_monotonic_wfuncs.patch (application/octet-stream)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 10644dfac4..690ba88e7e 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1979,6 +1979,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_WindowAgg:
+			show_upper_qual(((WindowAgg *) plan)->runconditionorig,
+							"Run Condition", planstate, ancestors, es);
+			break;
 		case T_Group:
 			show_group_keys(castNode(GroupState, planstate), ancestors, es);
 			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index f8ea9e96d8..ad5fcdd13b 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -2023,6 +2023,7 @@ static TupleTableSlot *
 ExecWindowAgg(PlanState *pstate)
 {
 	WindowAggState *winstate = castNode(WindowAggState, pstate);
+	TupleTableSlot *slot;
 	ExprContext *econtext;
 	int			i;
 	int			numfuncs;
@@ -2235,7 +2236,20 @@ ExecWindowAgg(PlanState *pstate)
 	 */
 	econtext->ecxt_outertuple = winstate->ss.ss_ScanTupleSlot;
 
-	return ExecProject(winstate->ss.ps.ps_ProjInfo);
+	slot = ExecProject(winstate->ss.ps.ps_ProjInfo);
+
+	/*
+	 * Now evaluate the run condition to see if we need to continue further
+	 * with execution.
+	 */
+	econtext->ecxt_scantuple = slot;
+	if (!ExecQual(winstate->runcondition, econtext))
+	{
+		winstate->all_done = true;
+		return NULL;
+	}
+
+	return slot;
 }
 
 /* -----------------
@@ -2307,6 +2321,18 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 	Assert(node->plan.qual == NIL);
 	winstate->ss.ps.qual = NULL;
 
+	/*
+	 * Setup the run condition, if we received one from the query planner.
+	 * When set, this can allow us to finish execution early because we know
+	 * some higher-level filter exists that would just filter out any further
+	 * results that we produce.
+	 */
+	if (node->runcondition != NIL)
+		winstate->runcondition = ExecInitQual(node->runcondition,
+											  (PlanState *) winstate);
+	else
+		winstate->runcondition = NULL;
+
 	/*
 	 * initialize child nodes
 	 */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 38251c2b8e..a1876fe54c 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1095,6 +1095,8 @@ _copyWindowAgg(const WindowAgg *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runcondition);
+	COPY_NODE_FIELD(runconditionorig);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
@@ -2571,6 +2573,8 @@ _copyWindowClause(const WindowClause *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runcondition);
+	COPY_NODE_FIELD(runconditionorig);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 8a1762000c..79136ec47d 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2834,6 +2834,8 @@ _equalWindowClause(const WindowClause *a, const WindowClause *b)
 	COMPARE_SCALAR_FIELD(frameOptions);
 	COMPARE_NODE_FIELD(startOffset);
 	COMPARE_NODE_FIELD(endOffset);
+	COMPARE_NODE_FIELD(runcondition);
+	COMPARE_NODE_FIELD(runconditionorig);
 	COMPARE_SCALAR_FIELD(startInRangeFunc);
 	COMPARE_SCALAR_FIELD(endInRangeFunc);
 	COMPARE_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 87561cbb6f..41b3d66fe3 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -819,6 +819,8 @@ _outWindowAgg(StringInfo str, const WindowAgg *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runcondition);
+	WRITE_NODE_FIELD(runconditionorig);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
@@ -3142,6 +3144,8 @@ _outWindowClause(StringInfo str, const WindowClause *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runcondition);
+	WRITE_NODE_FIELD(runconditionorig);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 0dd1ad7dfc..366d2ef43d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -384,6 +384,8 @@ _readWindowClause(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runcondition);
+	READ_NODE_FIELD(runconditionorig);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
@@ -2346,6 +2348,8 @@ _readWindowAgg(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runcondition);
+	READ_NODE_FIELD(runconditionorig);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 296dd75c1b..5c612e53e5 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/supportnodes.h"
 #ifdef OPTIMIZER_DEBUG
 #include "nodes/print.h"
 #endif
@@ -2090,6 +2091,377 @@ has_multiple_baserels(PlannerInfo *root)
 	return false;
 }
 
+/*
+ * find_window_run_conditions
+ *		Call wfunc's prosupport function to ask if the given window function
+ *		is monotonic and then see if 'opexpr' can be used to stop processing
+ *		WindowAgg nodes early.
+ *
+ * For example row_number() over (order by ...) always produces a value one
+ * higher than the previous.  If someone has a window function such as that
+ * in a subquery and just wants, say all rows with a row number less than or
+ * equal to 10, then we may as well stop processing the windowagg once the row
+ * number reaches 11.  Here we look for opexprs that might help us to stop
+ * doing needless extra processing in WindowAgg nodes.
+ *
+ * To do this we make use of the window function's prosupport function to
+ * determine if the given window function with the given window clause is a
+ * monotonically increasing or decreasing function.
+ *
+ * '*keep_original' is set to true if the caller should also use 'opexpr' for
+ * its original purpose.  This is set to false if the caller can assume that
+ * the run condition will handle all of the required filtering.
+ *
+ * Returns true if a run condition qual was found and added to the
+ * wclause->runcondition list and sets *keep_original accordingly.  Returns
+ * false if unable to use 'opexpr' as a run condition and does not set
+ * *keep_original.
+ */
+static bool
+find_window_run_conditions(Query *subquery, RangeTblEntry *rte, Index rti,
+						   AttrNumber attno, WindowClause *wclause,
+						   WindowFunc *wfunc, OpExpr *opexpr, bool wfunc_left,
+						   bool *keep_original)
+{
+	Oid			prosupport;
+	Expr	   *otherexpr;
+	SupportRequestWFuncMonotonic req;
+	SupportRequestWFuncMonotonic *res;
+	List	   *opinfos;
+	OpExpr	   *runopexpr;
+	ListCell   *lc;
+
+	*keep_original = true;
+
+	prosupport = get_func_support(wfunc->winfnoid);
+
+	/* Check if there's a support function for 'wfunc' */
+	if (!OidIsValid(prosupport))
+		return false;
+
+	/*
+	 * Currently the WindowAgg node just stop when the run condition is no
+	 * longer true.  If there is a PARTITION BY clause then we cannot just
+	 * stop as other partitions still need to be processed.
+	 */
+	if (wclause->partitionClause != NIL)
+		return false;
+
+	/* get the Expr from the other side of the OpExpr */
+	if (wfunc_left)
+		otherexpr = lsecond(opexpr->args);
+	else
+		otherexpr = linitial(opexpr->args);
+
+	/*
+	 * The value being compared must not change during the evaluation of the
+	 * window partition.
+	 */
+	if (!is_pseudo_constant_clause((Node *) otherexpr))
+		return false;
+
+	req.type = T_SupportRequestWFuncMonotonic;
+	req.window_func = wfunc;
+	req.window_clause = wclause;
+
+	/* Call the support function */
+	res = (SupportRequestWFuncMonotonic *)
+		DatumGetPointer(OidFunctionCall1(prosupport,
+										 PointerGetDatum(&req)));
+
+	/*
+	 * Nothing to do if the function is not monotonically increasing or
+	 * decreasing.
+	 */
+	if (res == NULL || res->monotonic == MONOTONICFUNC_NONE)
+		return false;
+
+	runopexpr = NULL;
+	opinfos = get_op_btree_interpretation(opexpr->opno);
+
+	foreach(lc, opinfos)
+	{
+		OpBtreeInterpretation *opinfo = (OpBtreeInterpretation *) lfirst(lc);
+		int			strategy = opinfo->strategy;
+
+		if (wfunc_left)
+		{
+			/* Handle < / <= */
+			if (strategy == BTLessStrategyNumber ||
+				strategy == BTLessEqualStrategyNumber)
+			{
+				/*
+				 * If the frame is bound to the top of the window then the
+				 * result cannot decrease.
+				 */
+				if (res->monotonic & MONOTONICFUNC_INCREASING)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+				}
+				break;
+			}
+			/* Handle > / >= */
+			else if (strategy == BTGreaterStrategyNumber ||
+					 strategy == BTGreaterEqualStrategyNumber)
+			{
+				/*
+				 * If the frame is bound to the bottom of the window then the
+				 * result cannot increase.
+				 */
+				if (res->monotonic & MONOTONICFUNC_DECREASING)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+				}
+				break;
+			}
+			/* Handle = */
+			else if (strategy == BTEqualStrategyNumber)
+			{
+				OpExpr	   *newopexpr;
+				Oid			op;
+				int16		newstrategy;
+
+				/*
+				 * When both monotonically increasing and decreasing then the
+				 * return value of the window function will be the same each
+				 * time.  We can simply use 'opexpr' as the run condition
+				 * without modifying it.
+				 */
+				if ((res->monotonic & MONOTONICFUNC_BOTH) == MONOTONICFUNC_BOTH)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+					break;
+				}
+
+				/*
+				 * When monotonically increasing we make a qual with <wfunc>
+				 * <= <value> in order to filter out values which are above
+				 * the value in the equality condition.  For monotonically
+				 * decreasing we want to filter values below the value in the
+				 * equality condition.
+				 */
+				if (res->monotonic & MONOTONICFUNC_INCREASING)
+					newstrategy = BTLessEqualStrategyNumber;
+				else
+					newstrategy = BTGreaterEqualStrategyNumber;
+
+				op = get_opfamily_member(opinfo->opfamily_id,
+										 opinfo->oplefttype,
+										 opinfo->oprighttype,
+										 newstrategy);
+
+				newopexpr = (OpExpr *) make_opclause(op,
+													 opexpr->opresulttype,
+													 opexpr->opretset,
+													 (Expr *) wfunc,
+													 otherexpr,
+													 opexpr->opcollid,
+													 opexpr->inputcollid);
+				newopexpr->opfuncid = get_opcode(op);
+
+				*keep_original = true;
+				runopexpr = newopexpr;
+				break;
+			}
+		}
+		else
+		{
+			/* Handle > / >= */
+			if (strategy == BTGreaterStrategyNumber ||
+				strategy == BTGreaterEqualStrategyNumber)
+			{
+				/*
+				 * If the frame is bound to the top of the window then the
+				 * result cannot decrease.
+				 */
+				if (res->monotonic & MONOTONICFUNC_INCREASING)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+				}
+				break;
+			}
+			/* Handle < / <= */
+			else if (strategy == BTLessStrategyNumber ||
+					 strategy == BTLessEqualStrategyNumber)
+			{
+				/*
+				 * If the frame is bound to the bottom of the window then the
+				 * result cannot increase.
+				 */
+				if (res->monotonic & MONOTONICFUNC_DECREASING)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+				}
+				break;
+			}
+			/* Handle = */
+			else if (strategy == BTEqualStrategyNumber)
+			{
+				OpExpr	   *newopexpr;
+				Oid			op;
+				int16		newstrategy;
+
+				/*
+				 * When both monotonically increasing and decreasing then the
+				 * return value of the window function will be the same each
+				 * time.  We can simply use 'opexpr' as the run condition
+				 * without modifying it.
+				 */
+				if ((res->monotonic & MONOTONICFUNC_BOTH) == MONOTONICFUNC_BOTH)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+					break;
+				}
+
+				/*
+				 * When monotonically increasing we make a qual with <value>
+				 * >= <wfunc> in order to filter out values which are above
+				 * the value in the equality condition.  For monotonically
+				 * decreasing we want to filter values below the value in the
+				 * equality condition.
+				 */
+				if (res->monotonic & MONOTONICFUNC_INCREASING)
+					newstrategy = BTGreaterEqualStrategyNumber;
+				else
+					newstrategy = BTLessEqualStrategyNumber;
+
+				op = get_opfamily_member(opinfo->opfamily_id,
+										 opinfo->oplefttype,
+										 opinfo->oprighttype,
+										 newstrategy);
+
+				newopexpr = (OpExpr *) make_opclause(op,
+													 opexpr->opresulttype,
+													 opexpr->opretset,
+													 otherexpr,
+													 (Expr *) wfunc,
+													 opexpr->opcollid,
+													 opexpr->inputcollid);
+				newopexpr->opfuncid = get_opcode(op);
+
+				*keep_original = true;
+				runopexpr = newopexpr;
+				break;
+			}
+		}
+	}
+
+	if (runopexpr != NULL)
+	{
+		Expr	   *origexpr;
+
+		wclause->runcondition = lappend(wclause->runcondition, runopexpr);
+
+		/* also create a version of the qual that we can display in EXPLAIN */
+		if (wfunc_left)
+			origexpr = make_opclause(runopexpr->opno,
+									 runopexpr->opresulttype,
+									 runopexpr->opretset, (Expr *) wfunc,
+									 otherexpr, runopexpr->opcollid,
+									 runopexpr->inputcollid);
+		else
+			origexpr = make_opclause(runopexpr->opno,
+									 runopexpr->opresulttype,
+									 runopexpr->opretset,
+									 otherexpr, (Expr *) wfunc,
+									 runopexpr->opcollid,
+									 runopexpr->inputcollid);
+
+		wclause->runconditionorig = lappend(wclause->runconditionorig, origexpr);
+		return true;
+	}
+
+	/* unsupported OpExpr */
+	return false;
+}
+
+/*
+ * check_and_push_window_quals
+ *		Check if 'rinfo' is a qual that can be pushed into a WindowFunc as a
+ *		'runcondition' qual.  These, when present, cause the window function
+ *		evaluation to stop when the condition becomes false.
+ *
+ * Returns true if the caller still must keep the original qual or false if
+ * the caller can safely ignore the original qual because the window function
+ * will use the run condition to stop at the right time.
+ */
+static bool
+check_and_push_window_quals(Query *subquery, RangeTblEntry *rte, Index rti,
+							Node *clause)
+{
+	OpExpr	   *opexpr = (OpExpr *) clause;
+	bool		keep_original = true;
+	Var		   *var1;
+	Var		   *var2;
+
+	if (!IsA(opexpr, OpExpr))
+		return true;
+
+	if (list_length(opexpr->args) != 2)
+		return true;
+
+	/*
+	 * Check for plain Vars which reference window functions in the subquery.
+	 * If we find any, we'll ask find_window_run_conditions() if 'opexpr' can
+	 * be used as a run condition to allow us to stop windowagg execution
+	 * early.
+	 */
+
+	/* Check the left side of the OpExpr */
+	var1 = linitial(opexpr->args);
+	if (IsA(var1, Var) && var1->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var1->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		while (IsA(wfunc, RelabelType))
+			wfunc = (WindowFunc *) ((RelabelType *) wfunc)->arg;
+
+		if (IsA(wfunc, WindowFunc))
+		{
+			WindowClause *wclause = (WindowClause *)
+			list_nth(subquery->windowClause,
+					 wfunc->winref - 1);
+
+			if (find_window_run_conditions(subquery, rte, rti, tle->resno,
+										   wclause, wfunc, opexpr, true,
+										   &keep_original))
+				return keep_original;
+		}
+	}
+
+	/* and check the right side */
+	var2 = lsecond(opexpr->args);
+	if (IsA(var2, Var) && var2->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var2->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		while (IsA(wfunc, RelabelType))
+			wfunc = (WindowFunc *) ((RelabelType *) wfunc)->arg;
+
+		if (IsA(wfunc, WindowFunc))
+		{
+			WindowClause *wclause = (WindowClause *)
+			list_nth(subquery->windowClause,
+					 wfunc->winref - 1);
+
+			if (find_window_run_conditions(subquery, rte, rti, tle->resno,
+										   wclause, wfunc, opexpr, false,
+										   &keep_original))
+				return keep_original;
+		}
+	}
+
+	return true;
+}
+
 /*
  * set_subquery_pathlist
  *		Generate SubqueryScan access paths for a subquery RTE
@@ -2178,19 +2550,30 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		foreach(l, rel->baserestrictinfo)
 		{
 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+			Node	   *clause = (Node *) rinfo->clause;
 
 			if (!rinfo->pseudoconstant &&
 				qual_is_pushdown_safe(subquery, rti, rinfo, &safetyInfo))
 			{
-				Node	   *clause = (Node *) rinfo->clause;
-
 				/* Push it down */
 				subquery_push_qual(subquery, rte, rti, clause);
 			}
 			else
 			{
-				/* Keep it in the upper query */
-				upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				/*
+				 * Since we can't push the qual down into the subquery, check
+				 * if it happens to reference a window function.  If so then
+				 * it might be useful to allow the window evaluation to stop
+				 * early.
+				 */
+				if (check_and_push_window_quals(subquery, rte, rti, clause))
+				{
+					/*
+					 * It's not a suitable window run condition qual or it is,
+					 * but the original must also be kept in the upper query.
+					 */
+					upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				}
 			}
 		}
 		rel->baserestrictinfo = upperrestrictlist;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index a5f6d678cc..642e088f7e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -286,6 +286,7 @@ static WindowAgg *make_windowagg(List *tlist, Index winref,
 								 int frameOptions, Node *startOffset, Node *endOffset,
 								 Oid startInRangeFunc, Oid endInRangeFunc,
 								 Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
+								 List *runcondition, List *runconditionorig,
 								 Plan *lefttree);
 static Group *make_group(List *tlist, List *qual, int numGroupCols,
 						 AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
@@ -2619,6 +2620,8 @@ create_windowagg_plan(PlannerInfo *root, WindowAggPath *best_path)
 						  wc->inRangeColl,
 						  wc->inRangeAsc,
 						  wc->inRangeNullsFirst,
+						  wc->runcondition,
+						  wc->runconditionorig,
 						  subplan);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6477,7 +6480,7 @@ make_windowagg(List *tlist, Index winref,
 			   int frameOptions, Node *startOffset, Node *endOffset,
 			   Oid startInRangeFunc, Oid endInRangeFunc,
 			   Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
-			   Plan *lefttree)
+			   List *runcondition, List *runconditionorig, Plan *lefttree)
 {
 	WindowAgg  *node = makeNode(WindowAgg);
 	Plan	   *plan = &node->plan;
@@ -6494,6 +6497,8 @@ make_windowagg(List *tlist, Index winref,
 	node->frameOptions = frameOptions;
 	node->startOffset = startOffset;
 	node->endOffset = endOffset;
+	node->runcondition = runcondition;
+	node->runconditionorig = runconditionorig;
 	node->startInRangeFunc = startInRangeFunc;
 	node->endInRangeFunc = endInRangeFunc;
 	node->inRangeColl = inRangeColl;
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index e50624c465..a925aa3d83 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -870,6 +870,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
 				wplan->endOffset =
 					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+				wplan->runcondition = fix_scan_list(root,
+													wplan->runcondition,
+													rtoffset,
+													NUM_EXEC_TLIST(plan));
+				wplan->runconditionorig = fix_scan_list(root,
+														wplan->runconditionorig,
+														rtoffset,
+														NUM_EXEC_TLIST(plan));
 			}
 			break;
 		case T_Result:
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index 2168080dcc..8abd4e8598 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -25,6 +25,7 @@
 #include "optimizer/optimizer.h"
 #include "utils/builtins.h"
 #include "utils/int8.h"
+#include "utils/lsyscache.h"
 
 
 typedef struct
@@ -877,6 +878,7 @@ int8dec(PG_FUNCTION_ARGS)
 }
 
 
+
 /*
  * These functions are exactly like int8inc/int8dec but are used for
  * aggregates that count only non-null values.  Since the functions are
@@ -904,6 +906,49 @@ int8dec_any(PG_FUNCTION_ARGS)
 	return int8dec(fcinfo);
 }
 
+/*
+ * int8inc_support
+ *		prosupport function for int8inc() and int8inc_any()
+ */
+Datum
+int8inc_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+		MonotonicFunction monotonic = MONOTONICFUNC_NONE;
+		int			frameOptions = req->window_clause->frameOptions;
+
+		/* No ORDER BY clause then all rows are peers */
+		if (req->window_clause->orderClause == NIL)
+			monotonic = MONOTONICFUNC_BOTH;
+		else
+		{
+			/*
+			 * Otherwise take into account the frame options.  When the frame
+			 * bound is the start of the window then the resulting value can
+			 * never decrease, therefore is monotonically increasing
+			 */
+			if (frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING)
+				monotonic |= MONOTONICFUNC_INCREASING;
+
+			/*
+			 * Likewise, if the frame bound is the end of the window then the
+			 * resulting value can never increase.
+			 */
+			if (frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING)
+				monotonic |= MONOTONICFUNC_DECREASING;
+		}
+
+		req->monotonic = monotonic;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 
 Datum
 int8larger(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/adt/windowfuncs.c b/src/backend/utils/adt/windowfuncs.c
index 9c127617d1..27257b7628 100644
--- a/src/backend/utils/adt/windowfuncs.c
+++ b/src/backend/utils/adt/windowfuncs.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "nodes/supportnodes.h"
 #include "utils/builtins.h"
 #include "windowapi.h"
 
@@ -88,6 +89,26 @@ window_row_number(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(curpos + 1);
 }
 
+/*
+ * window_row_number_support
+ *		prosupport function for window_row_number()
+ */
+Datum
+window_row_number_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* row_number() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
 
 /*
  * rank
@@ -110,6 +131,27 @@ window_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_rank_support
+ *		prosupport function for window_rank()
+ */
+Datum
+window_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* rank() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * dense_rank
  * Rank increases by 1 when key columns change.
@@ -130,6 +172,27 @@ window_dense_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_dense_rank_support
+ *		prosupport function for window_dense_rank()
+ */
+Datum
+window_dense_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* dense_rank() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * percent_rank
  * return fraction between 0 and 1 inclusive,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b603700ed9..22988d4c59 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6626,11 +6626,16 @@
 # count has two forms: count(any) and count(*)
 { oid => '2147',
   descr => 'number of input rows for which the input expression is not null',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => 'any', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => 'any',
+  prosrc => 'aggregate_dummy' },
 { oid => '2803', descr => 'number of input rows',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => '', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'aggregate_dummy' },
+{ oid => '8802', descr => 'planner support for count run condition',
+  proname => 'int8inc_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'int8inc_support' },
 
 { oid => '2718',
   descr => 'population variance of bigint input values (square of the population standard deviation)',
@@ -10064,14 +10069,26 @@
 
 # SQL-spec window functions
 { oid => '3100', descr => 'row number within partition',
-  proname => 'row_number', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_row_number' },
+  proname => 'row_number', prosupport => 'window_row_number_support',
+  prokind => 'w', proisstrict => 'f',  prorettype => 'int8',
+  proargtypes => '', prosrc => 'window_row_number' },
+{ oid => '8799', descr => 'planner support for row_number run condition',
+  proname => 'window_row_number_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_row_number_support' },
 { oid => '3101', descr => 'integer rank with gaps',
-  proname => 'rank', prokind => 'w', proisstrict => 'f', prorettype => 'int8',
+  proname => 'rank', prosupport => 'window_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8',
   proargtypes => '', prosrc => 'window_rank' },
+{ oid => '8800', descr => 'planner support for rank run condition',
+  proname => 'window_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_rank_support' },
 { oid => '3102', descr => 'integer rank without gaps',
-  proname => 'dense_rank', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_dense_rank' },
+  proname => 'dense_rank', prosupport => 'window_dense_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'window_dense_rank' },
+{ oid => '8801', descr => 'planner support for dense rank run condition',
+  proname => 'window_dense_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_dense_rank_support' },
 { oid => '3103', descr => 'fractional rank within partition',
   proname => 'percent_rank', prokind => 'w', proisstrict => 'f',
   prorettype => 'float8', proargtypes => '', prosrc => 'window_percent_rank' },
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 37cb4f3d59..1e2befd797 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2408,6 +2408,10 @@ typedef struct WindowAggState
 	MemoryContext curaggcontext;	/* current aggregate's working data */
 	ExprContext *tmpcontext;	/* short-term evaluation context */
 
+	ExprState  *runcondition;	/* Condition which must remain true otherwise
+								 * execution of the WindowAgg will finish, or
+								 * NULL if there is no such condition. */
+
 	bool		all_first;		/* true if the scan is starting */
 	bool		all_done;		/* true if the scan is finished */
 	bool		partition_spooled;	/* true if all tuples in current partition
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 6a4d82f0a8..9c4799d3e1 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -526,7 +526,8 @@ typedef enum NodeTag
 	T_SupportRequestSelectivity,	/* in nodes/supportnodes.h */
 	T_SupportRequestCost,		/* in nodes/supportnodes.h */
 	T_SupportRequestRows,		/* in nodes/supportnodes.h */
-	T_SupportRequestIndexCondition	/* in nodes/supportnodes.h */
+	T_SupportRequestIndexCondition, /* in nodes/supportnodes.h */
+	T_SupportRequestWFuncMonotonic	/* in nodes/supportnodes.h */
 } NodeTag;
 
 /*
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index e28248af32..b76e3106dd 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1381,6 +1381,8 @@ typedef struct WindowClause
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runcondition;	/* Exec WindowAgg while this is true */
+	List	   *runconditionorig;	/* EXPLAIN compatible version of above */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
 	Oid			inRangeColl;	/* collation for in_range tests */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index ec9a8b0c81..1c71e6db41 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -893,6 +893,9 @@ typedef struct WindowAgg
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runcondition;	/* Conditions that must remain true in order
+								 * for execution to continue */
+	List	   *runconditionorig;	/* runcondition for display in EXPLAIN */
 	/* these fields are used with RANGE offset PRECEDING/FOLLOWING: */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
@@ -1291,4 +1294,21 @@ typedef struct PlanInvalItem
 	uint32		hashValue;		/* hash value of object's cache lookup key */
 } PlanInvalItem;
 
+/*
+ * MonotonicFunction
+ *
+ * Allows the planner to track monotonic properties of functions.  A function
+ * is monotonically increasing if a subsequent call cannot yield a lower value
+ * than the previous call.  A monotonically decreasing function cannot yield a
+ * higher value, and a function which is both must return the same value on
+ * each call.
+ */
+typedef enum MonotonicFunction
+{
+	MONOTONICFUNC_NONE = 0,
+	MONOTONICFUNC_INCREASING = (1 << 0),
+	MONOTONICFUNC_DECREASING = (1 << 1),
+	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING
+} MonotonicFunction;
+
 #endif							/* PLANNODES_H */
diff --git a/src/include/nodes/supportnodes.h b/src/include/nodes/supportnodes.h
index 85e1b8a832..33718332c9 100644
--- a/src/include/nodes/supportnodes.h
+++ b/src/include/nodes/supportnodes.h
@@ -33,12 +33,12 @@
 #ifndef SUPPORTNODES_H
 #define SUPPORTNODES_H
 
-#include "nodes/primnodes.h"
+#include "nodes/plannodes.h"
 
 struct PlannerInfo;				/* avoid including pathnodes.h here */
 struct IndexOptInfo;
 struct SpecialJoinInfo;
-
+struct WindowClause;
 
 /*
  * The Simplify request allows the support function to perform plan-time
@@ -239,4 +239,58 @@ typedef struct SupportRequestIndexCondition
 								 * equivalent of the function call */
 } SupportRequestIndexCondition;
 
+/* ----------
+ * To allow more efficient execution of any monotonically increasing and/or
+ * monotonically decreasing window functions, we support calling the window
+ * function's prosupport function passing along this struct whenever the
+ * planner sees an OpExpr qual directly reference a window function in a
+ * subquery.  When the planner encounters this, we populate this struct and
+ * pass it along to the window function's prosupport function so that it can
+ * evaluate if the given WindowFunc is;
+ *
+ * a) monotonically increasing, or
+ * b) monotonically decreasing, or
+ * c) both monotonically increasing and decreasing, or
+ * d) none of the above.
+ *
+ * A function that is monotonically increasing can never return a value that
+ * is lower than a value returned in a "previous call".  A monotonically
+ * decreasing function can never return a value higher than a value returned
+ * in a previous call.  A function which is both must return the same value
+ * each time.
+ *
+ * We define "previous call" to mean a previous call to the same WindowFunc
+ * struct in the same window partition.
+ *
+ * row_number() is an example of a monotonically increasing function.  The
+ * return value will be reset back to 1 in each new partition.  An example of
+ * a monotonically decreasing function is count(*) over ().  Since there is
+ * no ORDER BY clause in this example, all rows in the partition are peers and
+ * all rows within the partition will be within the frame bound.  Likewise for
+ * count(*) over(order by a rows between unbounded preceding and unbounded
+ * following).
+ *
+ * Inputs:
+ *	'window_func' is the pointer to the window function being called.
+ *
+ *	'window_clause' pointer to the WindowClause data.  Support functions can
+ *	use this to check frame bounds and partition by clauses, etc.
+ *
+ * Outputs:
+ *	'monotonic' the resulting MonotonicFunction value for the given input
+ *	window function and window clause.
+ * ----------
+ */
+typedef struct SupportRequestWFuncMonotonic
+{
+	NodeTag		type;
+
+	/* Input fields: */
+	WindowFunc *window_func;	/* Pointer to the window function data */
+	struct WindowClause *window_clause; /* Pointer to the window clause data */
+
+	/* Output fields: */
+	MonotonicFunction monotonic;
+} SupportRequestWFuncMonotonic;
+
 #endif							/* SUPPORTNODES_H */
diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out
index bb9ff7f07b..7c343dc3e6 100644
--- a/src/test/regress/expected/window.out
+++ b/src/test/regress/expected/window.out
@@ -3336,6 +3336,275 @@ WHERE depname = 'sales';
                            ->  Seq Scan on empsalary
 (9 rows)
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                  QUERY PLAN                  
+----------------------------------------------
+ WindowAgg
+   Run Condition: (row_number() OVER (?) < 3)
+   ->  Sort
+         Sort Key: empsalary.empno
+         ->  Seq Scan on empsalary
+(5 rows)
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+               QUERY PLAN                
+-----------------------------------------
+ WindowAgg
+   Run Condition: (rank() OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+ empno | salary | r 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 2
+    11 |   5200 | 2
+(3 rows)
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Run Condition: (dense_rank() OVER (?) <= 1)
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+ empno | salary | dr 
+-------+--------+----
+     8 |   6000 |  1
+(1 row)
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                       QUERY PLAN                        
+---------------------------------------------------------
+ WindowAgg
+   Run Condition: (count(empsalary.empno) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) >= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER () c
+   FROM empsalary) emp
+WHERE 11 <= c;
+                 QUERY PLAN                 
+--------------------------------------------
+ WindowAgg
+   Run Condition: (11 <= count(*) OVER (?))
+   ->  Seq Scan on empsalary
+(3 rows)
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.rn < 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.depname, empsalary.empno
+               ->  Seq Scan on empsalary
+(6 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.c <= 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.depname, empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(6 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                  QUERY PLAN                   
+-----------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.c <= 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(6 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+                QUERY PLAN                
+------------------------------------------
+ Subquery Scan on emp
+   Filter: (3 <= emp.c)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary
+               ->  Seq Scan on empsalary
+(6 rows)
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
diff --git a/src/test/regress/sql/window.sql b/src/test/regress/sql/window.sql
index 41a8e0d152..c746d5343a 100644
--- a/src/test/regress/sql/window.sql
+++ b/src/test/regress/sql/window.sql
@@ -988,6 +988,146 @@ SELECT * FROM
    FROM empsalary) emp
 WHERE depname = 'sales';
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER () c
+   FROM empsalary) emp
+WHERE 11 <= c;
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
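
For readers skimming the patch, the decision that int8inc_support() makes for count(), and the way the regression tests above pair that result with a comparison operator, can be modelled outside the backend. This is only a sketch under stated assumptions: `Mono`, `CmpOp`, `count_monotonicity()` and `usable_as_run_condition()` are hypothetical stand-ins, not names from the patch, and the operator rule is inferred from the expected EXPLAIN output rather than copied from find_window_run_conditions().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the patch's MonotonicFunction bit flags. */
enum Mono
{
	MONO_NONE = 0,
	MONO_INCREASING = 1 << 0,
	MONO_DECREASING = 1 << 1,
	MONO_BOTH = MONO_INCREASING | MONO_DECREASING
};

/* Comparison written with the window function on the left: wfunc OP const. */
enum CmpOp { OP_LT, OP_LE, OP_GT, OP_GE };

/*
 * Simplified model of the branches in int8inc_support().  The three flags
 * are assumptions standing in for the WindowClause's orderClause and
 * frameOptions fields.
 */
int
count_monotonicity(bool has_order_by,
				   bool start_unbounded_preceding,
				   bool end_unbounded_following)
{
	int			mono = MONO_NONE;

	/* No ORDER BY: all rows are peers, so every row sees the same count. */
	if (!has_order_by)
		return MONO_BOTH;

	/* Frame anchored at the partition start: the count can only grow. */
	if (start_unbounded_preceding)
		mono |= MONO_INCREASING;

	/* Frame anchored at the partition end: the count can only shrink. */
	if (end_unbounded_following)
		mono |= MONO_DECREASING;

	return mono;
}

/*
 * Inferred rule: an upper bound (<, <=) can stop an increasing function,
 * a lower bound (>, >=) can stop a decreasing one.
 */
bool
usable_as_run_condition(int mono, enum CmpOp op)
{
	if (op == OP_LT || op == OP_LE)
		return (mono & MONO_INCREASING) != 0;
	return (mono & MONO_DECREASING) != 0;
}
```

Under this model, count(*) OVER (ORDER BY salary DESC) with c <= 3 qualifies, while the same window with 3 <= c does not, which agrees with the expected plans in window.out.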

Andy Fan

zhihui.fan1213@gmail.com

over 4 years ago

In reply to: David Rowley (#5)

Re: Window Function "Run Conditions"

Hi David:

Thanks for the patch.

On Wed, Aug 18, 2021 at 6:40 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Tue, 17 Aug 2021 at 03:51, Zhihong Yu <zyu@yugabyte.com> wrote:

+ if ((res->monotonic & MONOTONICFUNC_INCREASING) == MONOTONICFUNC_INCREASING)

The above can be simplified as 'if (res->monotonic & MONOTONICFUNC_INCREASING) '

True. I've attached an updated patch.

David

Looks like we need to narrow down the situation where we can apply
this optimization.

SELECT * FROM
(SELECT empno,
salary,
count(*) over (order by empno desc) as c ,
dense_rank() OVER (ORDER BY salary DESC) dr

FROM empsalary) emp
WHERE dr = 1;

In the current master, the result is:

 empno | salary | c | dr
-------+--------+---+----
     8 |   6000 | 4 |  1
(1 row)

In the patched version, the result is:

 empno | salary | c | dr
-------+--------+---+----
     8 |   6000 | 1 |  1
(1 row)

--
Best Regards
Andy Fan (https://www.aliyun.com/)

David Rowley

dgrowleyml@gmail.com

over 4 years ago

In reply to: Andy Fan (#6)

Re: Window Function "Run Conditions"

On Thu, 19 Aug 2021 at 00:20, Andy Fan <zhihui.fan1213@gmail.com> wrote:

In the current master, the result is:

empno | salary | c | dr
-------+--------+---+----
8 | 6000 | 4 | 1

In the patched version, the result is:

empno | salary | c | dr
-------+--------+---+----
8 | 6000 | 1 | 1

Thanks for taking it for a spin.

That's a bit unfortunate. I don't immediately see how to fix it other
than to restrict the optimisation to only apply when there's a single
WindowClause. It might be possible to relax it further and only apply
if it's the final window clause to be evaluated, but in those cases,
the savings are likely to be much less anyway as some previous
WindowAgg will have exhausted all rows from its subplan. Likely
restricting it to only working if there's 1 WindowClause would be fine
as for the people using row_number() for a top-N type query, there's
most likely only going to be 1 WindowClause.
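
A rough way to see why a second WindowClause breaks things is to model the two WindowAgg nodes as two passes over an array, with the lower pass allowed to stop early. The sketch below is only an illustration: the four rows are made-up sample data, and `Row`, `dense_rank_agg()`, `count_agg()` and `c_for_top_salary()` are hypothetical names, not executor code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct
{
	int			empno;
	int			salary;
	int			dr;				/* dense_rank() OVER (ORDER BY salary DESC) */
	int			c;				/* count(*) OVER (ORDER BY empno DESC) */
} Row;

static int
by_salary_desc(const void *a, const void *b)
{
	const Row  *ra = a;
	const Row  *rb = b;

	return (rb->salary > ra->salary) - (rb->salary < ra->salary);
}

static int
by_empno_desc(const void *a, const void *b)
{
	const Row  *ra = a;
	const Row  *rb = b;

	return (rb->empno > ra->empno) - (rb->empno < ra->empno);
}

/*
 * Lower WindowAgg.  If stop_after_rank > 0 it acts as the run condition
 * "dr <= stop_after_rank" and stops emitting rows once that is false.
 * Returns the number of rows emitted (kept at the front of the array).
 */
size_t
dense_rank_agg(Row *rows, size_t n, int stop_after_rank)
{
	int			rank = 0;

	qsort(rows, n, sizeof(Row), by_salary_desc);
	for (size_t i = 0; i < n; i++)
	{
		if (i == 0 || rows[i].salary != rows[i - 1].salary)
			rank++;
		if (stop_after_rank > 0 && rank > stop_after_rank)
			return i;			/* run condition no longer true */
		rows[i].dr = rank;
	}
	return n;
}

/* Upper WindowAgg.  empno is unique here, so the count is the position. */
void
count_agg(Row *rows, size_t n)
{
	qsort(rows, n, sizeof(Row), by_empno_desc);
	for (size_t i = 0; i < n; i++)
		rows[i].c = (int) (i + 1);
}

/* Returns c for the row that survives the final "dr = 1" filter. */
int
c_for_top_salary(int use_run_condition)
{
	/* Made-up sample rows: {empno, salary, dr, c} */
	Row			rows[] = {
		{8, 6000, 0, 0}, {10, 5200, 0, 0}, {11, 5200, 0, 0}, {9, 4500, 0, 0}
	};
	size_t		n = dense_rank_agg(rows, 4, use_run_condition ? 1 : 0);

	count_agg(rows, n);
	for (size_t i = 0; i < n; i++)
		if (rows[i].dr == 1)
			return rows[i].c;
	return -1;
}
```

With the run condition off, the upper pass sees all four rows and the empno 8 row gets c = 4; with it on, the upper pass only ever sees one row and reports c = 1, the same kind of discrepancy as in the report above.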

Anyway, I'll take a few more days to think about it before posting a fix.

David

Andy Fan

zhihui.fan1213@gmail.com

over 4 years ago

In reply to: David Rowley (#7)

Re: Window Function "Run Conditions"

On Thu, Aug 19, 2021 at 2:35 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 19 Aug 2021 at 00:20, Andy Fan <zhihui.fan1213@gmail.com> wrote:

In the current master, the result is:

empno | salary | c | dr
-------+--------+---+----
8 | 6000 | 4 | 1

In the patched version, the result is:

empno | salary | c | dr
-------+--------+---+----
8 | 6000 | 1 | 1

Thanks for taking it for a spin.

That's a bit unfortunate. I don't immediately see how to fix it other
than to restrict the optimisation to only apply when there's a single
WindowClause. It might be possible to relax it further and only apply
if it's the final window clause to be evaluated, but in those cases,
the savings are likely to be much less anyway as some previous
WindowAgg will have exhausted all rows from its subplan.

I am trying to hack the select_active_windows function so that the
WindowClause with a run condition is the last one to be evaluated (we
also need to consider the case where more than one window function has
a run condition). By that point, the run condition clause is already
available.

However, there are two problems with this direction: a) it may conflict
with "the windows that need the same sorting are adjacent in the list";
b) "when two or more windows are order-equivalent then all peer rows
must be presented in the same order in all of them. .. (See General Rule 4 of
<window clause> in SQL2008 - SQL2016.)"

In summary, I am not sure whether it is correct to change the execution
order of the WindowAggs freely.

Likely
restricting it to only working if there's 1 WindowClause would be fine
as for the people using row_number() for a top-N type query, there's
most likely only going to be 1 WindowClause.

This sounds practical. I'd also suggest the following small change:
check the partitionClause before looking up the prosupport function.

@@ -2133,20 +2133,22 @@ find_window_run_conditions(Query *subquery, RangeTblEntry *rte, Index rti,

*keep_original = true;

-       prosupport = get_func_support(wfunc->winfnoid);
-
-       /* Check if there's a support function for 'wfunc' */
-       if (!OidIsValid(prosupport))
-               return false;
-
        /*
         * Currently the WindowAgg node just stop when the run condition is no
         * longer true.  If there is a PARTITION BY clause then we cannot just
         * stop as other partitions still need to be processed.
         */
+
+       /* Check this first since window function with a partition clause is common */
        if (wclause->partitionClause != NIL)
                return false;

+       prosupport = get_func_support(wfunc->winfnoid);
+
+       /* Check if there's a support function for 'wfunc' */
+       if (!OidIsValid(prosupport))
+               return false;
+
        /* get the Expr from the other side of the OpExpr */
        if (wfunc_left)
                otherexpr = lsecond(opexpr->args);

--
Best Regards
Andy Fan (https://www.aliyun.com/)

Greg Stark

stark@mit.edu

almost 4 years ago

In reply to: Andy Fan (#8)

Re: Window Function "Run Conditions"

This looks like an awesome addition.

I have one technical question...

Is it possible to actually transform the row_number case into a LIMIT
clause or make the planner support for this case equivalent to it (in
which case we can replace the LIMIT clause planning to transform into
a window function)?

The reason I ask is because the Limit plan node is actually quite a
bit more optimized than the general window function plan node. It
calculates cost estimates based on the limit and can support Top-N
sort nodes.

But the bigger question is whether this patch is ready for a committer
to look at? Were you able to resolve Andy Fan's bug report? Did you
resolve the two questions in the original email?

#10

Corey Huinker

corey.huinker@gmail.com

almost 4 years ago

In reply to: Greg Stark (#9)

Re: Window Function "Run Conditions"

On Tue, Mar 15, 2022 at 5:24 PM Greg Stark <stark@mit.edu> wrote:

This looks like an awesome addition.

I have one technical question...

Is it possible to actually transform the row_number case into a LIMIT
clause or make the planner support for this case equivalent to it (in
which case we can replace the LIMIT clause planning to transform into
a window function)?

The reason I ask is because the Limit plan node is actually quite a
bit more optimized than the general window function plan node. It
calculates cost estimates based on the limit and can support Top-N
sort nodes.

But the bigger question is whether this patch is ready for a committer
to look at? Were you able to resolve Andy Fan's bug report? Did you
resolve the two questions in the original email?

+1 to all this

It seems like this effort would aid in implementing what some other
databases implement via the QUALIFY clause, which is to window functions
what HAVING is to aggregate functions.
example:
https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#qualify_clause

#11

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: David Rowley (#7)

Re: Window Function "Run Conditions"

Hi,

On 2021-08-19 18:35:27 +1200, David Rowley wrote:

Anyway, I'll take a few more days to think about it before posting a fix.

The patch in the CF entry doesn't apply: http://cfbot.cputube.org/patch_37_3234.log

The quoted message was ~6 months ago. I think this CF entry should be marked
as returned-with-feedback?

- Andres

#12

David Rowley

dgrowleyml@gmail.com

almost 4 years ago

In reply to: Andy Fan (#8)

1 attachment(s)

Re: Window Function "Run Conditions"

On Thu, 26 Aug 2021 at 14:54, Andy Fan <zhihui.fan1213@gmail.com> wrote:

On Thu, Aug 19, 2021 at 2:35 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 19 Aug 2021 at 00:20, Andy Fan <zhihui.fan1213@gmail.com> wrote:

In the current master, the result is:

 empno | salary | c | dr
-------+--------+---+----
     8 |   6000 | 4 |  1

In the patched version, the result is:

 empno | salary | c | dr
-------+--------+---+----
     8 |   6000 | 1 |  1

Thanks for taking it for a spin.

That's a bit unfortunate. I don't immediately see how to fix it other
than to restrict the optimisation to only apply when there's a single
WindowClause. It might be possible to relax it further and only apply
if it's the final window clause to be evaluated, but in those cases,
the savings are likely to be much less anyway as some previous
WindowAgg will have exhausted all rows from its subplan.

I am trying to hack the select_active_windows function to make
sure the WindowClause with a run condition is the last one to be
evaluated (we also need to consider the case where more than one window
function has a run condition); at that time, the run condition clause
is ready already.

However, there are two problems with this direction: a) This may conflict
with "the windows that need the same sorting are adjacent in the list."
b) "when two or more windows are order-equivalent then all peer rows
must be presented in the same order in all of them. .. (See General Rule 4 of
<window clause> in SQL2008 - SQL2016.)"

In summary, I am not sure if it is correct to change the execution order
of WindowAgg freely.

Thanks for looking at that.

My current thoughts are that it just feels a little too risky to
adjust the comparison function that sorts the window clauses to pay
attention to the run-condition.

We would need to ensure that there's just a single window function
with a run condition as it wouldn't be valid for there to be multiple.
It would be easy enough to ensure we only push quals into just 1
window clause, but that and meddling with the evaluation order has
trade-offs. To do that properly, we'd likely want to consider the
costs when deciding which window clause would benefit from having
quals pushed the most. Plus, if we start messing with the evaluation
order then we'd likely really want some sort of costing to check if
pushing a qual and adjusting the evaluation order is actually cheaper
than not pushing the qual and keeping the clauses in the order that
requires the minimum number of sorts. The planner is not really
geared up for costing things like that properly, that's why we just
assume the order with the least sorts is best. In reality that's often
not going to be true as an index may exist and we might want to
evaluate a clause first if we could get rid of a sort and index scan
instead. Sorting the window clauses based on their SortGroupClause is
just the best we can do for now at that stage in planning.

I think it's safer to just disable the optimisation when there are
multiple window clauses. Multiple matching clauses are merged
already, so it's perfectly valid to have multiple window functions,
it's just they must share the same window clause. I don't think
that's terrible as with the major use case that I have in mind for
this, the window function is only added to limit the number of rows.
In most cases I can imagine, there'd be no reason to have an
additional window function with different frame options.
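
To make the failure mode concrete, here is a minimal standalone sketch (not
code from the patch) of two stacked window steps.  The Row struct, the
function names and the four salary values are all hypothetical.  The lower
step computes dense_rank() and may stop early on a "dr <= 1" run condition;
the upper step computes count(*) over whatever rows it receives.  If the
lower step stops early, the upper step counts too few rows, which is the
kind of wrong answer reported above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical row flowing between two stacked window steps. */
typedef struct Row
{
	int		salary;
	int		dr;			/* dense_rank() over (order by salary desc) */
	int		c;			/* count(*) over () */
} Row;

/*
 * Lower window step: assign dense_rank to rows that are already sorted by
 * salary descending.  When 'use_run_condition' is set, stop emitting rows
 * as soon as "dr <= max_dr" becomes false.  Returns the number of rows
 * emitted into 'out'.
 */
static size_t
dense_rank_step(const int *salaries, size_t nrows, int max_dr,
				bool use_run_condition, Row *out)
{
	size_t	nout = 0;
	int		rank = 0;

	for (size_t i = 0; i < nrows; i++)
	{
		/* the rank only goes up when the sort key changes */
		if (i == 0 || salaries[i] != salaries[i - 1])
			rank++;

		/* run condition no longer true: the rank can never go back down */
		if (use_run_condition && rank > max_dr)
			break;

		out[nout].salary = salaries[i];
		out[nout].dr = rank;
		out[nout].c = 0;
		nout++;
	}
	return nout;
}

/* Upper window step: count(*) over every row received from below. */
static void
count_step(Row *rows, size_t nrows)
{
	for (size_t i = 0; i < nrows; i++)
		rows[i].c = (int) nrows;
}

/* Run both steps and return the count seen by the top-ranked row. */
static int
count_seen_by_top_row(bool use_run_condition)
{
	static const int salaries[] = {6000, 5200, 5200, 4800};	/* made-up data */
	Row		rows[4];
	size_t	n = dense_rank_step(salaries, 4, 1, use_run_condition, rows);

	count_step(rows, n);
	return rows[0].c;
}
```

With the run condition disabled the top row sees all four input rows; with
it enabled the count step only ever receives one row.  Putting the run
condition on a step that is not the last one evaluated changes the results
of the steps above it, so restricting the optimisation to a single window
clause avoids the problem entirely.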

I've attached an updated patch.

Attachments:

v2-0001-Teach-planner-and-executor-about-monotonic-window.patch (text/plain; charset=US-ASCII)

From a130b09b7acf5e98785e4560b98f4aca2fce4080 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Sat, 8 May 2021 23:43:25 +1200
Subject: [PATCH v2] Teach planner and executor about monotonic window funcs

Window functions such as row_number() always return a value higher than
the previous one in any given partition.  If a query were to only
request the first few row numbers, then traditionally we would continue
evaluating the WindowAgg node until all tuples are exhausted.  However, if
someone, say, only wanted all row numbers <= 10, then we could just stop
once we get to number 11.

Here we implement means to do just that.  This is done by way of adding a
prosupport function to various of the built-in window functions and adding
supporting code to allow the support function to inform the planner if
the function is monotonically increasing, monotonically decreasing, both
or neither.  The planner is then able to make use of that information and
possibly allow the executor to short-circuit execution by way of adding a
"run condition" to the WindowAgg to instruct it to run while this
condition is true and stop when it becomes false.

This optimization is only possible when the subquery contains only a
single window clause.  The problem with having multiple clauses is that at
the time when we're looking for run conditions, we don't yet know the
order that the window clauses will be evaluated.  We cannot add a run
condition to a window clause that is evaluated first as we may stop
execution and not output rows that are required for a window clause which
will be evaluated later.

Here we add prosupport functions to allow this to work for row_number(),
rank(), dense_rank(), count(*) and count(expr).

Author: David Rowley
Reviewed-by: Andy Fan
Discussion: https://postgr.es/m/CAApHDvqvp3At8++yF8ij06sdcoo1S_b2YoaT9D4Nf+MObzsrLQ@mail.gmail.com
---
 src/backend/commands/explain.c          |   4 +
 src/backend/executor/nodeWindowAgg.c    |  28 +-
 src/backend/nodes/copyfuncs.c           |   4 +
 src/backend/nodes/equalfuncs.c          |   2 +
 src/backend/nodes/outfuncs.c            |   4 +
 src/backend/nodes/readfuncs.c           |   4 +
 src/backend/optimizer/path/allpaths.c   | 405 +++++++++++++++++++++++-
 src/backend/optimizer/plan/createplan.c |   7 +-
 src/backend/optimizer/plan/setrefs.c    |   8 +
 src/backend/utils/adt/int8.c            |  45 +++
 src/backend/utils/adt/windowfuncs.c     |  63 ++++
 src/include/catalog/pg_proc.dat         |  35 +-
 src/include/nodes/execnodes.h           |   4 +
 src/include/nodes/nodes.h               |   3 +-
 src/include/nodes/parsenodes.h          |   2 +
 src/include/nodes/plannodes.h           |  20 ++
 src/include/nodes/supportnodes.h        |  61 +++-
 src/test/regress/expected/window.out    | 315 ++++++++++++++++++
 src/test/regress/sql/window.sql         | 165 ++++++++++
 19 files changed, 1161 insertions(+), 18 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9f632285b6..92ca4acf50 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1985,6 +1985,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_WindowAgg:
+			show_upper_qual(((WindowAgg *) plan)->runconditionorig,
+							"Run Condition", planstate, ancestors, es);
+			break;
 		case T_Group:
 			show_group_keys(castNode(GroupState, planstate), ancestors, es);
 			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 08ce05ca5a..3e4f00e162 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -2023,6 +2023,7 @@ static TupleTableSlot *
 ExecWindowAgg(PlanState *pstate)
 {
 	WindowAggState *winstate = castNode(WindowAggState, pstate);
+	TupleTableSlot *slot;
 	ExprContext *econtext;
 	int			i;
 	int			numfuncs;
@@ -2235,7 +2236,20 @@ ExecWindowAgg(PlanState *pstate)
 	 */
 	econtext->ecxt_outertuple = winstate->ss.ss_ScanTupleSlot;
 
-	return ExecProject(winstate->ss.ps.ps_ProjInfo);
+	slot = ExecProject(winstate->ss.ps.ps_ProjInfo);
+
+	/*
+	 * Now evaluate the run condition to see if we need to continue further
+	 * with execution.
+	 */
+	econtext->ecxt_scantuple = slot;
+	if (!ExecQual(winstate->runcondition, econtext))
+	{
+		winstate->all_done = true;
+		return NULL;
+	}
+
+	return slot;
 }
 
 /* -----------------
@@ -2307,6 +2321,18 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 	Assert(node->plan.qual == NIL);
 	winstate->ss.ps.qual = NULL;
 
+	/*
+	 * Setup the run condition, if we received one from the query planner.
+	 * When set, this can allow us to finish execution early because we know
+	 * some higher-level filter exists that would just filter out any further
+	 * results that we produce.
+	 */
+	if (node->runcondition != NIL)
+		winstate->runcondition = ExecInitQual(node->runcondition,
+											  (PlanState *) winstate);
+	else
+		winstate->runcondition = NULL;
+
 	/*
 	 * initialize child nodes
 	 */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 55c36b46a8..1ce3f50ce1 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1102,6 +1102,8 @@ _copyWindowAgg(const WindowAgg *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runcondition);
+	COPY_NODE_FIELD(runconditionorig);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
@@ -2625,6 +2627,8 @@ _copyWindowClause(const WindowClause *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runcondition);
+	COPY_NODE_FIELD(runconditionorig);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 0ddebd066e..b893cffb64 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2909,6 +2909,8 @@ _equalWindowClause(const WindowClause *a, const WindowClause *b)
 	COMPARE_SCALAR_FIELD(frameOptions);
 	COMPARE_NODE_FIELD(startOffset);
 	COMPARE_NODE_FIELD(endOffset);
+	COMPARE_NODE_FIELD(runcondition);
+	COMPARE_NODE_FIELD(runconditionorig);
 	COMPARE_SCALAR_FIELD(startInRangeFunc);
 	COMPARE_SCALAR_FIELD(endInRangeFunc);
 	COMPARE_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 449d90c8f4..a22e7983b0 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -827,6 +827,8 @@ _outWindowAgg(StringInfo str, const WindowAgg *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runcondition);
+	WRITE_NODE_FIELD(runconditionorig);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
@@ -3178,6 +3180,8 @@ _outWindowClause(StringInfo str, const WindowClause *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runcondition);
+	WRITE_NODE_FIELD(runconditionorig);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 6f398cdc15..937d7394bd 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -382,6 +382,8 @@ _readWindowClause(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runcondition);
+	READ_NODE_FIELD(runconditionorig);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
@@ -2392,6 +2394,8 @@ _readWindowAgg(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runcondition);
+	READ_NODE_FIELD(runconditionorig);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 169b1d53fc..d823c09e3c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/supportnodes.h"
 #ifdef OPTIMIZER_DEBUG
 #include "nodes/print.h"
 #endif
@@ -2157,6 +2158,391 @@ has_multiple_baserels(PlannerInfo *root)
 	return false;
 }
 
+/*
+ * find_window_run_conditions
+ *		Call wfunc's prosupport function to ask if the given window function
+ *		is monotonic and then see if 'opexpr' can be used to stop processing
+ *		WindowAgg nodes early.
+ *
+ * For example row_number() over (order by ...) always produces a value one
+ * higher than the previous.  If someone has a window function such as that
+ * in a subquery and just wants, say all rows with a row number less than or
+ * equal to 10, then we may as well stop processing the windowagg once the row
+ * number reaches 11.  Here we look for opexprs that might help us to stop
+ * doing needless extra processing in WindowAgg nodes.
+ *
+ * To do this we make use of the window function's prosupport function to
+ * determine if the given window function with the given window clause is a
+ * monotonically increasing or decreasing function.
+ *
+ * '*keep_original' is set to true if the caller should also use 'opexpr' for
+ * its original purpose.  This is set to false if the caller can assume that
+ * the run condition will handle all of the required filtering.
+ *
+ * Returns true if a run condition qual was found and added to the
+ * wclause->runcondition list and sets *keep_original accordingly.  Returns
+ * false if unable to use 'opexpr' as a run condition and does not set
+ * *keep_original.
+ */
+static bool
+find_window_run_conditions(Query *subquery, RangeTblEntry *rte, Index rti,
+						   AttrNumber attno, WindowClause *wclause,
+						   WindowFunc *wfunc, OpExpr *opexpr, bool wfunc_left,
+						   bool *keep_original)
+{
+	Oid			prosupport;
+	Expr	   *otherexpr;
+	SupportRequestWFuncMonotonic req;
+	SupportRequestWFuncMonotonic *res;
+	List	   *opinfos;
+	OpExpr	   *runopexpr;
+	ListCell   *lc;
+
+	*keep_original = true;
+
+	/*
+	 * Currently the WindowAgg node just stops when the run condition is no
+	 * longer true.  If there is a PARTITION BY clause then we cannot just
+	 * stop as other partitions still need to be processed.
+	 */
+	if (wclause->partitionClause != NIL)
+		return false;
+
+	prosupport = get_func_support(wfunc->winfnoid);
+
+	/* Check if there's a support function for 'wfunc' */
+	if (!OidIsValid(prosupport))
+		return false;
+
+	/* get the Expr from the other side of the OpExpr */
+	if (wfunc_left)
+		otherexpr = lsecond(opexpr->args);
+	else
+		otherexpr = linitial(opexpr->args);
+
+	/*
+	 * The value being compared must not change during the evaluation of the
+	 * window partition.
+	 */
+	if (!is_pseudo_constant_clause((Node *) otherexpr))
+		return false;
+
+	req.type = T_SupportRequestWFuncMonotonic;
+	req.window_func = wfunc;
+	req.window_clause = wclause;
+
+	/* Call the support function */
+	res = (SupportRequestWFuncMonotonic *)
+		DatumGetPointer(OidFunctionCall1(prosupport,
+										 PointerGetDatum(&req)));
+
+	/*
+	 * Nothing to do if the function is not monotonically increasing or
+	 * decreasing.
+	 */
+	if (res == NULL || res->monotonic == MONOTONICFUNC_NONE)
+		return false;
+
+	runopexpr = NULL;
+	opinfos = get_op_btree_interpretation(opexpr->opno);
+
+	foreach(lc, opinfos)
+	{
+		OpBtreeInterpretation *opinfo = (OpBtreeInterpretation *) lfirst(lc);
+		int			strategy = opinfo->strategy;
+
+		if (wfunc_left)
+		{
+			/* Handle < / <= */
+			if (strategy == BTLessStrategyNumber ||
+				strategy == BTLessEqualStrategyNumber)
+			{
+				/*
+				 * If the frame is bound to the top of the window then the
+				 * result cannot decrease.
+				 */
+				if (res->monotonic & MONOTONICFUNC_INCREASING)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+				}
+				break;
+			}
+			/* Handle > / >= */
+			else if (strategy == BTGreaterStrategyNumber ||
+					 strategy == BTGreaterEqualStrategyNumber)
+			{
+				/*
+				 * If the frame is bound to the bottom of the window then the
+				 * result cannot increase.
+				 */
+				if (res->monotonic & MONOTONICFUNC_DECREASING)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+				}
+				break;
+			}
+			/* Handle = */
+			else if (strategy == BTEqualStrategyNumber)
+			{
+				OpExpr	   *newopexpr;
+				Oid			op;
+				int16		newstrategy;
+
+				/*
+				 * When both monotonically increasing and decreasing then the
+				 * return value of the window function will be the same each
+				 * time.  We can simply use 'opexpr' as the run condition
+				 * without modifying it.
+				 */
+				if ((res->monotonic & MONOTONICFUNC_BOTH) == MONOTONICFUNC_BOTH)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+					break;
+				}
+
+				/*
+				 * When monotonically increasing we make a qual with <wfunc>
+				 * <= <value> in order to filter out values which are above
+				 * the value in the equality condition.  For monotonically
+				 * decreasing we want to filter values below the value in the
+				 * equality condition.
+				 */
+				if (res->monotonic & MONOTONICFUNC_INCREASING)
+					newstrategy = BTLessEqualStrategyNumber;
+				else
+					newstrategy = BTGreaterEqualStrategyNumber;
+
+				op = get_opfamily_member(opinfo->opfamily_id,
+										 opinfo->oplefttype,
+										 opinfo->oprighttype,
+										 newstrategy);
+
+				newopexpr = (OpExpr *) make_opclause(op,
+													 opexpr->opresulttype,
+													 opexpr->opretset,
+													 (Expr *) wfunc,
+													 otherexpr,
+													 opexpr->opcollid,
+													 opexpr->inputcollid);
+				newopexpr->opfuncid = get_opcode(op);
+
+				*keep_original = true;
+				runopexpr = newopexpr;
+				break;
+			}
+		}
+		else
+		{
+			/* Handle > / >= */
+			if (strategy == BTGreaterStrategyNumber ||
+				strategy == BTGreaterEqualStrategyNumber)
+			{
+				/*
+				 * If the frame is bound to the top of the window then the
+				 * result cannot decrease.
+				 */
+				if (res->monotonic & MONOTONICFUNC_INCREASING)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+				}
+				break;
+			}
+			/* Handle < / <= */
+			else if (strategy == BTLessStrategyNumber ||
+					 strategy == BTLessEqualStrategyNumber)
+			{
+				/*
+				 * If the frame is bound to the bottom of the window then the
+				 * result cannot increase.
+				 */
+				if (res->monotonic & MONOTONICFUNC_DECREASING)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+				}
+				break;
+			}
+			/* Handle = */
+			else if (strategy == BTEqualStrategyNumber)
+			{
+				OpExpr	   *newopexpr;
+				Oid			op;
+				int16		newstrategy;
+
+				/*
+				 * When both monotonically increasing and decreasing then the
+				 * return value of the window function will be the same each
+				 * time.  We can simply use 'opexpr' as the run condition
+				 * without modifying it.
+				 */
+				if ((res->monotonic & MONOTONICFUNC_BOTH) == MONOTONICFUNC_BOTH)
+				{
+					*keep_original = false;
+					runopexpr = opexpr;
+					break;
+				}
+
+				/*
+				 * When monotonically increasing we make a qual with <value>
+				 * >= <wfunc> in order to filter out values which are above
+				 * the value in the equality condition.  For monotonically
+				 * decreasing we want to filter values below the value in the
+				 * equality condition.
+				 */
+				if (res->monotonic & MONOTONICFUNC_INCREASING)
+					newstrategy = BTGreaterEqualStrategyNumber;
+				else
+					newstrategy = BTLessEqualStrategyNumber;
+
+				op = get_opfamily_member(opinfo->opfamily_id,
+										 opinfo->oplefttype,
+										 opinfo->oprighttype,
+										 newstrategy);
+
+				newopexpr = (OpExpr *) make_opclause(op,
+													 opexpr->opresulttype,
+													 opexpr->opretset,
+													 otherexpr,
+													 (Expr *) wfunc,
+													 opexpr->opcollid,
+													 opexpr->inputcollid);
+				newopexpr->opfuncid = get_opcode(op);
+
+				*keep_original = true;
+				runopexpr = newopexpr;
+				break;
+			}
+		}
+	}
+
+	if (runopexpr != NULL)
+	{
+		Expr	   *origexpr;
+
+		wclause->runcondition = lappend(wclause->runcondition, runopexpr);
+
+		/* also create a version of the qual that we can display in EXPLAIN */
+		if (wfunc_left)
+			origexpr = make_opclause(runopexpr->opno,
+									 runopexpr->opresulttype,
+									 runopexpr->opretset, (Expr *) wfunc,
+									 otherexpr, runopexpr->opcollid,
+									 runopexpr->inputcollid);
+		else
+			origexpr = make_opclause(runopexpr->opno,
+									 runopexpr->opresulttype,
+									 runopexpr->opretset,
+									 otherexpr, (Expr *) wfunc,
+									 runopexpr->opcollid,
+									 runopexpr->inputcollid);
+
+		wclause->runconditionorig = lappend(wclause->runconditionorig, origexpr);
+		return true;
+	}
+
+	/* unsupported OpExpr */
+	return false;
+}
+
+/*
+ * check_and_push_window_quals
+ *		Check if 'rinfo' is a qual that can be pushed into a WindowFunc as a
+ *		'runcondition' qual.  These, when present, cause the window function
+ *		evaluation to stop when the condition becomes false.
+ *
+ * Returns true if the caller still must keep the original qual or false if
+ * the caller can safely ignore the original qual because the window function
+ * will use the run condition to stop at the right time.
+ */
+static bool
+check_and_push_window_quals(Query *subquery, RangeTblEntry *rte, Index rti,
+							Node *clause)
+{
+	OpExpr	   *opexpr = (OpExpr *) clause;
+	bool		keep_original = true;
+	Var		   *var1;
+	Var		   *var2;
+
+	/* We're only able to use OpExprs with 2 operands */
+	if (!IsA(opexpr, OpExpr))
+		return true;
+
+	if (list_length(opexpr->args) != 2)
+		return true;
+
+	/*
+	 * We don't currently attempt to generate window function run conditions
+	 * when there are multiple window clauses.  It may seem possible to relax
+	 * this a bit and ensure we only put run conditions on the final window
+	 * clause to be evaluated, however, currently we've yet to determine the
+	 * order that the window clauses will be evaluated.  This is done later in
+	 * select_active_windows().  If we were to put run conditions on anything
+	 * apart from the final window clause to be evaluated then we may filter
+	 * rows that are required for a yet-to-be-evaluated window clause.
+	 */
+	if (list_length(subquery->windowClause) > 1)
+		return true;
+
+	/*
+	 * Check for plain Vars which reference window functions in the subquery.
+	 * If we find any, we'll ask find_window_run_conditions() if 'opexpr' can
+	 * be used as a run condition to allow us to stop windowagg execution
+	 * early.
+	 */
+
+	/* Check the left side of the OpExpr */
+	var1 = linitial(opexpr->args);
+	if (IsA(var1, Var) && var1->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var1->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		while (IsA(wfunc, RelabelType))
+			wfunc = (WindowFunc *) ((RelabelType *) wfunc)->arg;
+
+		if (IsA(wfunc, WindowFunc))
+		{
+			WindowClause *wclause = (WindowClause *)
+			list_nth(subquery->windowClause,
+					 wfunc->winref - 1);
+
+			if (find_window_run_conditions(subquery, rte, rti, tle->resno,
+										   wclause, wfunc, opexpr, true,
+										   &keep_original))
+				return keep_original;
+		}
+	}
+
+	/* and check the right side */
+	var2 = lsecond(opexpr->args);
+	if (IsA(var2, Var) && var2->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var2->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		while (IsA(wfunc, RelabelType))
+			wfunc = (WindowFunc *) ((RelabelType *) wfunc)->arg;
+
+		if (IsA(wfunc, WindowFunc))
+		{
+			WindowClause *wclause = (WindowClause *)
+			list_nth(subquery->windowClause,
+					 wfunc->winref - 1);
+
+			if (find_window_run_conditions(subquery, rte, rti, tle->resno,
+										   wclause, wfunc, opexpr, false,
+										   &keep_original))
+				return keep_original;
+		}
+	}
+
+	return true;
+}
+
 /*
  * set_subquery_pathlist
  *		Generate SubqueryScan access paths for a subquery RTE
@@ -2245,19 +2631,30 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		foreach(l, rel->baserestrictinfo)
 		{
 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+			Node	   *clause = (Node *) rinfo->clause;
 
 			if (!rinfo->pseudoconstant &&
 				qual_is_pushdown_safe(subquery, rti, rinfo, &safetyInfo))
 			{
-				Node	   *clause = (Node *) rinfo->clause;
-
 				/* Push it down */
 				subquery_push_qual(subquery, rte, rti, clause);
 			}
 			else
 			{
-				/* Keep it in the upper query */
-				upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				/*
+				 * Since we can't push the qual down into the subquery, check
+				 * if it happens to reference a window function.  If so then
+				 * it might be useful to allow the window evaluation to stop
+				 * early.
+				 */
+				if (check_and_push_window_quals(subquery, rte, rti, clause))
+				{
+					/*
+					 * It's not a suitable window run condition qual or it is,
+					 * but the original must also be kept in the upper query.
+					 */
+					upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				}
 			}
 		}
 		rel->baserestrictinfo = upperrestrictlist;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fa069a217c..1c750bd7a4 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -288,6 +288,7 @@ static WindowAgg *make_windowagg(List *tlist, Index winref,
 								 int frameOptions, Node *startOffset, Node *endOffset,
 								 Oid startInRangeFunc, Oid endInRangeFunc,
 								 Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
+								 List *runcondition, List *runconditionorig,
 								 Plan *lefttree);
 static Group *make_group(List *tlist, List *qual, int numGroupCols,
 						 AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
@@ -2641,6 +2642,8 @@ create_windowagg_plan(PlannerInfo *root, WindowAggPath *best_path)
 						  wc->inRangeColl,
 						  wc->inRangeAsc,
 						  wc->inRangeNullsFirst,
+						  wc->runcondition,
+						  wc->runconditionorig,
 						  subplan);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6525,7 +6528,7 @@ make_windowagg(List *tlist, Index winref,
 			   int frameOptions, Node *startOffset, Node *endOffset,
 			   Oid startInRangeFunc, Oid endInRangeFunc,
 			   Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
-			   Plan *lefttree)
+			   List *runcondition, List *runconditionorig, Plan *lefttree)
 {
 	WindowAgg  *node = makeNode(WindowAgg);
 	Plan	   *plan = &node->plan;
@@ -6542,6 +6545,8 @@ make_windowagg(List *tlist, Index winref,
 	node->frameOptions = frameOptions;
 	node->startOffset = startOffset;
 	node->endOffset = endOffset;
+	node->runcondition = runcondition;
+	node->runconditionorig = runconditionorig;
 	node->startInRangeFunc = startInRangeFunc;
 	node->endInRangeFunc = endInRangeFunc;
 	node->inRangeColl = inRangeColl;
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index a7b11b7f03..eaeba17088 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -897,6 +897,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
 				wplan->endOffset =
 					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+				wplan->runcondition = fix_scan_list(root,
+													wplan->runcondition,
+													rtoffset,
+													NUM_EXEC_TLIST(plan));
+				wplan->runconditionorig = fix_scan_list(root,
+														wplan->runconditionorig,
+														rtoffset,
+														NUM_EXEC_TLIST(plan));
 			}
 			break;
 		case T_Result:
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index 4a87114a4f..b4bcc26d6c 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -24,6 +24,7 @@
 #include "nodes/supportnodes.h"
 #include "optimizer/optimizer.h"
 #include "utils/builtins.h"
+#include "utils/lsyscache.h"
 
 
 typedef struct
@@ -791,6 +792,7 @@ int8dec(PG_FUNCTION_ARGS)
 }
 
 
+
 /*
  * These functions are exactly like int8inc/int8dec but are used for
  * aggregates that count only non-null values.  Since the functions are
@@ -818,6 +820,49 @@ int8dec_any(PG_FUNCTION_ARGS)
 	return int8dec(fcinfo);
 }
 
+/*
+ * int8inc_support
+ *		prosupport function for int8inc() and int8inc_any()
+ */
+Datum
+int8inc_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+		MonotonicFunction monotonic = MONOTONICFUNC_NONE;
+		int			frameOptions = req->window_clause->frameOptions;
+
+		/* No ORDER BY clause then all rows are peers */
+		if (req->window_clause->orderClause == NIL)
+			monotonic = MONOTONICFUNC_BOTH;
+		else
+		{
+			/*
+			 * Otherwise take into account the frame options.  When the frame
+			 * bound is the start of the window then the resulting value can
+			 * never decrease, therefore is monotonically increasing
+			 */
+			if (frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING)
+				monotonic |= MONOTONICFUNC_INCREASING;
+
+			/*
+			 * Likewise, if the frame bound is the end of the window then the
+			 * resulting value can never increase.
+			 */
+			if (frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING)
+				monotonic |= MONOTONICFUNC_DECREASING;
+		}
+
+		req->monotonic = monotonic;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 
 Datum
 int8larger(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/adt/windowfuncs.c b/src/backend/utils/adt/windowfuncs.c
index 3e0cc9be1a..596564fa15 100644
--- a/src/backend/utils/adt/windowfuncs.c
+++ b/src/backend/utils/adt/windowfuncs.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "nodes/supportnodes.h"
 #include "utils/builtins.h"
 #include "windowapi.h"
 
@@ -88,6 +89,26 @@ window_row_number(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(curpos + 1);
 }
 
+/*
+ * window_row_number_support
+ *		prosupport function for window_row_number()
+ */
+Datum
+window_row_number_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* row_number() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
 
 /*
  * rank
@@ -110,6 +131,27 @@ window_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_rank_support
+ *		prosupport function for window_rank()
+ */
+Datum
+window_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* rank() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * dense_rank
  * Rank increases by 1 when key columns change.
@@ -130,6 +172,27 @@ window_dense_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_dense_rank_support
+ *		prosupport function for window_dense_rank()
+ */
+Datum
+window_dense_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* dense_rank() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * percent_rank
  * return fraction between 0 and 1 inclusive,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d8e8715ed1..990a5098d5 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6641,11 +6641,16 @@
 # count has two forms: count(any) and count(*)
 { oid => '2147',
   descr => 'number of input rows for which the input expression is not null',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => 'any', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => 'any',
+  prosrc => 'aggregate_dummy' },
 { oid => '2803', descr => 'number of input rows',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => '', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'aggregate_dummy' },
+{ oid => '8802', descr => 'planner support for count run condition',
+  proname => 'int8inc_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'int8inc_support' },
 
 { oid => '2718',
   descr => 'population variance of bigint input values (square of the population standard deviation)',
@@ -10079,14 +10084,26 @@
 
 # SQL-spec window functions
 { oid => '3100', descr => 'row number within partition',
-  proname => 'row_number', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_row_number' },
+  proname => 'row_number', prosupport => 'window_row_number_support',
+  prokind => 'w', proisstrict => 'f',  prorettype => 'int8',
+  proargtypes => '', prosrc => 'window_row_number' },
+{ oid => '8799', descr => 'planner support for row_number run condition',
+  proname => 'window_row_number_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_row_number_support' },
 { oid => '3101', descr => 'integer rank with gaps',
-  proname => 'rank', prokind => 'w', proisstrict => 'f', prorettype => 'int8',
+  proname => 'rank', prosupport => 'window_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8',
   proargtypes => '', prosrc => 'window_rank' },
+{ oid => '8800', descr => 'planner support for rank run condition',
+  proname => 'window_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_rank_support' },
 { oid => '3102', descr => 'integer rank without gaps',
-  proname => 'dense_rank', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_dense_rank' },
+  proname => 'dense_rank', prosupport => 'window_dense_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'window_dense_rank' },
+{ oid => '8801', descr => 'planner support for dense rank run condition',
+  proname => 'window_dense_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_dense_rank_support' },
 { oid => '3103', descr => 'fractional rank within partition',
   proname => 'percent_rank', prokind => 'w', proisstrict => 'f',
   prorettype => 'float8', proargtypes => '', prosrc => 'window_percent_rank' },
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 44dd73fc80..e41850eb9f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2424,6 +2424,10 @@ typedef struct WindowAggState
 	MemoryContext curaggcontext;	/* current aggregate's working data */
 	ExprContext *tmpcontext;	/* short-term evaluation context */
 
+	ExprState  *runcondition;	/* Condition which must remain true otherwise
+								 * execution of the WindowAgg will finish, or
+								 * NULL if there is no such condition. */
+
 	bool		all_first;		/* true if the scan is starting */
 	bool		all_done;		/* true if the scan is finished */
 	bool		partition_spooled;	/* true if all tuples in current partition
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 59737f1034..263aca4419 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -531,7 +531,8 @@ typedef enum NodeTag
 	T_SupportRequestSelectivity,	/* in nodes/supportnodes.h */
 	T_SupportRequestCost,		/* in nodes/supportnodes.h */
 	T_SupportRequestRows,		/* in nodes/supportnodes.h */
-	T_SupportRequestIndexCondition	/* in nodes/supportnodes.h */
+	T_SupportRequestIndexCondition, /* in nodes/supportnodes.h */
+	T_SupportRequestWFuncMonotonic	/* in nodes/supportnodes.h */
 } NodeTag;
 
 /*
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 712e56b5f2..f849977319 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1396,6 +1396,8 @@ typedef struct WindowClause
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runcondition;	/* Exec WindowAgg while this is true */
+	List	   *runconditionorig;	/* EXPLAIN compatible version of above */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
 	Oid			inRangeColl;	/* collation for in_range tests */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 0b518ce6b2..2330b90a03 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -911,6 +911,9 @@ typedef struct WindowAgg
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runcondition;	/* Conditions that must remain true in order
+								 * for execution to continue */
+	List	   *runconditionorig;	/* runcondition for display in EXPLAIN */
 	/* these fields are used with RANGE offset PRECEDING/FOLLOWING: */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
@@ -1309,4 +1312,21 @@ typedef struct PlanInvalItem
 	uint32		hashValue;		/* hash value of object's cache lookup key */
 } PlanInvalItem;
 
+/*
+ * MonotonicFunction
+ *
+ * Allows the planner to track monotonic properties of functions.  A function
+ * is monotonically increasing if a subsequent call cannot yield a lower value
+ * than the previous call.  A monotonically decreasing function cannot yield a
+ * higher value on subsequent calls, and a function which is both must return
+ * the same value on each call.
+ */
+typedef enum MonotonicFunction
+{
+	MONOTONICFUNC_NONE = 0,
+	MONOTONICFUNC_INCREASING = (1 << 0),
+	MONOTONICFUNC_DECREASING = (1 << 1),
+	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING
+} MonotonicFunction;
+
 #endif							/* PLANNODES_H */
diff --git a/src/include/nodes/supportnodes.h b/src/include/nodes/supportnodes.h
index 88b61b3ab3..0538a45094 100644
--- a/src/include/nodes/supportnodes.h
+++ b/src/include/nodes/supportnodes.h
@@ -33,12 +33,12 @@
 #ifndef SUPPORTNODES_H
 #define SUPPORTNODES_H
 
-#include "nodes/primnodes.h"
+#include "nodes/plannodes.h"
 
 struct PlannerInfo;				/* avoid including pathnodes.h here */
 struct IndexOptInfo;
 struct SpecialJoinInfo;
-
+struct WindowClause;
 
 /*
  * The Simplify request allows the support function to perform plan-time
@@ -239,4 +239,61 @@ typedef struct SupportRequestIndexCondition
 								 * equivalent of the function call */
 } SupportRequestIndexCondition;
 
+/* ----------
+ * To allow more efficient execution of any monotonically increasing and/or
+ * monotonically decreasing window functions, we support calling the window
+ * function's prosupport function passing along this struct whenever the
+ * planner sees an OpExpr qual directly reference a window function in a
+ * subquery.  When the planner encounters this, we populate this struct and
+ * pass it along to the window function's prosupport function so that it can
+ * evaluate if the given WindowFunc is;
+ *
+ * a) monotonically increasing, or
+ * b) monotonically decreasing, or
+ * c) both monotonically increasing and decreasing, or
+ * d) none of the above.
+ *
+ * A function that is monotonically increasing can never return a value that
+ * is lower than a value returned in a "previous call".  A monotonically
+ * decreasing function can never return a value higher than a value returned
+ * in a previous call.  A function that is both must return the same value
+ * each time.
+ *
+ * We define "previous call" to mean a previous call to the same WindowFunc
+ * struct in the same window partition.
+ *
+ * row_number() is an example of a monotonically increasing function.  The
+ * return value will be reset back to 1 in each new partition.  An example of
+ * a monotonically increasing and decreasing function is COUNT(*) OVER ().
+ * Since there is no ORDER BY clause in this example, all rows in the
+ * partition are peers and all rows within the partition will be within the
+ * frame bound.  Likewise for COUNT(*) OVER(ORDER BY a ROWS BETWEEN UNBOUNDED
+ * PRECEDING AND UNBOUNDED FOLLOWING).
+ *
+ * COUNT(*) OVER (ORDER BY a ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
+ * is an example of a monotonically decreasing function.
+ *
+ * Inputs:
+ *	'window_func' is the pointer to the window function being called.
+ *
+ *	'window_clause' pointer to the WindowClause data.  Support functions can
+ *	use this to check frame bounds and partition by clauses, etc.
+ *
+ * Outputs:
+ *	'monotonic' the resulting MonotonicFunction value for the given input
+ *	window function and window clause.
+ * ----------
+ */
+typedef struct SupportRequestWFuncMonotonic
+{
+	NodeTag		type;
+
+	/* Input fields: */
+	WindowFunc *window_func;	/* Pointer to the window function data */
+	struct WindowClause *window_clause; /* Pointer to the window clause data */
+
+	/* Output fields: */
+	MonotonicFunction monotonic;
+} SupportRequestWFuncMonotonic;
+
 #endif							/* SUPPORTNODES_H */
diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out
index bb9ff7f07b..fdf053ac83 100644
--- a/src/test/regress/expected/window.out
+++ b/src/test/regress/expected/window.out
@@ -3336,6 +3336,321 @@ WHERE depname = 'sales';
                            ->  Seq Scan on empsalary
 (9 rows)
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                  QUERY PLAN                  
+----------------------------------------------
+ WindowAgg
+   Run Condition: (row_number() OVER (?) < 3)
+   ->  Sort
+         Sort Key: empsalary.empno
+         ->  Seq Scan on empsalary
+(5 rows)
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+               QUERY PLAN                
+-----------------------------------------
+ WindowAgg
+   Run Condition: (rank() OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+ empno | salary | r 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 2
+    11 |   5200 | 2
+(3 rows)
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Run Condition: (dense_rank() OVER (?) <= 1)
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+ empno | salary | dr 
+-------+--------+----
+     8 |   6000 |  1
+(1 row)
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                       QUERY PLAN                        
+---------------------------------------------------------
+ WindowAgg
+   Run Condition: (count(empsalary.empno) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) >= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER () c
+   FROM empsalary) emp
+WHERE 11 <= c;
+                 QUERY PLAN                 
+--------------------------------------------
+ WindowAgg
+   Run Condition: (11 <= count(*) OVER (?))
+   ->  Seq Scan on empsalary
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Run Condition: (dense_rank() OVER (?) <= 1)
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+-- Ensure we don't push down the clause when there is a PARTITION BY clause
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.rn < 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.depname, empsalary.empno
+               ->  Seq Scan on empsalary
+(6 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.c <= 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.depname, empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(6 rows)
+
+-- Ensure we don't push down when the frame options show that the window
+-- function is not monotonically increasing
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                  QUERY PLAN                   
+-----------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.c <= 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(6 rows)
+
+-- Ensure we don't push down when the window function's monotonic properties
+-- don't match that of the clauses.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+                QUERY PLAN                
+------------------------------------------
+ Subquery Scan on emp
+   Filter: (3 <= emp.c)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary
+               ->  Seq Scan on empsalary
+(6 rows)
+
+-- Ensure we don't pushdown when there are multiple window clauses to evaluate
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY empno DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                        QUERY PLAN                         
+-----------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.empno DESC
+               ->  WindowAgg
+                     ->  Sort
+                           Sort Key: empsalary.salary DESC
+                           ->  Seq Scan on empsalary
+(9 rows)
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
diff --git a/src/test/regress/sql/window.sql b/src/test/regress/sql/window.sql
index 41a8e0d152..094045b873 100644
--- a/src/test/regress/sql/window.sql
+++ b/src/test/regress/sql/window.sql
@@ -988,6 +988,171 @@ SELECT * FROM
    FROM empsalary) emp
 WHERE depname = 'sales';
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER () c
+   FROM empsalary) emp
+WHERE 11 <= c;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+
+-- Ensure we don't push down the clause when there is a PARTITION BY clause
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+-- Ensure we don't push down when the frame options show that the window
+-- function is not monotonically increasing
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+-- Ensure we don't push down when the window function's monotonic properties
+-- don't match that of the clauses.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+
+-- Ensure we don't pushdown when there are multiple window clauses to evaluate
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY empno DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
-- 
2.32.0
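
As a rough illustration of the COUNT(*) examples in the supportnodes.h
comment, here is a small standalone C sketch. It is not code from the
patch. It assumes ROWS mode frames whose bounds are either unbounded or
CURRENT ROW, and the names rows_frame_count, observed_monotonicity and
the MONO_* flags are made up for this example. It counts the rows in
each frame and records which monotonic properties actually held over
the partition.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags, loosely mirroring the patch's MonotonicFunction enum. */
enum
{
	MONO_NONE = 0,
	MONO_INCREASING = 1 << 0,
	MONO_DECREASING = 1 << 1
};

/*
 * Number of rows in a ROWS-mode frame for the row at 0-based position 'pos'
 * in a partition of 'nrows' rows.  A bound that is not unbounded is assumed
 * to be CURRENT ROW.
 */
static long
rows_frame_count(long nrows, long pos, bool start_unbounded, bool end_unbounded)
{
	long		first;
	long		last;

	assert(pos >= 0 && pos < nrows);

	first = start_unbounded ? 0 : pos;
	last = end_unbounded ? nrows - 1 : pos;

	return last - first + 1;
}

/* Walk the partition in order and report which properties were observed. */
static int
observed_monotonicity(long nrows, bool start_unbounded, bool end_unbounded)
{
	int			result = MONO_INCREASING | MONO_DECREASING;
	long		pos;

	for (pos = 1; pos < nrows; pos++)
	{
		long		prev = rows_frame_count(nrows, pos - 1,
											start_unbounded, end_unbounded);
		long		cur = rows_frame_count(nrows, pos,
										   start_unbounded, end_unbounded);

		if (cur < prev)
			result &= ~MONO_INCREASING; /* value went down */
		if (cur > prev)
			result &= ~MONO_DECREASING; /* value went up */
	}

	return result;
}
```

For a 5-row partition, UNBOUNDED PRECEDING to CURRENT ROW gives counts
1..5 (increasing only), CURRENT ROW to UNBOUNDED FOLLOWING gives 5..1
(decreasing only), and unbounded on both sides gives a constant 5
(both).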

#13

David Rowley

dgrowleyml@gmail.com

almost 4 years ago

In reply to: Greg Stark (#9)

Re: Window Function "Run Conditions"

On Wed, 16 Mar 2022 at 10:24, Greg Stark <stark@mit.edu> wrote:

This looks like an awesome addition.

Thanks

I have one technical question...

Is it possible to actually transform the row_number case into a LIMIT
clause or make the planner support for this case equivalent to it (in
which case we can replace the LIMIT clause planning to transform into
a window function)?

Currently, I have only coded it to support monotonically increasing
and decreasing functions. Putting a <= <const> type condition on a
row_number() function with no PARTITION BY clause I think is logically
the same as a LIMIT clause, but that's not the case for rank() and
dense_rank(). There may be multiple peer rows with the same rank in
those cases. We'd have no way to know what the LIMIT should be set to.
I don't really want to just do this for row_number().
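
To make the peer-row point concrete, here is a minimal standalone C
sketch (an illustration only, nothing from the patch). The function
names and the idea of passing in an already sorted array of ORDER BY
keys are assumptions made for the example. It counts how many rows
satisfy row_number() <= N versus rank() <= N.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Rows satisfying row_number() <= limit in a partition of 'nrows' rows.
 * This is always min(limit, nrows), which is why it behaves like LIMIT.
 */
static size_t
rows_with_row_number_le(size_t nrows, long limit)
{
	if (limit <= 0)
		return 0;
	return ((size_t) limit < nrows) ? (size_t) limit : nrows;
}

/*
 * Rows satisfying rank() <= limit.  'sorted' holds the ORDER BY key of each
 * row, already in window order.  Peer rows (equal keys) share a rank, so the
 * answer depends on the data and not just on 'limit'.
 */
static size_t
rows_with_rank_le(const int *sorted, size_t nrows, long limit)
{
	size_t		i;
	long		rank = 0;

	assert(sorted != NULL || nrows == 0);

	for (i = 0; i < nrows; i++)
	{
		/* rank() jumps to the row's position when the key changes */
		if (i == 0 || sorted[i] != sorted[i - 1])
			rank = (long) i + 1;

		/* rank() never goes back down, so we can stop here */
		if (rank > limit)
			break;
	}

	return i;
}
```

With made-up sorted keys 6000, 5200, 5200, 5200, 5000, the ranks are
1, 2, 2, 2, 5, so rank() <= 2 matches 4 rows while row_number() <= 2
matches 2.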

The reason I ask is because the Limit plan node is actually quite a
bit more optimized than the general window function plan node. It
calculates cost estimates based on the limit and can support Top-N
sort nodes.

This is true. There's perhaps no reason why an additional property
could not be added to allow the prosupport function to optionally set
*exactly* the maximum number of rows that could match the condition.
e.g. for select * from (select *,row_number() over (order by c) rn
from ..) w where rn <= 10; that could be set to 10, and if we used
rank() instead of row_number(), it could just be left unset.

I think this is probably worth thinking about at some future date. I
don't really want to make it part of this effort. I also don't think
I'm doing anything here that would need to be undone to make that
work.

David

#14

David Rowley

dgrowleyml@gmail.com

almost 4 years ago

In reply to: Corey Huinker (#10)

Re: Window Function "Run Conditions"

On Thu, 17 Mar 2022 at 17:04, Corey Huinker <corey.huinker@gmail.com> wrote:

It seems like this effort would aid in implementing what some other databases implement via the QUALIFY clause, which is to window functions what HAVING is to aggregate functions.
example: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#qualify_clause

Isn't that just syntactic sugar? You could get the same result by
wrapping the query in a subquery and adding a WHERE clause that filters
rows after the window functions have been evaluated.

David

#15

Zhihong Yu

zyu@yugabyte.com

almost 4 years ago

In reply to: David Rowley (#12)

Re: Window Function "Run Conditions"

On Tue, Mar 22, 2022 at 3:24 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 26 Aug 2021 at 14:54, Andy Fan <zhihui.fan1213@gmail.com> wrote:

On Thu, Aug 19, 2021 at 2:35 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 19 Aug 2021 at 00:20, Andy Fan <zhihui.fan1213@gmail.com> wrote:

In the current master, the result is:

 empno | salary | c | dr
-------+--------+---+----
     8 |   6000 | 4 |  1

In the patched version, the result is:

 empno | salary | c | dr
-------+--------+---+----
     8 |   6000 | 1 |  1

Thanks for taking it for a spin.

That's a bit unfortunate. I don't immediately see how to fix it other
than to restrict the optimisation to only apply when there's a single
WindowClause. It might be possible to relax it further and only apply
if it's the final window clause to be evaluated, but in those cases,
the savings are likely to be much less anyway as some previous
WindowAgg will have exhausted all rows from its subplan.
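
For anyone wanting to see why c changes, here is a rough standalone C
simulation of the two stacked WindowAgg nodes. It is only a sketch
under assumptions: the Row struct, the function name and the in-memory
sorting are all invented for the example and bear no relation to the
executor code. The lower node computes dense_rank() OVER (ORDER BY
salary DESC) and the upper node computes count(*) OVER (ORDER BY empno
DESC) on whatever rows the lower node emitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct Row
{
	int			empno;
	int			salary;
	long		dr;				/* dense_rank() result */
	long		c;				/* count(*) result */
} Row;

static int
by_salary_desc(const void *a, const void *b)
{
	const Row  *ra = (const Row *) a;
	const Row  *rb = (const Row *) b;

	return (ra->salary < rb->salary) - (ra->salary > rb->salary);
}

static int
by_empno_desc(const void *a, const void *b)
{
	const Row  *ra = (const Row *) a;
	const Row  *rb = (const Row *) b;

	return (ra->empno < rb->empno) - (ra->empno > rb->empno);
}

/*
 * Returns c for the first row with dr = 1, or -1 if there is none.  When
 * 'use_run_condition' is true the lower node stops as soon as dr > 1.
 * Assumes empno is unique, so count(*) has no peer rows to account for.
 */
static long
count_for_top_salary(Row *rows, size_t nrows, bool use_run_condition)
{
	size_t		i;
	size_t		nout = 0;
	long		dr = 0;

	assert(rows != NULL);

	/* Lower WindowAgg: dense_rank() OVER (ORDER BY salary DESC) */
	qsort(rows, nrows, sizeof(Row), by_salary_desc);
	for (i = 0; i < nrows; i++)
	{
		if (i == 0 || rows[i].salary != rows[i - 1].salary)
			dr++;
		if (use_run_condition && dr > 1)
			break;				/* run condition dr <= 1 became false */
		rows[i].dr = dr;
		nout++;
	}

	/* Upper WindowAgg: only sees the 'nout' rows emitted below it */
	qsort(rows, nout, sizeof(Row), by_empno_desc);
	for (i = 0; i < nout; i++)
		rows[i].c = (long) i + 1;

	/* Outer filter: dr = 1 */
	for (i = 0; i < nout; i++)
	{
		if (rows[i].dr == 1)
			return rows[i].c;
	}

	return -1;
}
```

With four illustrative rows (empno 8 to 11, where empno 8 has the
highest salary), the function returns 4 without the run condition and
1 with it, because the upper count(*) never gets to see the rows that
the lower node stopped short of emitting.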

I am trying to hack the select_active_windows function to make sure
the WindowClause with the run condition is the last one to be
evaluated (we also need to consider the case where more than one
window function has a run condition), at which point the run condition
clause is already available.

However, there are two problems with this direction: a) it may
conflict with "the windows that need the same sorting are adjacent in
the list", and b) "when two or more windows are order-equivalent then
all peer rows must be presented in the same order in all of them. ..
(See General Rule 4 of <window clause> in SQL2008 - SQL2016.)"

In summary, I am not sure if it is correct to change the execution
order of WindowAgg freely.

Thanks for looking at that.

My current thoughts are that it just feels a little too risky to
adjust the comparison function that sorts the window clauses to pay
attention to the run-condition.

We would need to ensure that there's just a single window function
with a run condition as it wouldn't be valid for there to be multiple.
It would be easy enough to ensure we only push quals into just 1
window clause, but that and meddling with the evaluation order has
trade-offs. To do that properly, we'd likely want to consider the
costs when deciding which window clause would benefit from having
quals pushed the most. Plus, if we start messing with the evaluation
order then we'd likely really want some sort of costing to check if
pushing a qual and adjusting the evaluation order is actually cheaper
than not pushing the qual and keeping the clauses in the order that
requires the minimum number of sorts. The planner is not really
geared up for costing things like that properly, which is why we just
assume the order with the least sorts is best. In reality that's often
not going to be true, as an index may exist and we might want to
evaluate a clause first if that let us get rid of a sort and use an
index scan instead. Sorting the window clauses based on their
SortGroupClause is just the best we can do for now at that stage in
planning.

I think it's safer to just disable the optimisation when there are
multiple window clauses. Multiple matching clauses are merged
already, so it's perfectly valid to have multiple window functions,
it's just they must share the same window clause. I don't think
that's terrible as with the major use case that I have in mind for
this, the window function is only added to limit the number of rows.
In most cases I can imagine, there'd be no reason to have an
additional window function with different frame options.

I've attached an updated patch.

Hi,
The following code seems to be common between if / else blocks (w.r.t.
wfunc_left) of find_window_run_conditions().

+               op = get_opfamily_member(opinfo->opfamily_id,
+                                        opinfo->oplefttype,
+                                        opinfo->oprighttype,
+                                        newstrategy);
+
+               newopexpr = (OpExpr *) make_opclause(op,
+                                                    opexpr->opresulttype,
+                                                    opexpr->opretset,
+                                                    otherexpr,
+                                                    (Expr *) wfunc,
+                                                    opexpr->opcollid,
+                                                    opexpr->inputcollid);
+               newopexpr->opfuncid = get_opcode(op);
+
+               *keep_original = true;
+               runopexpr = newopexpr;

It would be nice if this code can be shared.

+           WindowClause *wclause = (WindowClause *)
+           list_nth(subquery->windowClause,
+                    wfunc->winref - 1);

The code would be more readable if list_nth() is indented.

+ /* Check the left side of the OpExpr */

It seems the code for checking left / right is the same. It would be better
to extract and reuse the code.

Cheers

#16

David Rowley

dgrowleyml@gmail.com

almost 4 years ago

In reply to: Zhihong Yu (#15)

1 attachment(s)

Re: Window Function "Run Conditions"

On Wed, 23 Mar 2022 at 12:50, Zhihong Yu <zyu@yugabyte.com> wrote:

The following code seems to be common between if / else blocks (w.r.t. wfunc_left) of find_window_run_conditions().

It would be nice if this code can be shared.

I remember thinking about that and thinking that I didn't want to
overcomplicate the if conditions for the strategy tests. I'd thought
these would have become:

if ((wfunc_left && (strategy == BTLessStrategyNumber ||
                    strategy == BTLessEqualStrategyNumber)) ||
    (!wfunc_left && (strategy == BTGreaterStrategyNumber ||
                     strategy == BTGreaterEqualStrategyNumber)))

which I didn't think was very readable. That caused me to keep it separate.

On reflection, we can just leave the strategy checks as they are, then
add the additional code for checking wfunc_left when checking the
res->monotonic, i.e.:

if ((wfunc_left && (res->monotonic & MONOTONICFUNC_INCREASING)) ||
    (!wfunc_left && (res->monotonic & MONOTONICFUNC_DECREASING)))

I think that's more readable than doubling up the strategy checks, so
I've done it that way in the attached.
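
As a rough standalone sketch of how the two checks combine (not the patch itself): the MONO_* flags and CmpStrategy values below are stand-ins for the patch's MONOTONICFUNC_* flags and the btree strategy numbers, their numeric values are assumptions, and qual_usable_as_run_condition() is a hypothetical name. Equality is left out because the patch handles it by building a new qual.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the patch's MonotonicFunction flags (values assumed) */
enum
{
	MONO_NONE = 0,
	MONO_INCREASING = 1 << 0,
	MONO_DECREASING = 1 << 1,
	MONO_BOTH = MONO_INCREASING | MONO_DECREASING
};

/* Stand-ins for the btree strategy numbers */
typedef enum
{
	CMP_LT, CMP_LE, CMP_EQ, CMP_GE, CMP_GT
} CmpStrategy;

/*
 * Return true if "<wfunc> op <const>" (wfunc_left) or "<const> op <wfunc>"
 * (!wfunc_left) could be used unchanged as a run condition.
 */
bool
qual_usable_as_run_condition(CmpStrategy strategy, bool wfunc_left,
							 int monotonic)
{
	/* only the two direction bits are meaningful in this model */
	assert((monotonic & ~MONO_BOTH) == 0);

	switch (strategy)
	{
		case CMP_LT:
		case CMP_LE:
			return (wfunc_left && (monotonic & MONO_INCREASING)) ||
				(!wfunc_left && (monotonic & MONO_DECREASING));
		case CMP_GT:
		case CMP_GE:
			return (wfunc_left && (monotonic & MONO_DECREASING)) ||
				(!wfunc_left && (monotonic & MONO_INCREASING));
		default:
			/* equality needs a rewritten qual; not modelled here */
			return false;
	}
}
```

For an increasing function such as row_number(), this accepts rn <= 10 and 10 >= rn, and rejects 10 <= rn and rn >= 10. The strategy tests stay as they were and only the monotonic test looks at wfunc_left.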

+           WindowClause *wclause = (WindowClause *)
+           list_nth(subquery->windowClause,
+                    wfunc->winref - 1);

The code would be more readable if list_nth() is indented.

That's just the way pgindent put it.

+ /* Check the left side of the OpExpr */

It seems the code for checking left / right is the same. It would be better to extract and reuse the code.

I've moved some of that code into find_window_run_conditions() which
removes about 10 lines of code.

Updated patch attached. Thanks for looking.

David

Attachments:

v3-0001-Teach-planner-and-executor-about-monotonic-window.patch (text/plain; charset=US-ASCII)

From db393b01ce6dd48f3a49d367a28d7804510bd1f6 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Sat, 8 May 2021 23:43:25 +1200
Subject: [PATCH v3] Teach planner and executor about monotonic window funcs

Window functions such as row_number() always return a value higher than
the one previously in any given partition.  If a query were to only
request the first few row numbers, then traditionally we would continue
evaluating the WindowAgg node until all tuples are exhausted.  However, if
someone, say, only wanted all row numbers <= 10, then we could just stop
once we get to number 11.

Here we implement means to do just that.  This is done by way of adding a
prosupport function to various of the built-in window functions and adding
supporting code to allow the support function to inform the planner if
the function is monotonically increasing, monotonically decreasing, both
or neither.  The planner is then able to make use of that information and
possibly allow the executor to short-circuit execution by way of adding a
"run condition" to the WindowAgg to instruct it to run while this
condition is true and stop when it becomes false.

This optimization is only possible when the subquery contains only a
single window clause.  The problem with having multiple clauses is that at
the time when we're looking for run conditions, we don't yet know the
order that the window clauses will be evaluated.  We cannot add a run
condition to a window clause that is evaluated first as we may stop
execution and not output rows that are required for a window clause which
will be evaluated later.

Here we add prosupport functions to allow this to work for row_number(),
rank(), dense_rank(), count(*) and count(expr).

Author: David Rowley
Reviewed-by: Andy Fan, Zhihong Yu
Discussion: https://postgr.es/m/CAApHDvqvp3At8++yF8ij06sdcoo1S_b2YoaT9D4Nf+MObzsrLQ@mail.gmail.com
---
 src/backend/commands/explain.c          |   4 +
 src/backend/executor/nodeWindowAgg.c    |  28 +-
 src/backend/nodes/copyfuncs.c           |   4 +
 src/backend/nodes/equalfuncs.c          |   2 +
 src/backend/nodes/outfuncs.c            |   4 +
 src/backend/nodes/readfuncs.c           |   4 +
 src/backend/optimizer/path/allpaths.c   | 327 +++++++++++++++++++++++-
 src/backend/optimizer/plan/createplan.c |   7 +-
 src/backend/optimizer/plan/setrefs.c    |   8 +
 src/backend/utils/adt/int8.c            |  45 ++++
 src/backend/utils/adt/windowfuncs.c     |  63 +++++
 src/include/catalog/pg_proc.dat         |  35 ++-
 src/include/nodes/execnodes.h           |   4 +
 src/include/nodes/nodes.h               |   3 +-
 src/include/nodes/parsenodes.h          |   2 +
 src/include/nodes/plannodes.h           |  20 ++
 src/include/nodes/supportnodes.h        |  61 ++++-
 src/test/regress/expected/window.out    | 315 +++++++++++++++++++++++
 src/test/regress/sql/window.sql         | 165 ++++++++++++
 19 files changed, 1083 insertions(+), 18 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9f632285b6..92ca4acf50 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1985,6 +1985,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_WindowAgg:
+			show_upper_qual(((WindowAgg *) plan)->runconditionorig,
+							"Run Condition", planstate, ancestors, es);
+			break;
 		case T_Group:
 			show_group_keys(castNode(GroupState, planstate), ancestors, es);
 			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 08ce05ca5a..3e4f00e162 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -2023,6 +2023,7 @@ static TupleTableSlot *
 ExecWindowAgg(PlanState *pstate)
 {
 	WindowAggState *winstate = castNode(WindowAggState, pstate);
+	TupleTableSlot *slot;
 	ExprContext *econtext;
 	int			i;
 	int			numfuncs;
@@ -2235,7 +2236,20 @@ ExecWindowAgg(PlanState *pstate)
 	 */
 	econtext->ecxt_outertuple = winstate->ss.ss_ScanTupleSlot;
 
-	return ExecProject(winstate->ss.ps.ps_ProjInfo);
+	slot = ExecProject(winstate->ss.ps.ps_ProjInfo);
+
+	/*
+	 * Now evaluate the run condition to see if we need to continue further
+	 * with execution.
+	 */
+	econtext->ecxt_scantuple = slot;
+	if (!ExecQual(winstate->runcondition, econtext))
+	{
+		winstate->all_done = true;
+		return NULL;
+	}
+
+	return slot;
 }
 
 /* -----------------
@@ -2307,6 +2321,18 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 	Assert(node->plan.qual == NIL);
 	winstate->ss.ps.qual = NULL;
 
+	/*
+	 * Setup the run condition, if we received one from the query planner.
+	 * When set, this can allow us to finish execution early because we know
+	 * some higher-level filter exists that would just filter out any further
+	 * results that we produce.
+	 */
+	if (node->runcondition != NIL)
+		winstate->runcondition = ExecInitQual(node->runcondition,
+											  (PlanState *) winstate);
+	else
+		winstate->runcondition = NULL;
+
 	/*
 	 * initialize child nodes
 	 */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d4f8455a2b..e2684a5dd0 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1102,6 +1102,8 @@ _copyWindowAgg(const WindowAgg *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runcondition);
+	COPY_NODE_FIELD(runconditionorig);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
@@ -2579,6 +2581,8 @@ _copyWindowClause(const WindowClause *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runcondition);
+	COPY_NODE_FIELD(runconditionorig);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index f1002afe7a..149a9b1d30 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2879,6 +2879,8 @@ _equalWindowClause(const WindowClause *a, const WindowClause *b)
 	COMPARE_SCALAR_FIELD(frameOptions);
 	COMPARE_NODE_FIELD(startOffset);
 	COMPARE_NODE_FIELD(endOffset);
+	COMPARE_NODE_FIELD(runcondition);
+	COMPARE_NODE_FIELD(runconditionorig);
 	COMPARE_SCALAR_FIELD(startInRangeFunc);
 	COMPARE_SCALAR_FIELD(endInRangeFunc);
 	COMPARE_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 6bdad462c7..349da28758 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -827,6 +827,8 @@ _outWindowAgg(StringInfo str, const WindowAgg *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runcondition);
+	WRITE_NODE_FIELD(runconditionorig);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
@@ -3148,6 +3150,8 @@ _outWindowClause(StringInfo str, const WindowClause *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runcondition);
+	WRITE_NODE_FIELD(runconditionorig);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 3f68f7c18d..4882f2b72a 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -382,6 +382,8 @@ _readWindowClause(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runcondition);
+	READ_NODE_FIELD(runconditionorig);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
@@ -2347,6 +2349,8 @@ _readWindowAgg(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runcondition);
+	READ_NODE_FIELD(runconditionorig);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 169b1d53fc..5eb4343e5b 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/supportnodes.h"
 #ifdef OPTIMIZER_DEBUG
 #include "nodes/print.h"
 #endif
@@ -2157,6 +2158,313 @@ has_multiple_baserels(PlannerInfo *root)
 	return false;
 }
 
+/*
+ * find_window_run_conditions
+ *		Determine if 'wfunc' is really a WindowFunc and call its prosupport
+ *		function to determine the functions monotonic properties.  We then
+ *		see if 'opexpr' can be used to stop processing WindowAgg nodes early.
+ *
+ * For example row_number() over (order by ...) always produces a value one
+ * higher than the previous.  If someone has a window function such as that
+ * in a subquery and just wants, say all rows with a row number less than or
+ * equal to 10, then we may as well stop processing the windowagg once the row
+ * number reaches 11.  Here we look for opexprs that might help us to stop
+ * doing needless extra processing in WindowAgg nodes.
+ *
+ * To do this we make use of the window function's prosupport function to
+ * determine if the given window function with the given window clause is a
+ * monotonically increasing or decreasing function.
+ *
+ * '*keep_original' is set to true if the caller should also use 'opexpr' for
+ * its original purpose.  This is set to false if the caller can assume that
+ * the run condition will handle all of the required filtering or if the given
+ * 'opexpr' is not supported.
+ *
+ * Returns true if a run condition qual was found and added to the
+ * appropriate WindowClause 'runcondition' list and sets *keep_original
+ * accordingly.  Returns false if wfunc is not a WindowFunc or we're unable to
+ * use 'opexpr' as a run condition.
+ */
+static bool
+find_window_run_conditions(Query *subquery, RangeTblEntry *rte, Index rti,
+						   AttrNumber attno, WindowFunc *wfunc, OpExpr *opexpr,
+						   bool wfunc_left, bool *keep_original)
+{
+	Oid			prosupport;
+	Expr	   *otherexpr;
+	SupportRequestWFuncMonotonic req;
+	SupportRequestWFuncMonotonic *res;
+	WindowClause *wclause;
+	List	   *opinfos;
+	OpExpr	   *runopexpr;
+	ListCell   *lc;
+
+	*keep_original = true;
+
+	while (IsA(wfunc, RelabelType))
+		wfunc = (WindowFunc *) ((RelabelType *) wfunc)->arg;
+
+	/* we can only work with window functions */
+	if (!IsA(wfunc, WindowFunc))
+		return false;
+
+	/* find the window clause belonging to the window function */
+	wclause = (WindowClause *) list_nth(subquery->windowClause,
+										wfunc->winref - 1);
+
+	/*
+	 * Currently the WindowAgg node just stops when the run condition is no
+	 * longer true.  If there is a PARTITION BY clause then we cannot just
+	 * stop as other partitions still need to be processed.
+	 */
+	if (wclause->partitionClause != NIL)
+		return false;
+
+	prosupport = get_func_support(wfunc->winfnoid);
+
+	/* Check if there's a support function for 'wfunc' */
+	if (!OidIsValid(prosupport))
+		return false;
+
+	/* get the Expr from the other side of the OpExpr */
+	if (wfunc_left)
+		otherexpr = lsecond(opexpr->args);
+	else
+		otherexpr = linitial(opexpr->args);
+
+	/*
+	 * The value being compared must not change during the evaluation of the
+	 * window partition.
+	 */
+	if (!is_pseudo_constant_clause((Node *) otherexpr))
+		return false;
+
+	req.type = T_SupportRequestWFuncMonotonic;
+	req.window_func = wfunc;
+	req.window_clause = wclause;
+
+	/* call the support function */
+	res = (SupportRequestWFuncMonotonic *)
+		DatumGetPointer(OidFunctionCall1(prosupport,
+										 PointerGetDatum(&req)));
+
+	/*
+	 * Nothing to do if the function is neither monotonically increasing or
+	 * monotonically decreasing.
+	 */
+	if (res == NULL || res->monotonic == MONOTONICFUNC_NONE)
+		return false;
+
+	runopexpr = NULL;
+	opinfos = get_op_btree_interpretation(opexpr->opno);
+
+	foreach(lc, opinfos)
+	{
+		OpBtreeInterpretation *opinfo = (OpBtreeInterpretation *) lfirst(lc);
+		int			strategy = opinfo->strategy;
+
+		/* handle < / <= */
+		if (strategy == BTLessStrategyNumber ||
+			strategy == BTLessEqualStrategyNumber)
+		{
+			/*
+			 * < / <= is supported for monotonically increasing functions in
+			 * the form <wfunc> op <pseudoconst> and <pseudoconst> op <wfunc>
+			 * for monotonically decreasing functions.
+			 */
+			if ((wfunc_left && (res->monotonic & MONOTONICFUNC_INCREASING)) ||
+				(!wfunc_left && (res->monotonic & MONOTONICFUNC_DECREASING)))
+			{
+				*keep_original = false;
+				runopexpr = opexpr;
+			}
+			break;
+		}
+		/* handle > / >= */
+		else if (strategy == BTGreaterStrategyNumber ||
+				 strategy == BTGreaterEqualStrategyNumber)
+		{
+			/*
+			 * > / >= is supported for monotonically decreasing functions in
+			 * the form <wfunc> op <pseudoconst> and <pseudoconst> op <wfunc>
+			 * for monotonically increasing functions.
+			 */
+			if ((wfunc_left && (res->monotonic & MONOTONICFUNC_DECREASING)) ||
+				(!wfunc_left && (res->monotonic & MONOTONICFUNC_INCREASING)))
+			{
+				*keep_original = false;
+				runopexpr = opexpr;
+			}
+			break;
+		}
+		/* handle = */
+		else if (strategy == BTEqualStrategyNumber)
+		{
+			OpExpr	   *newopexpr;
+			Oid			op;
+			int16		newstrategy;
+
+			/*
+			 * When both monotonically increasing and decreasing then the
+			 * return value of the window function will be the same each time.
+			 * We can simply use 'opexpr' as the run condition without
+			 * modifying it.
+			 */
+			if ((res->monotonic & MONOTONICFUNC_BOTH) == MONOTONICFUNC_BOTH)
+			{
+				*keep_original = false;
+				runopexpr = opexpr;
+				break;
+			}
+
+			/*
+			 * When monotonically increasing we make a qual with <wfunc> <=
+			 * <value> or <value> >= <wfunc> in order to filter out values
+			 * which are above the value in the equality condition.  For
+			 * monotonically decreasing functions we want to filter values
+			 * below the value in the equality condition.
+			 */
+			if (res->monotonic & MONOTONICFUNC_INCREASING)
+				newstrategy = wfunc_left ? BTLessEqualStrategyNumber : BTGreaterEqualStrategyNumber;
+			else
+				newstrategy = wfunc_left ? BTGreaterEqualStrategyNumber : BTLessEqualStrategyNumber;
+
+			op = get_opfamily_member(opinfo->opfamily_id,
+									 opinfo->oplefttype,
+									 opinfo->oprighttype,
+									 newstrategy);
+
+			/*
+			 * Build a OpExpr with the window function on the same side as it
+			 * was in the original OpExpr.
+			 */
+			if (wfunc_left)
+				newopexpr = (OpExpr *) make_opclause(op,
+													 opexpr->opresulttype,
+													 opexpr->opretset,
+													 (Expr *) wfunc,
+													 otherexpr,
+													 opexpr->opcollid,
+													 opexpr->inputcollid);
+			else
+				newopexpr = (OpExpr *) make_opclause(op,
+													 opexpr->opresulttype,
+													 opexpr->opretset,
+													 otherexpr,
+													 (Expr *) wfunc,
+													 opexpr->opcollid,
+													 opexpr->inputcollid);
+
+			newopexpr->opfuncid = get_opcode(op);
+
+			*keep_original = true;
+			runopexpr = newopexpr;
+			break;
+		}
+	}
+
+	if (runopexpr != NULL)
+	{
+		Expr	   *origexpr;
+
+		wclause->runcondition = lappend(wclause->runcondition, runopexpr);
+
+		/* also create a version of the qual that we can display in EXPLAIN */
+		if (wfunc_left)
+			origexpr = make_opclause(runopexpr->opno,
+									 runopexpr->opresulttype,
+									 runopexpr->opretset, (Expr *) wfunc,
+									 otherexpr, runopexpr->opcollid,
+									 runopexpr->inputcollid);
+		else
+			origexpr = make_opclause(runopexpr->opno,
+									 runopexpr->opresulttype,
+									 runopexpr->opretset,
+									 otherexpr, (Expr *) wfunc,
+									 runopexpr->opcollid,
+									 runopexpr->inputcollid);
+
+		wclause->runconditionorig = lappend(wclause->runconditionorig, origexpr);
+		return true;
+	}
+
+	/* unsupported OpExpr */
+	return false;
+}
+
+/*
+ * check_and_push_window_quals
+ *		Check if 'rinfo' is a qual that can be pushed into a WindowFunc as a
+ *		'runcondition' qual.  These, when present, cause the window function
+ *		evaluation to stop when the condition becomes false.
+ *
+ * Returns true if the caller still must keep the original qual or false if
+ * the caller can safely ignore the original qual because the window function
+ * will use the run condition to stop at the right time.
+ */
+static bool
+check_and_push_window_quals(Query *subquery, RangeTblEntry *rte, Index rti,
+							Node *clause)
+{
+	OpExpr	   *opexpr = (OpExpr *) clause;
+	bool		keep_original = true;
+	Var		   *var1;
+	Var		   *var2;
+
+	/* We're only able to use OpExprs with 2 operands */
+	if (!IsA(opexpr, OpExpr))
+		return true;
+
+	if (list_length(opexpr->args) != 2)
+		return true;
+
+	/*
+	 * We don't currently attempt to generate window function run conditions
+	 * when there are multiple window clauses.  It may seem possible to relax
+	 * this a bit and ensure we only put run conditions on the final window
+	 * clause to be evaluated, however, currently we've yet to determine the
+	 * order that the window clauses will be evaluated.  This's done later in
+	 * select_active_windows().  If we were to put run conditions on anything
+	 * apart from the final window clause to be evaluated then we may filter
+	 * rows that are required for a yet-to-be-evaluated window clause.
+	 */
+	if (list_length(subquery->windowClause) > 1)
+		return true;
+
+	/*
+	 * Check for plain Vars which reference window functions in the subquery.
+	 * If we find any, we'll ask find_window_run_conditions() if 'opexpr' can
+	 * be used as a run condition to allow us to stop windowagg execution
+	 * early.
+	 */
+
+	/* Check the left side of the OpExpr */
+	var1 = linitial(opexpr->args);
+	if (IsA(var1, Var) && var1->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var1->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		if (find_window_run_conditions(subquery, rte, rti, tle->resno, wfunc,
+									   opexpr, true, &keep_original))
+			return keep_original;
+	}
+
+	/* and check the right side */
+	var2 = lsecond(opexpr->args);
+	if (IsA(var2, Var) && var2->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var2->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		if (find_window_run_conditions(subquery, rte, rti, tle->resno, wfunc,
+									   opexpr, false, &keep_original))
+			return keep_original;
+	}
+
+	return true;
+}
+
 /*
  * set_subquery_pathlist
  *		Generate SubqueryScan access paths for a subquery RTE
@@ -2245,19 +2553,30 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		foreach(l, rel->baserestrictinfo)
 		{
 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+			Node	   *clause = (Node *) rinfo->clause;
 
 			if (!rinfo->pseudoconstant &&
 				qual_is_pushdown_safe(subquery, rti, rinfo, &safetyInfo))
 			{
-				Node	   *clause = (Node *) rinfo->clause;
-
 				/* Push it down */
 				subquery_push_qual(subquery, rte, rti, clause);
 			}
 			else
 			{
-				/* Keep it in the upper query */
-				upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				/*
+				 * Since we can't push the qual down into the subquery, check
+				 * if it happens to reference a window function.  If so then
+				 * it might be useful to allow the window evaluation to stop
+				 * early.
+				 */
+				if (check_and_push_window_quals(subquery, rte, rti, clause))
+				{
+					/*
+					 * It's not a suitable window run condition qual or it is,
+					 * but the original must also be kept in the upper query.
+					 */
+					upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				}
 			}
 		}
 		rel->baserestrictinfo = upperrestrictlist;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fa069a217c..1c750bd7a4 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -288,6 +288,7 @@ static WindowAgg *make_windowagg(List *tlist, Index winref,
 								 int frameOptions, Node *startOffset, Node *endOffset,
 								 Oid startInRangeFunc, Oid endInRangeFunc,
 								 Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
+								 List *runcondition, List *runconditionorig,
 								 Plan *lefttree);
 static Group *make_group(List *tlist, List *qual, int numGroupCols,
 						 AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
@@ -2641,6 +2642,8 @@ create_windowagg_plan(PlannerInfo *root, WindowAggPath *best_path)
 						  wc->inRangeColl,
 						  wc->inRangeAsc,
 						  wc->inRangeNullsFirst,
+						  wc->runcondition,
+						  wc->runconditionorig,
 						  subplan);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6525,7 +6528,7 @@ make_windowagg(List *tlist, Index winref,
 			   int frameOptions, Node *startOffset, Node *endOffset,
 			   Oid startInRangeFunc, Oid endInRangeFunc,
 			   Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
-			   Plan *lefttree)
+			   List *runcondition, List *runconditionorig, Plan *lefttree)
 {
 	WindowAgg  *node = makeNode(WindowAgg);
 	Plan	   *plan = &node->plan;
@@ -6542,6 +6545,8 @@ make_windowagg(List *tlist, Index winref,
 	node->frameOptions = frameOptions;
 	node->startOffset = startOffset;
 	node->endOffset = endOffset;
+	node->runcondition = runcondition;
+	node->runconditionorig = runconditionorig;
 	node->startInRangeFunc = startInRangeFunc;
 	node->endInRangeFunc = endInRangeFunc;
 	node->inRangeColl = inRangeColl;
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index a7b11b7f03..eaeba17088 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -897,6 +897,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
 				wplan->endOffset =
 					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+				wplan->runcondition = fix_scan_list(root,
+													wplan->runcondition,
+													rtoffset,
+													NUM_EXEC_TLIST(plan));
+				wplan->runconditionorig = fix_scan_list(root,
+														wplan->runconditionorig,
+														rtoffset,
+														NUM_EXEC_TLIST(plan));
 			}
 			break;
 		case T_Result:
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index 4a87114a4f..b4bcc26d6c 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -24,6 +24,7 @@
 #include "nodes/supportnodes.h"
 #include "optimizer/optimizer.h"
 #include "utils/builtins.h"
+#include "utils/lsyscache.h"
 
 
 typedef struct
@@ -791,6 +792,7 @@ int8dec(PG_FUNCTION_ARGS)
 }
 
 
+
 /*
  * These functions are exactly like int8inc/int8dec but are used for
  * aggregates that count only non-null values.  Since the functions are
@@ -818,6 +820,49 @@ int8dec_any(PG_FUNCTION_ARGS)
 	return int8dec(fcinfo);
 }
 
+/*
+ * int8inc_support
+ *		prosupport function for int8inc() and int8inc_any()
+ */
+Datum
+int8inc_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+		MonotonicFunction monotonic = MONOTONICFUNC_NONE;
+		int			frameOptions = req->window_clause->frameOptions;
+
+		/* No ORDER BY clause then all rows are peers */
+		if (req->window_clause->orderClause == NIL)
+			monotonic = MONOTONICFUNC_BOTH;
+		else
+		{
+			/*
+			 * Otherwise take into account the frame options.  When the frame
+			 * bound is the start of the window then the resulting value can
+			 * never decrease, therefore is monotonically increasing
+			 */
+			if (frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING)
+				monotonic |= MONOTONICFUNC_INCREASING;
+
+			/*
+			 * Likewise, if the frame bound is the end of the window then the
+			 * resulting value can never increase.
+			 */
+			if (frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING)
+				monotonic |= MONOTONICFUNC_DECREASING;
+		}
+
+		req->monotonic = monotonic;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 
 Datum
 int8larger(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/adt/windowfuncs.c b/src/backend/utils/adt/windowfuncs.c
index 3e0cc9be1a..596564fa15 100644
--- a/src/backend/utils/adt/windowfuncs.c
+++ b/src/backend/utils/adt/windowfuncs.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "nodes/supportnodes.h"
 #include "utils/builtins.h"
 #include "windowapi.h"
 
@@ -88,6 +89,26 @@ window_row_number(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(curpos + 1);
 }
 
+/*
+ * window_row_number_support
+ *		prosupport function for window_row_number()
+ */
+Datum
+window_row_number_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* row_number() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
 
 /*
  * rank
@@ -110,6 +131,27 @@ window_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_rank_support
+ *		prosupport function for window_rank()
+ */
+Datum
+window_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* rank() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * dense_rank
  * Rank increases by 1 when key columns change.
@@ -130,6 +172,27 @@ window_dense_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_dense_rank_support
+ *		prosupport function for window_dense_rank()
+ */
+Datum
+window_dense_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* dense_rank() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * percent_rank
  * return fraction between 0 and 1 inclusive,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d8e8715ed1..990a5098d5 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6641,11 +6641,16 @@
 # count has two forms: count(any) and count(*)
 { oid => '2147',
   descr => 'number of input rows for which the input expression is not null',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => 'any', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => 'any',
+  prosrc => 'aggregate_dummy' },
 { oid => '2803', descr => 'number of input rows',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => '', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'aggregate_dummy' },
+{ oid => '8802', descr => 'planner support for count run condition',
+  proname => 'int8inc_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'int8inc_support' },
 
 { oid => '2718',
   descr => 'population variance of bigint input values (square of the population standard deviation)',
@@ -10079,14 +10084,26 @@
 
 # SQL-spec window functions
 { oid => '3100', descr => 'row number within partition',
-  proname => 'row_number', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_row_number' },
+  proname => 'row_number', prosupport => 'window_row_number_support',
+  prokind => 'w', proisstrict => 'f',  prorettype => 'int8',
+  proargtypes => '', prosrc => 'window_row_number' },
+{ oid => '8799', descr => 'planner support for row_number run condition',
+  proname => 'window_row_number_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_row_number_support' },
 { oid => '3101', descr => 'integer rank with gaps',
-  proname => 'rank', prokind => 'w', proisstrict => 'f', prorettype => 'int8',
+  proname => 'rank', prosupport => 'window_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8',
   proargtypes => '', prosrc => 'window_rank' },
+{ oid => '8800', descr => 'planner support for rank run condition',
+  proname => 'window_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_rank_support' },
 { oid => '3102', descr => 'integer rank without gaps',
-  proname => 'dense_rank', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_dense_rank' },
+  proname => 'dense_rank', prosupport => 'window_dense_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'window_dense_rank' },
+{ oid => '8801', descr => 'planner support for dense rank run condition',
+  proname => 'window_dense_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_dense_rank_support' },
 { oid => '3103', descr => 'fractional rank within partition',
   proname => 'percent_rank', prokind => 'w', proisstrict => 'f',
   prorettype => 'float8', proargtypes => '', prosrc => 'window_percent_rank' },
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 44dd73fc80..e41850eb9f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2424,6 +2424,10 @@ typedef struct WindowAggState
 	MemoryContext curaggcontext;	/* current aggregate's working data */
 	ExprContext *tmpcontext;	/* short-term evaluation context */
 
+	ExprState  *runcondition;	/* Condition which must remain true otherwise
+								 * execution of the WindowAgg will finish, or
+								 * NULL if there is no such condition. */
+
 	bool		all_first;		/* true if the scan is starting */
 	bool		all_done;		/* true if the scan is finished */
 	bool		partition_spooled;	/* true if all tuples in current partition
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 5d075f0c34..5214575d89 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -527,7 +527,8 @@ typedef enum NodeTag
 	T_SupportRequestSelectivity,	/* in nodes/supportnodes.h */
 	T_SupportRequestCost,		/* in nodes/supportnodes.h */
 	T_SupportRequestRows,		/* in nodes/supportnodes.h */
-	T_SupportRequestIndexCondition	/* in nodes/supportnodes.h */
+	T_SupportRequestIndexCondition, /* in nodes/supportnodes.h */
+	T_SupportRequestWFuncMonotonic	/* in nodes/supportnodes.h */
 } NodeTag;
 
 /*
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 2f618cb229..4881daf887 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1396,6 +1396,8 @@ typedef struct WindowClause
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runcondition;	/* Exec WindowAgg while this is true */
+	List	   *runconditionorig;	/* EXPLAIN compatible version of above */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
 	Oid			inRangeColl;	/* collation for in_range tests */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 0b518ce6b2..2330b90a03 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -911,6 +911,9 @@ typedef struct WindowAgg
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runcondition;	/* Conditions that must remain true in order
+								 * for execution to continue */
+	List	   *runconditionorig;	/* runcondition for display in EXPLAIN */
 	/* these fields are used with RANGE offset PRECEDING/FOLLOWING: */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
@@ -1309,4 +1312,21 @@ typedef struct PlanInvalItem
 	uint32		hashValue;		/* hash value of object's cache lookup key */
 } PlanInvalItem;
 
+/*
+ * MonotonicFunction
+ *
+ * Allows the planner to track monotonic properties of functions.  A function
+ * is monotonically increasing if a subsequent call cannot yield a lower value
+ * than the previous call.  A monotonically decreasing function cannot yield a
+ * higher value on subsequent calls, and a function which is both must return
+ * the same value on each call.
+ */
+typedef enum MonotonicFunction
+{
+	MONOTONICFUNC_NONE = 0,
+	MONOTONICFUNC_INCREASING = (1 << 0),
+	MONOTONICFUNC_DECREASING = (1 << 1),
+	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING
+} MonotonicFunction;
+
 #endif							/* PLANNODES_H */
diff --git a/src/include/nodes/supportnodes.h b/src/include/nodes/supportnodes.h
index 88b61b3ab3..0538a45094 100644
--- a/src/include/nodes/supportnodes.h
+++ b/src/include/nodes/supportnodes.h
@@ -33,12 +33,12 @@
 #ifndef SUPPORTNODES_H
 #define SUPPORTNODES_H
 
-#include "nodes/primnodes.h"
+#include "nodes/plannodes.h"
 
 struct PlannerInfo;				/* avoid including pathnodes.h here */
 struct IndexOptInfo;
 struct SpecialJoinInfo;
-
+struct WindowClause;
 
 /*
  * The Simplify request allows the support function to perform plan-time
@@ -239,4 +239,61 @@ typedef struct SupportRequestIndexCondition
 								 * equivalent of the function call */
 } SupportRequestIndexCondition;
 
+/* ----------
+ * To allow more efficient execution of any monotonically increasing and/or
+ * monotonically decreasing window functions, we support calling the window
+ * function's prosupport function passing along this struct whenever the
+ * planner sees an OpExpr qual directly reference a window function in a
+ * subquery.  When the planner encounters this, we populate this struct and
+ * pass it along to the window function's prosupport function so that it can
+ * evaluate if the given WindowFunc is;
+ *
+ * a) monotonically increasing, or
+ * b) monotonically decreasing, or
+ * c) both monotonically increasing and decreasing, or
+ * d) none of the above.
+ *
+ * A function that is monotonically increasing can never return a value that
+ * is lower than a value returned in a "previous call".  A monotonically
+ * decreasing function can never return a value higher than a value returned
+ * in a previous call.  A function that is both must return the same value
+ * each time.
+ *
+ * We define "previous call" to mean a previous call to the same WindowFunc
+ * struct in the same window partition.
+ *
+ * row_number() is an example of a monotonically increasing function.  The
+ * return value will be reset back to 1 in each new partition.  An example of
+ * a monotonically increasing and decreasing function is COUNT(*) OVER ().
+ * Since there is no ORDER BY clause in this example, all rows in the
+ * partition are peers and all rows within the partition will be within the
+ * frame bound.  Likewise for COUNT(*) OVER(ORDER BY a ROWS BETWEEN UNBOUNDED
+ * PRECEDING AND UNBOUNDED FOLLOWING).
+ *
+ * COUNT(*) OVER (ORDER BY a ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
+ * is an example of a monotonically decreasing function.
+ *
+ * Inputs:
+ *	'window_func' is the pointer to the window function being called.
+ *
+ *	'window_clause' pointer to the WindowClause data.  Support functions can
+ *	use this to check frame bounds and partition by clauses, etc.
+ *
+ * Outputs:
+ *	'monotonic' the resulting MonotonicFunction value for the given input
+ *	window function and window clause.
+ * ----------
+ */
+typedef struct SupportRequestWFuncMonotonic
+{
+	NodeTag		type;
+
+	/* Input fields: */
+	WindowFunc *window_func;	/* Pointer to the window function data */
+	struct WindowClause *window_clause; /* Pointer to the window clause data */
+
+	/* Output fields: */
+	MonotonicFunction monotonic;
+} SupportRequestWFuncMonotonic;
+
 #endif							/* SUPPORTNODES_H */
diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out
index bb9ff7f07b..fdf053ac83 100644
--- a/src/test/regress/expected/window.out
+++ b/src/test/regress/expected/window.out
@@ -3336,6 +3336,321 @@ WHERE depname = 'sales';
                            ->  Seq Scan on empsalary
 (9 rows)
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                  QUERY PLAN                  
+----------------------------------------------
+ WindowAgg
+   Run Condition: (row_number() OVER (?) < 3)
+   ->  Sort
+         Sort Key: empsalary.empno
+         ->  Seq Scan on empsalary
+(5 rows)
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+               QUERY PLAN                
+-----------------------------------------
+ WindowAgg
+   Run Condition: (rank() OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+ empno | salary | r 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 2
+    11 |   5200 | 2
+(3 rows)
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Run Condition: (dense_rank() OVER (?) <= 1)
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+ empno | salary | dr 
+-------+--------+----
+     8 |   6000 |  1
+(1 row)
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                       QUERY PLAN                        
+---------------------------------------------------------
+ WindowAgg
+   Run Condition: (count(empsalary.empno) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) >= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER () c
+   FROM empsalary) emp
+WHERE 11 <= c;
+                 QUERY PLAN                 
+--------------------------------------------
+ WindowAgg
+   Run Condition: (11 <= count(*) OVER (?))
+   ->  Seq Scan on empsalary
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Run Condition: (dense_rank() OVER (?) <= 1)
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+-- Ensure we don't push down the clause when there is a PARTITION BY clause
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.rn < 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.depname, empsalary.empno
+               ->  Seq Scan on empsalary
+(6 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.c <= 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.depname, empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(6 rows)
+
+-- Ensure we don't push down when the frame options show that the window
+-- function is not monotonically increasing
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                  QUERY PLAN                   
+-----------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.c <= 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(6 rows)
+
+-- Ensure we don't push down when the window function's monotonic properties
+-- don't match that of the clauses.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+                QUERY PLAN                
+------------------------------------------
+ Subquery Scan on emp
+   Filter: (3 <= emp.c)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary
+               ->  Seq Scan on empsalary
+(6 rows)
+
+-- Ensure we don't pushdown when there are multiple window clauses to evaluate
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY empno DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                        QUERY PLAN                         
+-----------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.empno DESC
+               ->  WindowAgg
+                     ->  Sort
+                           Sort Key: empsalary.salary DESC
+                           ->  Seq Scan on empsalary
+(9 rows)
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
diff --git a/src/test/regress/sql/window.sql b/src/test/regress/sql/window.sql
index 41a8e0d152..094045b873 100644
--- a/src/test/regress/sql/window.sql
+++ b/src/test/regress/sql/window.sql
@@ -988,6 +988,171 @@ SELECT * FROM
    FROM empsalary) emp
 WHERE depname = 'sales';
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER () c
+   FROM empsalary) emp
+WHERE 11 <= c;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+
+-- Ensure we don't push down the clause when there is a PARTITION BY clause
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+-- Ensure we don't push down when the frame options show that the window
+-- function is not monotonically increasing
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+-- Ensure we don't push down when the window function's monotonic properties
+-- don't match that of the clauses.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+
+-- Ensure we don't pushdown when there are multiple window clauses to evaluate
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY empno DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
-- 
2.32.0

#17

David Rowley

dgrowleyml@gmail.com

almost 4 years ago

In reply to: David Rowley (#12)

Re: Window Function "Run Conditions"

On Wed, 23 Mar 2022 at 11:24, David Rowley <dgrowleyml@gmail.com> wrote:

I think it's safer to just disable the optimisation when there are
multiple window clauses. Multiple matching clauses are merged
already, so it's perfectly valid to have multiple window functions,
it's just they must share the same window clause. I don't think
that's terrible as with the major use case that I have in mind for
this, the window function is only added to limit the number of rows.
In most cases I can imagine, there'd be no reason to have an
additional window function with different frame options.

I've not looked into the feasibility of it, but I had a thought that
we could just accumulate all the run-conditions in a new field in the
PlannerInfo then just tag them onto the top-level WindowAgg when
building the plan.

I'm just not sure it would be any more useful than what the v3 patch
is currently doing as intermediate WindowAggs would still need to
process all rows. I think it would only save the window function
evaluation of the top-level WindowAgg for rows that don't match the
run-condition. All the supported window functions are quite cheap, so
it's not a huge saving. I'd bet there would be example cases where it
would be measurable though.

David

#18

David G. Johnston

david.g.johnston@gmail.com

almost 4 years ago

In reply to: David Rowley (#14)

Re: Window Function "Run Conditions"

On Tue, Mar 22, 2022 at 3:39 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 17 Mar 2022 at 17:04, Corey Huinker <corey.huinker@gmail.com>
wrote:

It seems like this effort would aid in implementing what some other
databases implement via the QUALIFY clause, which is to window functions
what HAVING is to aggregate functions.

example:

https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#qualify_clause

Isn't that just syntactic sugar? You could get the same by adding a
subquery and a WHERE clause that filters rows after the window clause
has been evaluated.

I'd like some of that syntactic sugar please. It goes nicely with my
HAVING syntactic coffee.

David J.

#19

David Rowley

dgrowleyml@gmail.com

almost 4 years ago

In reply to: David Rowley (#17)

1 attachment(s)

Re: Window Function "Run Conditions"

Another way of doing this that seems better is to make it so only the
top-level WindowAgg will stop processing when the run condition
becomes false. Any intermediate WindowAggs must continue processing
tuples, but may skip evaluation of their WindowFuncs.

Doing things this way also allows us to handle cases where there is a
PARTITION BY clause, however, in this case, the top-level WindowAgg
must not stop processing and return NULL, instead, it can just act as
if it were an intermediate WindowAgg and just stop evaluating
WindowFuncs. The top-level WindowAgg must continue processing the
tuples so that the other partitions are also processed.
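
The behaviour described above can be sketched as a small state machine.
This is only a toy model, not code from the patch: WINDOWAGG_RUN and
WINDOWAGG_DONE are names the patch uses, while WINDOWAGG_PASSTHROUGH and
everything prefixed Toy/toy_ are assumptions made for illustration. The
run condition is hard-wired to "row_number() <= limit".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status values; only RUN and DONE are taken from the patch. */
typedef enum
{
    WINDOWAGG_RUN,          /* evaluate window functions normally */
    WINDOWAGG_PASSTHROUGH,  /* keep reading rows, skip window functions */
    WINDOWAGG_DONE          /* no more rows will be produced */
} ToyStatus;

typedef struct
{
    bool      top_level;        /* is this the top-most WindowAgg? */
    bool      has_partition_by; /* does the window clause partition? */
    long      limit;            /* run condition: row_number() <= limit */
    ToyStatus status;
    long      rows_read;        /* rows pulled from the subnode */
    long      funcs_evaluated;  /* window function evaluations */
} ToyWindowAgg;

/*
 * Feed nrows input rows, already sorted by partition key, through the toy
 * node and record how much work it does.
 */
static void
toy_windowagg_run(ToyWindowAgg *w, const int *partkey, size_t nrows)
{
    long rn = 0;

    assert(w != NULL && (partkey != NULL || nrows == 0));

    w->status = WINDOWAGG_RUN;
    w->rows_read = 0;
    w->funcs_evaluated = 0;

    for (size_t i = 0; i < nrows; i++)
    {
        w->rows_read++;

        /* A new partition resets row_number() and ends pass-through mode. */
        if (i == 0 || partkey[i] != partkey[i - 1])
        {
            rn = 0;
            w->status = WINDOWAGG_RUN;
        }

        /* In pass-through mode the row is read but nothing is evaluated. */
        if (w->status != WINDOWAGG_RUN)
            continue;

        rn++;
        w->funcs_evaluated++;

        if (rn > w->limit)
        {
            /* Only a top-level node without PARTITION BY may stop early. */
            if (w->top_level && !w->has_partition_by)
            {
                w->status = WINDOWAGG_DONE;
                return;
            }
            w->status = WINDOWAGG_PASSTHROUGH;
        }
    }

    w->status = WINDOWAGG_DONE;
}
```

With eight input rows and limit = 2, a top-level node without PARTITION BY
reads three rows and stops, whereas an intermediate node, or a top-level
node with PARTITION BY, reads all eight rows but only evaluates the window
function three times per partition.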

I made the v4 patch do things this way and tested the performance of
it vs current master. Test 1 and 2 have PARTITION BY clauses. There's
a small performance increase from not evaluating the row_number()
function once rn <= 2 is no longer true.

Test 3 shows the same speedup as the original patch where we just stop
processing any further tuples when the run condition is no longer true
and there is no PARTITION BY clause.

Setup:
create table xy (x int, y int);
insert into xy select x,y from generate_series(1,1000)x,
generate_series(1,1000)y;
create index on xy(x,y);
vacuum analyze xy;

Test 1:

explain analyze select * from (select x,y,row_number() over (partition
by x order by y) rn from xy) as xy where rn <= 2;

Master:

Execution Time: 359.553 ms
Execution Time: 354.235 ms
Execution Time: 357.646 ms

v4 patch:

Execution Time: 346.641 ms
Execution Time: 337.131 ms
Execution Time: 336.531 ms

(5% faster)

Test 2:

explain analyze select * from (select x,y,row_number() over (partition
by x order by y) rn from xy) as xy where rn = 1;

Master:

Execution Time: 359.046 ms
Execution Time: 357.601 ms
Execution Time: 357.977 ms

v4 patch:

Execution Time: 336.540 ms
Execution Time: 337.024 ms
Execution Time: 342.706 ms

(5.7% faster)

Test 3:

explain analyze select * from (select x,y,row_number() over (order by
x,y) rn from xy) as xy where rn <= 2;

Master:

Execution Time: 362.322 ms
Execution Time: 348.812 ms
Execution Time: 349.471 ms

v4 patch:

Execution Time: 0.060 ms
Execution Time: 0.037 ms
Execution Time: 0.037 ms

(~8000x faster)
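
For anyone wanting to check the percentages, they can be reproduced from
the timings above by taking the mean of the three runs for each build and
dividing master by patched. The helper names below are made up for this
sketch; only the numbers come from the results above. The ratios work out
to roughly 1.050, 1.057 and 7,900 for tests 1, 2 and 3.

```c
#include <assert.h>
#include <stddef.h>

/* Execution times in milliseconds, copied from the results above. */
static const double test1_master[] = {359.553, 354.235, 357.646};
static const double test1_v4[]     = {346.641, 337.131, 336.531};
static const double test2_master[] = {359.046, 357.601, 357.977};
static const double test2_v4[]     = {336.540, 337.024, 342.706};
static const double test3_master[] = {362.322, 348.812, 349.471};
static const double test3_v4[]     = {0.060, 0.037, 0.037};

/* Arithmetic mean of n timings. */
static double
toy_mean(const double *ms, size_t n)
{
    double sum = 0.0;

    assert(ms != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        sum += ms[i];
    return sum / (double) n;
}

/* How many times faster the patched build is, based on mean run time. */
static double
toy_speedup(const double *master, const double *patched, size_t n)
{
    return toy_mean(master, n) / toy_mean(patched, n);
}
```

Three runs per test is a small sample, so the 5% figures for the
PARTITION BY cases should be read as indicative rather than precise.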

One thing which I'm not sure about with the patch is how I'm handling
the evaluation of the runcondition in nodeWindowAgg.c. Instead of
having ExecQual() evaluate an OpExpr such as "row_number() over (...)
<= 10", I'm replacing the WindowFunc with the Var in the targetlist
that corresponds to the given WindowFunc. This saves having to double
evaluate the WindowFunc. Instead, the value of the Var can be taken
directly from the slot. I don't know of anywhere else we do things
quite like that. The runcondition is slightly similar to HAVING
clauses, but HAVING clauses don't work this way. Maybe they would
have if slots had existed back then. Or maybe it's a bad idea to set a
precedent that the targetlist Vars must be evaluated already. Does
anyone have any thoughts on this part?
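
To make the question concrete, here is a toy sketch of the idea. It is not
the nodeWindowAgg.c code: ToySlot stands in for a TupleTableSlot,
ToyRunCondition for the rewritten OpExpr whose WindowFunc has been replaced
by a reference to a slot attribute, and all names are invented for
illustration. The point is that the window function is evaluated once
while the row is projected, and the run condition then only reads the
stored value.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NATTS 2

/* Stand-in for a tuple slot holding already-computed column values. */
typedef struct
{
    long values[TOY_NATTS];
} ToySlot;

/* Stand-in for "Var(attno) <= bound" after the WindowFunc was replaced. */
typedef struct
{
    int  attno;
    long bound;
} ToyRunCondition;

/* Toy row_number(): counts how many times it has been evaluated. */
static long
toy_row_number(long *state, long *evals)
{
    (*evals)++;
    return ++(*state);
}

/* Check the run condition by reading the slot, not by re-evaluating. */
static bool
toy_runcondition_holds(const ToyRunCondition *rc, const ToySlot *slot)
{
    assert(rc->attno >= 0 && rc->attno < TOY_NATTS);
    return slot->values[rc->attno] <= rc->bound;
}

/* Returns the number of rows emitted before the run condition failed. */
static long
toy_scan(long nrows, const ToyRunCondition *rc, long *evals)
{
    long state = 0;
    long emitted = 0;

    *evals = 0;
    for (long i = 0; i < nrows; i++)
    {
        ToySlot slot;

        slot.values[0] = i;                             /* ordinary column */
        slot.values[1] = toy_row_number(&state, evals); /* evaluated once */

        if (!toy_runcondition_holds(rc, &slot))
            break;              /* run condition false: stop the scan */
        emitted++;
    }
    return emitted;
}
```

With rc = {1, 10} and 1000 input rows, this emits 10 rows and evaluates
the toy row_number() 11 times, once per row including the one that fails
the condition.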

v4 patch attached.

David

Attachments:

v4-0001-Teach-planner-and-executor-about-monotonic-window.patch (text/plain; charset=US-ASCII)

From 40dc5b3d521bc3490396bb6c3c2fb78714ef5242 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Sat, 8 May 2021 23:43:25 +1200
Subject: [PATCH v4] Teach planner and executor about monotonic window funcs

Window functions such as row_number() always return a value higher than
the previously returned value in any given partition.  If a query were to
only request the first few row numbers, then traditionally we would
continue evaluating the WindowAgg node until all tuples are exhausted.
However, if someone only wanted, say, all row numbers <= 10, then we could
just stop once we get to number 11.

Here we implement means to do just that.  This is done by way of adding a
pg_proc.prosupport function to various of the built-in window functions
and adding supporting code to allow the support function to inform the
planner if the function is monotonically increasing, monotonically
decreasing, both or neither.  The planner is then able to make use of that
information and possibly allow the executor to short-circuit execution by
way of adding a "run condition" to the WindowAgg to instruct it to run
while this condition is true and stop when it becomes false.

When there are multiple WindowAgg nodes to evaluate then this complicates
the situation as if we were to stop execution on a lower-level WindowAgg
then an upper-level WindowAgg may not receive all of the tuples it should.
This may lead to incorrect query results.  To get around this problem all
non-top-level WindowAggs go into "passthrough" mode when their
runcondition is no longer true.  This means that they continue to pull
tuples from their subnode but no longer evaluate their window functions.
Only the top-level WindowAgg node may stop when the runcondition is no
longer true.

Here we add prosupport functions to allow this to work for row_number(),
rank(), dense_rank(), count(*) and count(expr).  It appears technically
possible to do the same for min() and max(), however, it seems unlikely to
be useful enough, so that's not done here.

Author: David Rowley
Reviewed-by: Andy Fan, Zhihong Yu
Discussion: https://postgr.es/m/CAApHDvqvp3At8++yF8ij06sdcoo1S_b2YoaT9D4Nf+MObzsrLQ@mail.gmail.com
---
 src/backend/commands/explain.c          |   4 +
 src/backend/executor/nodeWindowAgg.c    |  95 +++++--
 src/backend/nodes/copyfuncs.c           |   5 +
 src/backend/nodes/equalfuncs.c          |   2 +
 src/backend/nodes/outfuncs.c            |   6 +
 src/backend/nodes/readfuncs.c           |   5 +
 src/backend/optimizer/path/allpaths.c   | 297 +++++++++++++++++++-
 src/backend/optimizer/plan/createplan.c |  12 +-
 src/backend/optimizer/plan/planner.c    |  18 +-
 src/backend/optimizer/plan/setrefs.c    | 102 +++++++
 src/backend/optimizer/util/pathnode.c   |   4 +-
 src/backend/utils/adt/int8.c            |  45 +++
 src/backend/utils/adt/windowfuncs.c     |  63 +++++
 src/include/catalog/pg_proc.dat         |  35 ++-
 src/include/nodes/execnodes.h           |  20 +-
 src/include/nodes/nodes.h               |   3 +-
 src/include/nodes/parsenodes.h          |   2 +
 src/include/nodes/pathnodes.h           |   2 +
 src/include/nodes/plannodes.h           |  23 ++
 src/include/nodes/supportnodes.h        |  64 ++++-
 src/include/optimizer/pathnode.h        |   3 +-
 src/test/regress/expected/window.out    | 359 ++++++++++++++++++++++++
 src/test/regress/sql/window.sql         | 186 ++++++++++++
 23 files changed, 1312 insertions(+), 43 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index cb13227db1..5936093ab2 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1988,6 +1988,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_WindowAgg:
+			show_upper_qual(((WindowAgg *) plan)->runConditionOrig,
+							"Run Condition", planstate, ancestors, es);
+			break;
 		case T_Group:
 			show_group_keys(castNode(GroupState, planstate), ancestors, es);
 			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 08ce05ca5a..65051eb9ea 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -2023,13 +2023,14 @@ static TupleTableSlot *
 ExecWindowAgg(PlanState *pstate)
 {
 	WindowAggState *winstate = castNode(WindowAggState, pstate);
+	TupleTableSlot *slot;
 	ExprContext *econtext;
 	int			i;
 	int			numfuncs;
 
 	CHECK_FOR_INTERRUPTS();
 
-	if (winstate->all_done)
+	if (winstate->status == WINDOWAGG_DONE)
 		return NULL;
 
 	/*
@@ -2131,10 +2132,14 @@ ExecWindowAgg(PlanState *pstate)
 		{
 			begin_partition(winstate);
 			Assert(winstate->spooled_rows > 0);
+
+			/* Come out of pass-through mode when changing partition */
+			winstate->status = WINDOWAGG_RUN;
 		}
 		else
 		{
-			winstate->all_done = true;
+			/* No further partitions?  We're done */
+			winstate->status = WINDOWAGG_DONE;
 			return NULL;
 		}
 	}
@@ -2185,26 +2190,30 @@ ExecWindowAgg(PlanState *pstate)
 			elog(ERROR, "unexpected end of tuplestore");
 	}
 
-	/*
-	 * Evaluate true window functions
-	 */
-	numfuncs = winstate->numfuncs;
-	for (i = 0; i < numfuncs; i++)
+	/* don't evaluate the window functions when we're in pass-through mode */
+	if (winstate->status == WINDOWAGG_RUN)
 	{
-		WindowStatePerFunc perfuncstate = &(winstate->perfunc[i]);
+		/*
+		 * Evaluate true window functions
+		 */
+		numfuncs = winstate->numfuncs;
+		for (i = 0; i < numfuncs; i++)
+		{
+			WindowStatePerFunc perfuncstate = &(winstate->perfunc[i]);
 
-		if (perfuncstate->plain_agg)
-			continue;
-		eval_windowfunction(winstate, perfuncstate,
-							&(econtext->ecxt_aggvalues[perfuncstate->wfuncstate->wfuncno]),
-							&(econtext->ecxt_aggnulls[perfuncstate->wfuncstate->wfuncno]));
-	}
+			if (perfuncstate->plain_agg)
+				continue;
+			eval_windowfunction(winstate, perfuncstate,
+				&(econtext->ecxt_aggvalues[perfuncstate->wfuncstate->wfuncno]),
+				&(econtext->ecxt_aggnulls[perfuncstate->wfuncstate->wfuncno]));
+		}
 
-	/*
-	 * Evaluate aggregates
-	 */
-	if (winstate->numaggs > 0)
-		eval_windowaggregates(winstate);
+		/*
+		 * Evaluate aggregates
+		 */
+		if (winstate->numaggs > 0)
+			eval_windowaggregates(winstate);
+	}
 
 	/*
 	 * If we have created auxiliary read pointers for the frame or group
@@ -2235,7 +2244,33 @@ ExecWindowAgg(PlanState *pstate)
 	 */
 	econtext->ecxt_outertuple = winstate->ss.ss_ScanTupleSlot;
 
-	return ExecProject(winstate->ss.ps.ps_ProjInfo);
+	slot = ExecProject(winstate->ss.ps.ps_ProjInfo);
+
+	/*
+	 * Now evaluate the run condition to see if we need to continue
+	 * evaluating the window functions or if we can stop completely.
+	 */
+	econtext->ecxt_scantuple = slot;
+	if (winstate->status == WINDOWAGG_RUN &&
+		!ExecQual(winstate->runcondition, econtext))
+	{
+		/*
+		 * When the runcondition is no longer true we can either abort
+		 * execution or go into pass-through mode so that we continue to pull
+		 * tuples from our subnode but just skip evaluation of the window
+		 * functions.  Which of these we perform depends on the value of the
+		 * use_pass_through field.
+		 */
+		if (winstate->use_pass_through)
+			winstate->status = WINDOWAGG_PASSTHROUGH;
+		else
+		{
+			winstate->status = WINDOWAGG_DONE;
+			return NULL;
+		}
+	}
+
+	return slot;
 }
 
 /* -----------------
@@ -2307,6 +2342,21 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 	Assert(node->plan.qual == NIL);
 	winstate->ss.ps.qual = NULL;
 
+	/*
+	 * Set up the run condition, if we received one from the query planner.
+	 * When set, this may allow us to move into pass-through mode so that we
+	 * don't have to perform any further evaluation of WindowFuncs in the
+	 * current partition or possibly stop returning tuples altogether when all
+	 * tuples are in the same partition.
+	 */
+	if (node->runCondition != NIL)
+		winstate->runcondition = ExecInitQual(node->runCondition,
+											  (PlanState *) winstate);
+	else
+		winstate->runcondition = NULL;
+
+	winstate->use_pass_through = node->usePassThrough;
+
 	/*
 	 * initialize child nodes
 	 */
@@ -2500,6 +2550,9 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 		winstate->agg_winobj = agg_winobj;
 	}
 
+	/* Set the status to running */
+	winstate->status = WINDOWAGG_RUN;
+
 	/* copy frame options to state node for easy access */
 	winstate->frameOptions = frameOptions;
 
@@ -2579,7 +2632,7 @@ ExecReScanWindowAgg(WindowAggState *node)
 	PlanState  *outerPlan = outerPlanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 
-	node->all_done = false;
+	node->status = WINDOWAGG_RUN;
 	node->all_first = true;
 
 	/* release tuplestore et al */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index cb4b4d01f8..a544d2f918 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1103,11 +1103,14 @@ _copyWindowAgg(const WindowAgg *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runCondition);
+	COPY_NODE_FIELD(runConditionOrig);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
 	COPY_SCALAR_FIELD(inRangeAsc);
 	COPY_SCALAR_FIELD(inRangeNullsFirst);
+	COPY_SCALAR_FIELD(usePassThrough);
 
 	return newnode;
 }
@@ -2789,6 +2792,8 @@ _copyWindowClause(const WindowClause *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runCondition);
+	COPY_NODE_FIELD(runConditionOrig);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 084d98b34c..d8f4861eb8 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -3041,6 +3041,8 @@ _equalWindowClause(const WindowClause *a, const WindowClause *b)
 	COMPARE_SCALAR_FIELD(frameOptions);
 	COMPARE_NODE_FIELD(startOffset);
 	COMPARE_NODE_FIELD(endOffset);
+	COMPARE_NODE_FIELD(runCondition);
+	COMPARE_NODE_FIELD(runConditionOrig);
 	COMPARE_SCALAR_FIELD(startInRangeFunc);
 	COMPARE_SCALAR_FIELD(endInRangeFunc);
 	COMPARE_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 278e87259d..29df4012b7 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -828,11 +828,14 @@ _outWindowAgg(StringInfo str, const WindowAgg *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runCondition);
+	WRITE_NODE_FIELD(runConditionOrig);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
 	WRITE_BOOL_FIELD(inRangeAsc);
 	WRITE_BOOL_FIELD(inRangeNullsFirst);
+	WRITE_BOOL_FIELD(usePassThrough);
 }
 
 static void
@@ -2198,6 +2201,7 @@ _outWindowAggPath(StringInfo str, const WindowAggPath *node)
 
 	WRITE_NODE_FIELD(subpath);
 	WRITE_NODE_FIELD(winclause);
+	WRITE_BOOL_FIELD(usepassthrough);
 }
 
 static void
@@ -3208,6 +3212,8 @@ _outWindowClause(StringInfo str, const WindowClause *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runCondition);
+	WRITE_NODE_FIELD(runConditionOrig);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 5b9e235e9a..191c1872b5 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -384,6 +384,8 @@ _readWindowClause(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runCondition);
+	READ_NODE_FIELD(runConditionOrig);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
@@ -2467,11 +2469,14 @@ _readWindowAgg(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runCondition);
+	READ_NODE_FIELD(runConditionOrig);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
 	READ_BOOL_FIELD(inRangeAsc);
 	READ_BOOL_FIELD(inRangeNullsFirst);
+	READ_BOOL_FIELD(usePassThrough);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 169b1d53fc..2bb697fce3 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/supportnodes.h"
 #ifdef OPTIMIZER_DEBUG
 #include "nodes/print.h"
 #endif
@@ -2157,6 +2158,284 @@ has_multiple_baserels(PlannerInfo *root)
 	return false;
 }
 
+/*
+ * find_window_run_conditions
+ *		Determine if 'wfunc' is really a WindowFunc and call its prosupport
+ *		function to determine the function's monotonic properties.  We then
+ *		see if 'opexpr' can be used to short-circuit execution.
+ *
+ * For example row_number() over (order by ...) always produces a value one
+ * higher than the previous.  If someone has a window function such as that
+ * in a subquery and just wants, say all rows with a row number less than or
+ * equal to 10, then we may as well stop processing the windowagg once the row
+ * number reaches 11.  Here we look for opexprs that might help us to stop
+ * doing needless extra processing in WindowAgg nodes.
+ *
+ * To do this we make use of the window function's prosupport function to
+ * determine if the given window function with the given window clause is a
+ * monotonically increasing or decreasing function.
+ *
+ * '*keep_original' is set to true if the caller should also use 'opexpr' for
+ * its original purpose.  This is set to false if the caller can assume that
+ * the run condition will handle all of the required filtering.
+ *
+ * Returns true if a run condition qual was found and added to the
+ * appropriate WindowClause's 'runCondition' list and sets *keep_original
+ * accordingly.  Returns false if wfunc is not a WindowFunc or we're unable to
+ * use 'opexpr' as a run condition.
+ */
+static bool
+find_window_run_conditions(Query *subquery, RangeTblEntry *rte, Index rti,
+						   AttrNumber attno, WindowFunc *wfunc, OpExpr *opexpr,
+						   bool wfunc_left, bool *keep_original)
+{
+	Oid			prosupport;
+	Expr	   *otherexpr;
+	SupportRequestWFuncMonotonic req;
+	SupportRequestWFuncMonotonic *res;
+	WindowClause *wclause;
+	List	   *opinfos;
+	OpExpr	   *runopexpr;
+	Oid			runoperator;
+	ListCell   *lc;
+
+	*keep_original = true;
+
+	while (IsA(wfunc, RelabelType))
+		wfunc = (WindowFunc *) ((RelabelType *) wfunc)->arg;
+
+	/* we can only work with window functions */
+	if (!IsA(wfunc, WindowFunc))
+		return false;
+
+	/* find the window clause belonging to the window function */
+	wclause = (WindowClause *) list_nth(subquery->windowClause,
+										wfunc->winref - 1);
+
+	prosupport = get_func_support(wfunc->winfnoid);
+
+	/* Check if there's a support function for 'wfunc' */
+	if (!OidIsValid(prosupport))
+		return false;
+
+	/* get the Expr from the other side of the OpExpr */
+	if (wfunc_left)
+		otherexpr = lsecond(opexpr->args);
+	else
+		otherexpr = linitial(opexpr->args);
+
+	/*
+	 * The value being compared must not change during the evaluation of the
+	 * window partition.
+	 */
+	if (!is_pseudo_constant_clause((Node *) otherexpr))
+		return false;
+
+	req.type = T_SupportRequestWFuncMonotonic;
+	req.window_func = wfunc;
+	req.window_clause = wclause;
+
+	/* call the support function */
+	res = (SupportRequestWFuncMonotonic *)
+		DatumGetPointer(OidFunctionCall1(prosupport,
+										 PointerGetDatum(&req)));
+
+	/*
+	 * Nothing to do if the function is neither monotonically increasing nor
+	 * monotonically decreasing.
+	 */
+	if (res == NULL || res->monotonic == MONOTONICFUNC_NONE)
+		return false;
+
+	runopexpr = NULL;
+	runoperator = InvalidOid;
+	opinfos = get_op_btree_interpretation(opexpr->opno);
+
+	foreach(lc, opinfos)
+	{
+		OpBtreeInterpretation *opinfo = (OpBtreeInterpretation *) lfirst(lc);
+		int			strategy = opinfo->strategy;
+
+		/* handle < / <= */
+		if (strategy == BTLessStrategyNumber ||
+			strategy == BTLessEqualStrategyNumber)
+		{
+			/*
+			 * < / <= is supported for monotonically increasing functions in
+			 * the form <wfunc> op <pseudoconst> and <pseudoconst> op <wfunc>
+			 * for monotonically decreasing functions.
+			 */
+			if ((wfunc_left && (res->monotonic & MONOTONICFUNC_INCREASING)) ||
+				(!wfunc_left && (res->monotonic & MONOTONICFUNC_DECREASING)))
+			{
+				/*
+				 * We must keep the original qual in place if there is a
+				 * PARTITION BY clause as the top-level WindowAgg remains
+				 * in pass-through mode and does nothing to filter out
+				 * unwanted tuples.
+				 */
+				*keep_original = (wclause->partitionClause != NIL);
+				runopexpr = opexpr;
+				runoperator = opexpr->opno;
+			}
+			break;
+		}
+		/* handle > / >= */
+		else if (strategy == BTGreaterStrategyNumber ||
+				 strategy == BTGreaterEqualStrategyNumber)
+		{
+			/*
+			 * > / >= is supported for monotonically decreasing functions in
+			 * the form <wfunc> op <pseudoconst> and <pseudoconst> op <wfunc>
+			 * for monotonically increasing functions.
+			 */
+			if ((wfunc_left && (res->monotonic & MONOTONICFUNC_DECREASING)) ||
+				(!wfunc_left && (res->monotonic & MONOTONICFUNC_INCREASING)))
+			{
+				*keep_original = (wclause->partitionClause != NIL);
+				runopexpr = opexpr;
+				runoperator = opexpr->opno;
+			}
+			break;
+		}
+		/* handle = */
+		else if (strategy == BTEqualStrategyNumber)
+		{
+			int16		newstrategy;
+
+			/*
+			 * When both monotonically increasing and decreasing then the
+			 * return value of the window function will be the same each time.
+			 * We can simply use 'opexpr' as the run condition without
+			 * modifying it.
+			 */
+			if ((res->monotonic & MONOTONICFUNC_BOTH) == MONOTONICFUNC_BOTH)
+			{
+				*keep_original = (wclause->partitionClause != NIL);
+				runopexpr = opexpr;
+				break;
+			}
+
+			/*
+			 * When monotonically increasing we make a qual with <wfunc> <=
+			 * <value> or <value> >= <wfunc> in order to filter out values
+			 * which are above the value in the equality condition.  For
+			 * monotonically decreasing functions we want to filter values
+			 * below the value in the equality condition.
+			 */
+			if (res->monotonic & MONOTONICFUNC_INCREASING)
+				newstrategy = wfunc_left ? BTLessEqualStrategyNumber : BTGreaterEqualStrategyNumber;
+			else
+				newstrategy = wfunc_left ? BTGreaterEqualStrategyNumber : BTLessEqualStrategyNumber;
+
+			/* We must keep the original equality qual */
+			*keep_original = true;
+			runopexpr = opexpr;
+
+			/* determine the operator to use for the runCondition qual */
+			runoperator = get_opfamily_member(opinfo->opfamily_id,
+											  opinfo->oplefttype,
+											  opinfo->oprighttype,
+											  newstrategy);
+			break;
+		}
+	}
+
+	if (runopexpr != NULL)
+	{
+		Expr	   *newexpr;
+
+		/*
+		 * Build the qual required for the run condition keeping the
+		 * WindowFunc on the same side as it was originally.
+		 */
+		if (wfunc_left)
+			newexpr = make_opclause(runoperator,
+									runopexpr->opresulttype,
+									runopexpr->opretset, (Expr *) wfunc,
+									otherexpr, runopexpr->opcollid,
+									runopexpr->inputcollid);
+		else
+			newexpr = make_opclause(runoperator,
+									runopexpr->opresulttype,
+									runopexpr->opretset,
+									otherexpr, (Expr *) wfunc,
+									runopexpr->opcollid,
+									runopexpr->inputcollid);
+
+		wclause->runCondition = lappend(wclause->runCondition, newexpr);
+
+		/*
+		 * Store the qual twice.  This one will be unmodified for use in
+		 * EXPLAIN.
+		 */
+		wclause->runConditionOrig = lappend(wclause->runConditionOrig, newexpr);
+		return true;
+	}
+
+	/* unsupported OpExpr */
+	return false;
+}
+
+/*
+ * check_and_push_window_quals
+ *		Check if 'rinfo' is a qual that can be pushed into a WindowFunc's
+ *		WindowClause as a 'runCondition' qual.  These, when present, allow
+ *		some unnecessary work to be skipped during execution.
+ *
+ * Returns true if the caller still must keep the original qual or false if
+ * the caller can safely ignore the original qual because the window function
+ * will use the run condition to stop at the right time.
+ */
+static bool
+check_and_push_window_quals(Query *subquery, RangeTblEntry *rte, Index rti,
+							Node *clause)
+{
+	OpExpr	   *opexpr = (OpExpr *) clause;
+	bool		keep_original = true;
+	Var		   *var1;
+	Var		   *var2;
+
+	/* We're only able to use OpExprs with 2 operands */
+	if (!IsA(opexpr, OpExpr))
+		return true;
+
+	if (list_length(opexpr->args) != 2)
+		return true;
+
+	/*
+	 * Check for plain Vars that reference window functions in the subquery.
+	 * If we find any, we'll ask find_window_run_conditions() if 'opexpr' can
+	 * be used as a run condition.
+	 */
+
+	/* Check the left side of the OpExpr */
+	var1 = linitial(opexpr->args);
+	if (IsA(var1, Var) && var1->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var1->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		if (find_window_run_conditions(subquery, rte, rti, tle->resno, wfunc,
+									   opexpr, true, &keep_original))
+			return keep_original;
+	}
+
+	/* and check the right side */
+	var2 = lsecond(opexpr->args);
+	if (IsA(var2, Var) && var2->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var2->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		if (find_window_run_conditions(subquery, rte, rti, tle->resno, wfunc,
+									   opexpr, false, &keep_original))
+			return keep_original;
+	}
+
+	return true;
+}
+
 /*
  * set_subquery_pathlist
  *		Generate SubqueryScan access paths for a subquery RTE
@@ -2245,19 +2524,29 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		foreach(l, rel->baserestrictinfo)
 		{
 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+			Node	   *clause = (Node *) rinfo->clause;
 
 			if (!rinfo->pseudoconstant &&
 				qual_is_pushdown_safe(subquery, rti, rinfo, &safetyInfo))
 			{
-				Node	   *clause = (Node *) rinfo->clause;
-
 				/* Push it down */
 				subquery_push_qual(subquery, rte, rti, clause);
 			}
 			else
 			{
-				/* Keep it in the upper query */
-				upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				/*
+				 * Since we can't push the qual down into the subquery, check
+				 * if it happens to reference a window function.  If so then
+				 * it might be useful to use for the WindowAgg's runCondition.
+				 */
+				if (check_and_push_window_quals(subquery, rte, rti, clause))
+				{
+					/*
+					 * It's not a suitable window run condition qual or it is,
+					 * but the original must also be kept in the upper query.
+					 */
+					upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				}
 			}
 		}
 		rel->baserestrictinfo = upperrestrictlist;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 179c87c671..707fc888e8 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -288,7 +288,8 @@ static WindowAgg *make_windowagg(List *tlist, Index winref,
 								 int frameOptions, Node *startOffset, Node *endOffset,
 								 Oid startInRangeFunc, Oid endInRangeFunc,
 								 Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
-								 Plan *lefttree);
+								 List *runCondition, List *runConditionOrig,
+								 bool usePassThrough, Plan *lefttree);
 static Group *make_group(List *tlist, List *qual, int numGroupCols,
 						 AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
 						 Plan *lefttree);
@@ -2642,6 +2643,9 @@ create_windowagg_plan(PlannerInfo *root, WindowAggPath *best_path)
 						  wc->inRangeColl,
 						  wc->inRangeAsc,
 						  wc->inRangeNullsFirst,
+						  wc->runCondition,
+						  wc->runConditionOrig,
+						  best_path->usepassthrough,
 						  subplan);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6527,7 +6531,8 @@ make_windowagg(List *tlist, Index winref,
 			   int frameOptions, Node *startOffset, Node *endOffset,
 			   Oid startInRangeFunc, Oid endInRangeFunc,
 			   Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
-			   Plan *lefttree)
+			   List *runCondition, List *runConditionOrig,
+			   bool usePassThrough, Plan *lefttree)
 {
 	WindowAgg  *node = makeNode(WindowAgg);
 	Plan	   *plan = &node->plan;
@@ -6544,11 +6549,14 @@ make_windowagg(List *tlist, Index winref,
 	node->frameOptions = frameOptions;
 	node->startOffset = startOffset;
 	node->endOffset = endOffset;
+	node->runCondition = runCondition;
+	node->runConditionOrig = runConditionOrig;
 	node->startInRangeFunc = startInRangeFunc;
 	node->endInRangeFunc = endInRangeFunc;
 	node->inRangeColl = inRangeColl;
 	node->inRangeAsc = inRangeAsc;
 	node->inRangeNullsFirst = inRangeNullsFirst;
+	node->usePassThrough = usePassThrough;
 
 	plan->targetlist = tlist;
 	plan->lefttree = lefttree;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 547fda20a2..65bcbf9df9 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4214,6 +4214,7 @@ create_one_window_path(PlannerInfo *root,
 		List	   *window_pathkeys;
 		int			presorted_keys;
 		bool		is_sorted;
+		bool		usepassthrough;
 
 		window_pathkeys = make_pathkeys_for_window(root,
 												   wc,
@@ -4277,10 +4278,25 @@ create_one_window_path(PlannerInfo *root,
 			window_target = output_target;
 		}
 
+		/*
+		 * Determine the pass-through mode for the WindowAgg.  This only has
+		 * an effect when the WindowAgg has a runCondition.  Every WindowAgg
+		 * apart from the top-level one must always run in pass-through
+		 * mode.  This means it will continue to pull tuples from the subplan
+		 * even when the runCondition is no longer true.  When there is no
+		 * PARTITION BY clause, the top-level WindowAgg can operate without
+		 * pass-through mode.  This allows the WindowAgg to return NULL to
+		 * indicate there are no more tuples.  However, when there is a
+		 * PARTITION BY clause this is not possible as we still need to
+		 * process other partitions.
+		 */
+		usepassthrough = wc->partitionClause != NIL ||
+						 foreach_current_index(l) < list_length(activeWindows) - 1;
+
 		path = (Path *)
 			create_windowagg_path(root, window_rel, path, window_target,
 								  wflists->windowFuncs[wc->winref],
-								  wc);
+								  wc, usepassthrough);
 	}
 
 	add_path(window_rel, path);
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index bf4c722c02..ecfa4836ae 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -71,6 +71,13 @@ typedef struct
 	double		num_exec;
 } fix_upper_expr_context;
 
+typedef struct
+{
+	PlannerInfo *root;
+	indexed_tlist *subplan_itlist;
+	int			newvarno;
+} fix_windowagg_cond_context;
+
 /*
  * Selecting the best alternative in an AlternativeSubPlan expression requires
  * estimating how many times that expression will be evaluated.  For an
@@ -172,6 +179,9 @@ static List *set_returning_clause_references(PlannerInfo *root,
 											 Plan *topplan,
 											 Index resultRelation,
 											 int rtoffset);
+static List *set_windowagg_runcondition_references(PlannerInfo *root,
+												   List *runcondition,
+												   Plan *plan);
 
 
 /*****************************************************************************
@@ -886,6 +896,18 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 			{
 				WindowAgg  *wplan = (WindowAgg *) plan;
 
+				/*
+				 * Adjust the WindowAgg's run conditions so that instead of
+				 * the expressions having to re-evaluate the WindowFuncs all
+				 * over again, swap the WindowFunc references out for Vars
+				 * that directly reference the Var in the slot that the result
+				 * of the WindowFunc evaluation will have already been stored
+				 * into when the executor evaluates the runCondition.
+				 */
+				wplan->runCondition = set_windowagg_runcondition_references(root,
+																			wplan->runCondition,
+																			(Plan *) wplan);
+
 				set_upper_references(root, plan, rtoffset);
 
 				/*
@@ -897,6 +919,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
 				wplan->endOffset =
 					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+				wplan->runCondition = fix_scan_list(root,
+													wplan->runCondition,
+													rtoffset,
+													NUM_EXEC_TLIST(plan));
+				wplan->runConditionOrig = fix_scan_list(root,
+														wplan->runConditionOrig,
+														rtoffset,
+														NUM_EXEC_TLIST(plan));
 			}
 			break;
 		case T_Result:
@@ -3050,6 +3080,78 @@ set_returning_clause_references(PlannerInfo *root,
 	return rlist;
 }
 
+/*
+ * fix_windowagg_condition_expr_mutator
+ *		Mutator function for replacing WindowFuncs with the corresponding Var
+ *		in the targetlist which references that WindowFunc.
+ */
+static Node *
+fix_windowagg_condition_expr_mutator(Node *node,
+									 fix_windowagg_cond_context *context)
+{
+	if (node == NULL)
+		return NULL;
+
+	if (IsA(node, WindowFunc))
+	{
+		Var		   *newvar;
+
+		newvar = search_indexed_tlist_for_non_var((Expr *) node,
+												  context->subplan_itlist,
+												  context->newvarno);
+		if (newvar)
+			return (Node *) newvar;
+		elog(ERROR, "WindowFunc not found in subplan target lists");
+	}
+
+	return expression_tree_mutator(node,
+								   fix_windowagg_condition_expr_mutator,
+								   (void *) context);
+}
+
+/*
+ * fix_windowagg_condition_expr
+ *		Converts references in 'runcondition' so that any WindowFunc
+ *		references are swapped out for a Var which references the matching
+ *		WindowFunc in 'subplan_itlist'.
+ */
+static List *
+fix_windowagg_condition_expr(PlannerInfo *root,
+							 List *runcondition,
+							 indexed_tlist *subplan_itlist)
+{
+	fix_windowagg_cond_context context;
+
+	context.root = root;
+	context.subplan_itlist = subplan_itlist;
+	context.newvarno = 0;
+
+	return (List *) fix_windowagg_condition_expr_mutator((Node *) runcondition,
+														 &context);
+}
+
+/*
+ * set_windowagg_runcondition_references
+ *		Converts references in 'runcondition' so that any WindowFunc
+ *		references are swapped out for a Var which references the matching
+ *		WindowFunc in 'plan' targetlist.
+ */
+static List *
+set_windowagg_runcondition_references(PlannerInfo *root,
+									  List *runcondition,
+									  Plan *plan)
+{
+	List	   *newlist;
+	indexed_tlist *itlist;
+
+	itlist = build_tlist_index(plan->targetlist);
+
+	newlist = fix_windowagg_condition_expr(root, runcondition, itlist);
+
+	pfree(itlist);
+
+	return newlist;
+}
 
 /*****************************************************************************
  *					QUERY DEPENDENCY MANAGEMENT
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 99df76b6b7..8da7179c53 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3398,7 +3398,8 @@ create_windowagg_path(PlannerInfo *root,
 					  Path *subpath,
 					  PathTarget *target,
 					  List *windowFuncs,
-					  WindowClause *winclause)
+					  WindowClause *winclause,
+					  bool usepassthrough)
 {
 	WindowAggPath *pathnode = makeNode(WindowAggPath);
 
@@ -3416,6 +3417,7 @@ create_windowagg_path(PlannerInfo *root,
 
 	pathnode->subpath = subpath;
 	pathnode->winclause = winclause;
+	pathnode->usepassthrough = usepassthrough;
 
 	/*
 	 * For costing purposes, assume that there are no redundant partitioning
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index 4a87114a4f..f100c9f4ec 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -24,6 +24,7 @@
 #include "nodes/supportnodes.h"
 #include "optimizer/optimizer.h"
 #include "utils/builtins.h"
+#include "utils/lsyscache.h"
 
 
 typedef struct
@@ -791,6 +792,7 @@ int8dec(PG_FUNCTION_ARGS)
 }
 
 
+
 /*
  * These functions are exactly like int8inc/int8dec but are used for
  * aggregates that count only non-null values.  Since the functions are
@@ -818,6 +820,49 @@ int8dec_any(PG_FUNCTION_ARGS)
 	return int8dec(fcinfo);
 }
 
+/*
+ * int8inc_support
+ *		prosupport function for int8inc() and int8inc_any()
+ */
+Datum
+int8inc_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+		MonotonicFunction monotonic = MONOTONICFUNC_NONE;
+		int			frameOptions = req->window_clause->frameOptions;
+
+		/* No ORDER BY clause then all rows are peers */
+		if (req->window_clause->orderClause == NIL)
+			monotonic = MONOTONICFUNC_BOTH;
+		else
+		{
+			/*
+			 * Otherwise take into account the frame options.  When the frame
+			 * bound is the start of the window then the resulting value can
+			 * never decrease, therefore is monotonically increasing
+			 */
+			if (frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING)
+				monotonic |= MONOTONICFUNC_INCREASING;
+
+			/*
+			 * Likewise, if the frame bound is the end of the window then the
+			 * resulting value can never increase.
+			 */
+			if (frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING)
+				monotonic |= MONOTONICFUNC_DECREASING;
+		}
+
+		req->monotonic = monotonic;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 
 Datum
 int8larger(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/adt/windowfuncs.c b/src/backend/utils/adt/windowfuncs.c
index 3e0cc9be1a..596564fa15 100644
--- a/src/backend/utils/adt/windowfuncs.c
+++ b/src/backend/utils/adt/windowfuncs.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "nodes/supportnodes.h"
 #include "utils/builtins.h"
 #include "windowapi.h"
 
@@ -88,6 +89,26 @@ window_row_number(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(curpos + 1);
 }
 
+/*
+ * window_row_number_support
+ *		prosupport function for window_row_number()
+ */
+Datum
+window_row_number_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* row_number() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
 
 /*
  * rank
@@ -110,6 +131,27 @@ window_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_rank_support
+ *		prosupport function for window_rank()
+ */
+Datum
+window_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* rank() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * dense_rank
  * Rank increases by 1 when key columns change.
@@ -130,6 +172,27 @@ window_dense_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_dense_rank_support
+ *		prosupport function for window_dense_rank()
+ */
+Datum
+window_dense_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* dense_rank() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * percent_rank
  * return fraction between 0 and 1 inclusive,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 01e1dd4d6d..b2066ebb4e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6647,11 +6647,16 @@
 # count has two forms: count(any) and count(*)
 { oid => '2147',
   descr => 'number of input rows for which the input expression is not null',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => 'any', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => 'any',
+  prosrc => 'aggregate_dummy' },
 { oid => '2803', descr => 'number of input rows',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => '', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'aggregate_dummy' },
+{ oid => '8802', descr => 'planner support for count run condition',
+  proname => 'int8inc_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'int8inc_support' },
 
 { oid => '2718',
   descr => 'population variance of bigint input values (square of the population standard deviation)',
@@ -10155,14 +10160,26 @@
 
 # SQL-spec window functions
 { oid => '3100', descr => 'row number within partition',
-  proname => 'row_number', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_row_number' },
+  proname => 'row_number', prosupport => 'window_row_number_support',
+  prokind => 'w', proisstrict => 'f',  prorettype => 'int8',
+  proargtypes => '', prosrc => 'window_row_number' },
+{ oid => '8799', descr => 'planner support for row_number run condition',
+  proname => 'window_row_number_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_row_number_support' },
 { oid => '3101', descr => 'integer rank with gaps',
-  proname => 'rank', prokind => 'w', proisstrict => 'f', prorettype => 'int8',
+  proname => 'rank', prosupport => 'window_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8',
   proargtypes => '', prosrc => 'window_rank' },
+{ oid => '8800', descr => 'planner support for rank run condition',
+  proname => 'window_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_rank_support' },
 { oid => '3102', descr => 'integer rank without gaps',
-  proname => 'dense_rank', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_dense_rank' },
+  proname => 'dense_rank', prosupport => 'window_dense_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'window_dense_rank' },
+{ oid => '8801', descr => 'planner support for dense rank run condition',
+  proname => 'window_dense_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_dense_rank_support' },
 { oid => '3103', descr => 'fractional rank within partition',
   proname => 'percent_rank', prokind => 'w', proisstrict => 'f',
   prorettype => 'float8', proargtypes => '', prosrc => 'window_percent_rank' },
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cbbcff81d2..70c88b345a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2406,6 +2406,16 @@ typedef struct AggState
 typedef struct WindowStatePerFuncData *WindowStatePerFunc;
 typedef struct WindowStatePerAggData *WindowStatePerAgg;
 
+/*
+ * WindowAggStatus -- Used to track the status of WindowAggState
+ */
+typedef enum WindowAggStatus
+{
+	WINDOWAGG_DONE,
+	WINDOWAGG_RUN,
+	WINDOWAGG_PASSTHROUGH
+} WindowAggStatus;
+
 typedef struct WindowAggState
 {
 	ScanState	ss;				/* its first field is NodeTag */
@@ -2432,6 +2442,7 @@ typedef struct WindowAggState
 	struct WindowObjectData *agg_winobj;	/* winobj for aggregate fetches */
 	int64		aggregatedbase; /* start row for current aggregates */
 	int64		aggregatedupto; /* rows before this one are aggregated */
+	WindowAggStatus		status; /* run status of WindowAggState */
 
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	ExprState  *startOffset;	/* expression for starting bound offset */
@@ -2458,8 +2469,15 @@ typedef struct WindowAggState
 	MemoryContext curaggcontext;	/* current aggregate's working data */
 	ExprContext *tmpcontext;	/* short-term evaluation context */
 
+	ExprState  *runcondition;	/* Condition which must remain true otherwise
+								 * execution of the WindowAgg will finish or
+								 * go into pass-through mode.  NULL when there
+								 * is no such condition. */
+
+	bool		use_pass_through;	/* When false, stop execution when
+									 * runcondition is no longer true.  Else
+									 * just stop evaluating window funcs. */
 	bool		all_first;		/* true if the scan is starting */
-	bool		all_done;		/* true if the scan is finished */
 	bool		partition_spooled;	/* true if all tuples in current partition
 									 * have been spooled into tuplestore */
 	bool		more_partitions;	/* true if there's more partitions after
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index e8f30367a4..d2601e117f 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -544,7 +544,8 @@ typedef enum NodeTag
 	T_SupportRequestSelectivity,	/* in nodes/supportnodes.h */
 	T_SupportRequestCost,		/* in nodes/supportnodes.h */
 	T_SupportRequestRows,		/* in nodes/supportnodes.h */
-	T_SupportRequestIndexCondition	/* in nodes/supportnodes.h */
+	T_SupportRequestIndexCondition, /* in nodes/supportnodes.h */
+	T_SupportRequestWFuncMonotonic	/* in nodes/supportnodes.h */
 } NodeTag;
 
 /*
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 6bf212b01a..4814525b53 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1402,6 +1402,8 @@ typedef struct WindowClause
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runCondition;	/* Evaluate WindowFuncs while this is true */
+	List	   *runConditionOrig;	/* EXPLAIN compatible version of above */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
 	Oid			inRangeColl;	/* collation for in_range tests */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 365000bdcd..e5962c9f7e 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1833,6 +1833,8 @@ typedef struct WindowAggPath
 	Path		path;
 	Path	   *subpath;		/* path representing input source */
 	WindowClause *winclause;	/* WindowClause we'll be using */
+	bool		usepassthrough;	/* use "pass-through" mode when winclause's
+								 * runCondition becomes false */
 } WindowAggPath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 50ef3dda05..98dd79a6bd 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -914,12 +914,18 @@ typedef struct WindowAgg
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runCondition;	/* Conditions that must remain true in order
+								 * for WindowFunc evaluation to continue */
+	List	   *runConditionOrig;	/* runCondition for display in EXPLAIN */
 	/* these fields are used with RANGE offset PRECEDING/FOLLOWING: */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
 	Oid			inRangeColl;	/* collation for in_range tests */
 	bool		inRangeAsc;		/* use ASC sort order for in_range tests? */
 	bool		inRangeNullsFirst;	/* nulls sort first for in_range tests? */
+	bool		usePassThrough;	/* Go into pass-through mode when runCondition
+								 * is no longer true.  When false, we just
+								 * stop execution */
 } WindowAgg;
 
 /* ----------------
@@ -1312,4 +1318,21 @@ typedef struct PlanInvalItem
 	uint32		hashValue;		/* hash value of object's cache lookup key */
 } PlanInvalItem;
 
+/*
+ * MonotonicFunction
+ *
+ * Allows the planner to track monotonic properties of functions.  A function
+ * is monotonically increasing if a subsequent call cannot yield a lower value
+ * than the previous call.  A monotonically decreasing function cannot yield a
+ * higher value on subsequent calls, and a function which is both must return
+ * the same value on each call.
+ */
+typedef enum MonotonicFunction
+{
+	MONOTONICFUNC_NONE = 0,
+	MONOTONICFUNC_INCREASING = (1 << 0),
+	MONOTONICFUNC_DECREASING = (1 << 1),
+	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING
+} MonotonicFunction;
+
 #endif							/* PLANNODES_H */
diff --git a/src/include/nodes/supportnodes.h b/src/include/nodes/supportnodes.h
index 88b61b3ab3..bdd43fc614 100644
--- a/src/include/nodes/supportnodes.h
+++ b/src/include/nodes/supportnodes.h
@@ -33,12 +33,12 @@
 #ifndef SUPPORTNODES_H
 #define SUPPORTNODES_H
 
-#include "nodes/primnodes.h"
+#include "nodes/plannodes.h"
 
 struct PlannerInfo;				/* avoid including pathnodes.h here */
 struct IndexOptInfo;
 struct SpecialJoinInfo;
-
+struct WindowClause;
 
 /*
  * The Simplify request allows the support function to perform plan-time
@@ -239,4 +239,64 @@ typedef struct SupportRequestIndexCondition
 								 * equivalent of the function call */
 } SupportRequestIndexCondition;
 
+/* ----------
+ * To allow more efficient execution of any monotonically increasing and/or
+ * monotonically decreasing window functions, we support calling the window
+ * function's prosupport function passing along this struct whenever the
+ * planner sees an OpExpr qual directly reference a window function in a
+ * subquery.  When the planner encounters this, we populate this struct and
+ * pass it along to the window function's prosupport function so that it can
+ * evaluate if the given WindowFunc is;
+ *
+ * a) monotonically increasing, or
+ * b) monotonically decreasing, or
+ * c) both monotonically increasing and decreasing, or
+ * d) none of the above.
+ *
+ * A function that is monotonically increasing can never return a value that
+ * is lower than a value returned in a "previous call".  A monotonically
+ * decreasing function can never return a value higher than a value returned
+ * in a previous call.  A function that is both must return the same value
+ * each time.
+ *
+ * We define "previous call" to mean a previous call to the same WindowFunc
+ * struct in the same window partition.
+ *
+ * row_number() is an example of a monotonically increasing function.  The
+ * return value will be reset back to 1 in each new partition.  An example of
+ * a monotonically increasing and decreasing function is COUNT(*) OVER ().
+ * Since there is no ORDER BY clause in this example, all rows in the
+ * partition are peers and all rows within the partition will be within the
+ * frame bound.  Likewise for COUNT(*) OVER(ORDER BY a ROWS BETWEEN UNBOUNDED
+ * PRECEDING AND UNBOUNDED FOLLOWING).
+ *
+ * COUNT(*) OVER (ORDER BY a ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
+ * is an example of a monotonically decreasing function.
+ *
+ * Implementations must only concern themselves with the given WindowFunc
+ * being monotonic in a single partition.
+ *
+ * Inputs:
+ *	'window_func' is the pointer to the window function being called.
+ *
+ *	'window_clause' pointer to the WindowClause data.  Support functions can
+ *	use this to check frame bounds, etc.
+ *
+ * Outputs:
+ *	'monotonic' the resulting MonotonicFunction value for the given input
+ *	window function and window clause.
+ * ----------
+ */
+typedef struct SupportRequestWFuncMonotonic
+{
+	NodeTag		type;
+
+	/* Input fields: */
+	WindowFunc *window_func;	/* Pointer to the window function data */
+	struct WindowClause *window_clause; /* Pointer to the window clause data */
+
+	/* Output fields: */
+	MonotonicFunction monotonic;
+} SupportRequestWFuncMonotonic;
+
 #endif							/* SUPPORTNODES_H */
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 6eca547af8..c449b8ea88 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -245,7 +245,8 @@ extern WindowAggPath *create_windowagg_path(PlannerInfo *root,
 											Path *subpath,
 											PathTarget *target,
 											List *windowFuncs,
-											WindowClause *winclause);
+											WindowClause *winclause,
+											bool usepassthrough);
 extern SetOpPath *create_setop_path(PlannerInfo *root,
 									RelOptInfo *rel,
 									Path *subpath,
diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out
index bb9ff7f07b..4b22080d13 100644
--- a/src/test/regress/expected/window.out
+++ b/src/test/regress/expected/window.out
@@ -3336,6 +3336,365 @@ WHERE depname = 'sales';
                            ->  Seq Scan on empsalary
 (9 rows)
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                  QUERY PLAN                  
+----------------------------------------------
+ WindowAgg
+   Run Condition: (row_number() OVER (?) < 3)
+   ->  Sort
+         Sort Key: empsalary.empno
+         ->  Seq Scan on empsalary
+(5 rows)
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+               QUERY PLAN                
+-----------------------------------------
+ WindowAgg
+   Run Condition: (rank() OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+ empno | salary | r 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 2
+    11 |   5200 | 2
+(3 rows)
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Run Condition: (dense_rank() OVER (?) <= 1)
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+ empno | salary | dr 
+-------+--------+----
+     8 |   6000 |  1
+(1 row)
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                       QUERY PLAN                        
+---------------------------------------------------------
+ WindowAgg
+   Run Condition: (count(empsalary.empno) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) >= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER () c
+   FROM empsalary) emp
+WHERE 11 <= c;
+                 QUERY PLAN                 
+--------------------------------------------
+ WindowAgg
+   Run Condition: (11 <= count(*) OVER (?))
+   ->  Seq Scan on empsalary
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Run Condition: (dense_rank() OVER (?) <= 1)
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+-- Ensure we get a run condition when there's a PARTITION BY clause
+-- also ensure that the original qual remains as a filter.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.rn < 3)
+   ->  WindowAgg
+         Run Condition: (row_number() OVER (?) < 3)
+         ->  Sort
+               Sort Key: empsalary.depname, empsalary.empno
+               ->  Seq Scan on empsalary
+(7 rows)
+
+-- and ensure we get the correct results from the above plan
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+ empno |  depname  | rn 
+-------+-----------+----
+     7 | develop   |  1
+     8 | develop   |  2
+     2 | personnel |  1
+     5 | personnel |  2
+     1 | sales     |  1
+     3 | sales     |  2
+(6 rows)
+
+-- likewise with count(empno) instead of row_number()
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.c <= 3)
+   ->  WindowAgg
+         Run Condition: (count(empsalary.empno) OVER (?) <= 3)
+         ->  Sort
+               Sort Key: empsalary.depname, empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+-- and again, check the results are what we expect.
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno |  depname  | salary | c 
+-------+-----------+--------+---
+     8 | develop   |   6000 | 1
+    10 | develop   |   5200 | 3
+    11 | develop   |   5200 | 3
+     2 | personnel |   3900 | 1
+     5 | personnel |   3500 | 2
+     1 | sales     |   5000 | 1
+     4 | sales     |   4800 | 3
+     3 | sales     |   4800 | 3
+(8 rows)
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+-- Ensure we don't push down when the frame options show that the window
+-- function is not monotonically increasing
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                  QUERY PLAN                   
+-----------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.c <= 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(6 rows)
+
+-- Ensure we don't push down when the window function's monotonic properties
+-- don't match that of the clauses.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+                QUERY PLAN                
+------------------------------------------
+ Subquery Scan on emp
+   Filter: (3 <= emp.c)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary
+               ->  Seq Scan on empsalary
+(6 rows)
+
+-- Ensure we don't pushdown when there are multiple window clauses to evaluate
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY empno DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.empno DESC
+               ->  WindowAgg
+                     Run Condition: (dense_rank() OVER (?) <= 1)
+                     ->  Sort
+                           Sort Key: empsalary.salary DESC
+                           ->  Seq Scan on empsalary
+(10 rows)
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
diff --git a/src/test/regress/sql/window.sql b/src/test/regress/sql/window.sql
index 41a8e0d152..83db8a4201 100644
--- a/src/test/regress/sql/window.sql
+++ b/src/test/regress/sql/window.sql
@@ -988,6 +988,192 @@ SELECT * FROM
    FROM empsalary) emp
 WHERE depname = 'sales';
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER () c
+   FROM empsalary) emp
+WHERE 11 <= c;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+-- Ensure we get a run condition when there's a PARTITION BY clause
+-- also ensure that the original qual remains as a filter.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- and ensure we get the correct results from the above plan
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- likewise with count(empno) instead of row_number()
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+-- and again, check the results are what we expect.
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+
+-- Ensure we don't push down when the frame options show that the window
+-- function is not monotonically increasing
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+-- Ensure we don't push down when the window function's monotonic properties
+-- don't match that of the clauses.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+
+-- Ensure we don't pushdown when there are multiple window clauses to evaluate
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY empno DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
-- 
2.32.0
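
A side note for readers following the patch above: the sketch below is a minimal, standalone C model of the planner-side idea, not code taken from the patch. It shows how frame bounds could map to the monotonic flags when an ORDER BY is present, and which comparison operators could then act as a run condition. The DEMO_* names, their bit values, and both demo_* functions are made up for illustration; only the MonotonicFunction enum mirrors the patch. It also assumes the window function sits on the left-hand side of the operator.

```c
#include <stdbool.h>

/* Mirrors the MonotonicFunction enum the patch adds to plannodes.h. */
typedef enum MonotonicFunction
{
	MONOTONICFUNC_NONE = 0,
	MONOTONICFUNC_INCREASING = (1 << 0),
	MONOTONICFUNC_DECREASING = (1 << 1),
	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING
} MonotonicFunction;

/* Hypothetical stand-ins for the FRAMEOPTION_* bits; values are placeholders. */
#define DEMO_START_UNBOUNDED_PRECEDING	(1 << 0)
#define DEMO_END_UNBOUNDED_FOLLOWING	(1 << 1)

/* Hypothetical comparison kinds for a qual of the form "wfunc OP const". */
typedef enum DemoCmp
{
	DEMO_LT,
	DEMO_LE,
	DEMO_GT,
	DEMO_GE
} DemoCmp;

/* Derive count()'s monotonic properties from the frame bounds. */
int
demo_count_monotonic(int frameOptions)
{
	int			monotonic = MONOTONICFUNC_NONE;

	/* Frame starts at the partition start: the count cannot shrink. */
	if (frameOptions & DEMO_START_UNBOUNDED_PRECEDING)
		monotonic |= MONOTONICFUNC_INCREASING;

	/* Frame ends at the partition end: the count cannot grow. */
	if (frameOptions & DEMO_END_UNBOUNDED_FOLLOWING)
		monotonic |= MONOTONICFUNC_DECREASING;

	return monotonic;
}

/* Can "wfunc OP const" be used to decide when to stop? */
bool
demo_qual_is_run_condition(int monotonic, DemoCmp cmp)
{
	/* An upper bound only stays false once the value can never drop. */
	if (cmp == DEMO_LT || cmp == DEMO_LE)
		return (monotonic & MONOTONICFUNC_INCREASING) != 0;

	/* A lower bound only stays false once the value can never rise. */
	return (monotonic & MONOTONICFUNC_DECREASING) != 0;
}
```

Under this model, count(*) OVER (ORDER BY salary) uses the default frame, which starts at UNBOUNDED PRECEDING, so c <= 3 qualifies but c >= 3 does not. That lines up with the EXPLAIN output expected by the regression tests in the patch.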

#20

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: David Rowley (#19)

Re: Window Function "Run Conditions"

Hi,

On 2022-03-29 15:11:52 +1300, David Rowley wrote:

One thing which I'm not sure about with the patch is how I'm handling
the evaluation of the runcondition in nodeWindowAgg.c. Instead of
having ExecQual() evaluate an OpExpr such as "row_number() over (...)
<= 10", I'm replacing the WindowFunc with the Var in the targetlist
that corresponds to the given WindowFunc. This saves having to double
evaluate the WindowFunc. Instead, the value of the Var can be taken
directly from the slot. I don't know of anywhere else we do things
quite like that. The runcondition is slightly similar to HAVING
clauses, but HAVING clauses don't work this way.

Don't HAVING clauses actually work pretty similar? Yes, they don't have a Var,
but for expression evaluation purposes an Aggref is nearly the same as a plain
Var:

    EEO_CASE(EEOP_INNER_VAR)
    {
        int attnum = op->d.var.attnum;

        /*
         * Since we already extracted all referenced columns from the
         * tuple with a FETCHSOME step, we can just grab the value
         * directly out of the slot's decomposed-data arrays. But let's
         * have an Assert to check that that did happen.
         */
        Assert(attnum >= 0 && attnum < innerslot->tts_nvalid);
        *op->resvalue = innerslot->tts_values[attnum];
        *op->resnull = innerslot->tts_isnull[attnum];

        EEO_NEXT();
    }

vs

    EEO_CASE(EEOP_AGGREF)
    {
        /*
         * Returns a Datum whose value is the precomputed aggregate value
         * found in the given expression context.
         */
        int aggno = op->d.aggref.aggno;

        Assert(econtext->ecxt_aggvalues != NULL);

        *op->resvalue = econtext->ecxt_aggvalues[aggno];
        *op->resnull = econtext->ecxt_aggnulls[aggno];

        EEO_NEXT();
    }

specifically we don't re-evaluate expressions?

This is afaics slightly cheaper than referencing a variable in a slot.

Greetings,

Andres Freund

#21

David Rowley

dgrowleyml@gmail.com

almost 4 years ago

In reply to: Andres Freund (#20)

1 attachment(s)

Re: Window Function "Run Conditions"

Thanks for having a look at this.

On Wed, 30 Mar 2022 at 11:16, Andres Freund <andres@anarazel.de> wrote:

On 2022-03-29 15:11:52 +1300, David Rowley wrote:

One thing which I'm not sure about with the patch is how I'm handling
the evaluation of the runcondition in nodeWindowAgg.c. Instead of
having ExecQual() evaluate an OpExpr such as "row_number() over (...)
<= 10", I'm replacing the WindowFunc with the Var in the targetlist
that corresponds to the given WindowFunc. This saves having to double
evaluate the WindowFunc. Instead, the value of the Var can be taken
directly from the slot. I don't know of anywhere else we do things
quite like that. The runcondition is slightly similar to HAVING
clauses, but HAVING clauses don't work this way.

Don't HAVING clauses actually work pretty similar? Yes, they don't have a Var,
but for expression evaluation purposes an Aggref is nearly the same as a plain
Var:

EEO_CASE(EEOP_INNER_VAR)
{
int attnum = op->d.var.attnum;

/*
* Since we already extracted all referenced columns from the
* tuple with a FETCHSOME step, we can just grab the value
* directly out of the slot's decomposed-data arrays. But let's
* have an Assert to check that that did happen.
*/
Assert(attnum >= 0 && attnum < innerslot->tts_nvalid);
*op->resvalue = innerslot->tts_values[attnum];
*op->resnull = innerslot->tts_isnull[attnum];

EEO_NEXT();
}
vs
EEO_CASE(EEOP_AGGREF)
{
/*
* Returns a Datum whose value is the precomputed aggregate value
* found in the given expression context.
*/
int aggno = op->d.aggref.aggno;

Assert(econtext->ecxt_aggvalues != NULL);

*op->resvalue = econtext->ecxt_aggvalues[aggno];
*op->resnull = econtext->ecxt_aggnulls[aggno];

EEO_NEXT();
}

specifically we don't re-evaluate expressions?

Thanks for highlighting the similarities. I'm feeling better about the
choice now.
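
For readers following along, the idea can be modelled outside of PostgreSQL. This is only a sketch under stated assumptions: ToySlot, toy_row_number(), toy_runcondition() and toy_scan() are made-up names, not executor code. It shows the run condition reading the window function's value back out of the slot it was just stored in, so the function is evaluated once per row rather than twice.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a projected tuple slot; not TupleTableSlot. */
typedef struct ToySlot
{
    long        values[4];
    int         nvalid;
} ToySlot;

/* Counts how often the toy window function is actually evaluated. */
static int  toy_wfunc_calls = 0;

/* Toy row_number(): one higher than the previous value in the partition. */
static long
toy_row_number(long *state)
{
    toy_wfunc_calls++;
    return ++(*state);
}

/*
 * Toy run condition of the form "Var <= const".  It reads attribute
 * 'attnum' from the slot instead of calling the window function again.
 */
static bool
toy_runcondition(const ToySlot *slot, int attnum, long limit)
{
    assert(attnum >= 0 && attnum < slot->nvalid);
    return slot->values[attnum] <= limit;
}

/* Emit rows while the run condition holds; returns the number emitted. */
static long
toy_scan(long nrows, long limit)
{
    long        state = 0;
    long        emitted = 0;
    ToySlot     slot = {{0}, 1};

    for (long i = 0; i < nrows; i++)
    {
        /* evaluate the window function once and store it in the slot */
        slot.values[0] = toy_row_number(&state);

        /* the run condition only looks at the stored value */
        if (!toy_runcondition(&slot, 0, limit))
            break;
        emitted++;
    }
    return emitted;
}
```

Calling toy_scan(100, 10) emits 10 rows and evaluates the toy function 11 times: once per emitted row, plus once for the row that makes the condition false.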

I've made another pass over the patch and updated a few comments and
made a small code change to delay the initialisation of a variable.

I'm pretty happy with this now. If anyone wants to have a look at
this, could they please do so, or let me know that they intend to,
within the next 24 hours? Otherwise I plan to move into commit mode
with it.

This is afaics slightly cheaper than referencing a variable in a slot.

I guess you must mean cheaper because it means there will be no
EEOP_*_FETCHSOME step? Otherwise it seems a fairly similar amount of
work.
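
As an aside for anyone reviewing the attached patch, the operator handling in find_window_run_conditions() can be summarised with a small standalone model. This is a sketch, not the patch's code: ToyMonotonic, ToyCmpOp, ToyRunCond and toy_derive_run_condition() are hypothetical names, and btree strategy lookup is replaced by a plain enum. It follows the rules in the patch's comments: < and <= are usable as-is for increasing functions when the window function is on the left, > and >= for decreasing ones, and = is turned into <= or >= while the original equality qual is kept.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; loosely modelled on the patch, not PostgreSQL definitions. */
typedef enum
{
    TOY_MONO_NONE = 0,
    TOY_MONO_INCREASING = 1,
    TOY_MONO_DECREASING = 2,
    TOY_MONO_BOTH = 3
} ToyMonotonic;

typedef enum { TOY_LT, TOY_LE, TOY_EQ, TOY_GE, TOY_GT, TOY_NOOP } ToyCmpOp;

typedef struct ToyRunCond
{
    bool        usable;         /* can a run condition be built? */
    ToyCmpOp    run_op;         /* operator to use in the run condition */
    bool        keep_original;  /* must the upper query keep the qual? */
} ToyRunCond;

static ToyRunCond
toy_derive_run_condition(ToyMonotonic mono, ToyCmpOp op,
                         bool wfunc_left, bool has_partition_by)
{
    ToyRunCond  rc = {false, TOY_NOOP, true};

    assert(op != TOY_NOOP);

    if (mono == TOY_MONO_NONE)
        return rc;

    switch (op)
    {
        case TOY_LT:
        case TOY_LE:
            /* wfunc <= const needs increasing; const <= wfunc, decreasing */
            if ((wfunc_left && (mono & TOY_MONO_INCREASING)) ||
                (!wfunc_left && (mono & TOY_MONO_DECREASING)))
            {
                rc.usable = true;
                rc.run_op = op;
                rc.keep_original = has_partition_by;
            }
            break;
        case TOY_GT:
        case TOY_GE:
            /* wfunc >= const needs decreasing; const >= wfunc, increasing */
            if ((wfunc_left && (mono & TOY_MONO_DECREASING)) ||
                (!wfunc_left && (mono & TOY_MONO_INCREASING)))
            {
                rc.usable = true;
                rc.run_op = op;
                rc.keep_original = has_partition_by;
            }
            break;
        case TOY_EQ:
            rc.usable = true;
            if (mono == TOY_MONO_BOTH)
            {
                /* the value never changes, so equality can be used as-is */
                rc.run_op = op;
                rc.keep_original = has_partition_by;
            }
            else
            {
                /* stop once we pass the value; equality must still filter */
                if (mono & TOY_MONO_INCREASING)
                    rc.run_op = wfunc_left ? TOY_LE : TOY_GE;
                else
                    rc.run_op = wfunc_left ? TOY_GE : TOY_LE;
                rc.keep_original = true;
            }
            break;
        default:
            break;
    }
    return rc;
}
```

For example, toy_derive_run_condition(TOY_MONO_INCREASING, TOY_EQ, true, false) yields a usable run condition with TOY_LE and keep_original set, which corresponds to the rn = 10 case becoming rn <= 10 with the equality filter left in place.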

David

Attachments:

v5-0001-Teach-planner-and-executor-about-monotonic-window.patch (text/plain; charset=US-ASCII)

From f5dab05872f0f06c05fae1ee2024285b89de8c44 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Tue, 5 Apr 2022 12:00:40 +1200
Subject: [PATCH v5] Teach planner and executor about monotonic window funcs

Window functions such as row_number() always return a value higher than
the previous one within any given partition.  If a query were to only
request the first few row numbers, then traditionally we would continue
evaluating the WindowAgg node until all tuples are exhausted.  However, if
someone, say, only wanted all row numbers <= 10, then we could just stop
once we get to row number 11.

Here we implement means to do just that.  This is done by way of adding a
pg_proc.prosupport function to various of the built-in window functions
and adding supporting code to allow the support function to inform the
planner if the function is monotonically increasing, monotonically
decreasing, both or neither.  The planner is then able to make use of that
information and possibly allow the executor to short-circuit execution by
way of adding a "run condition" to the WindowAgg to instruct it to run
while this condition is true and stop when it becomes false.

When there are multiple WindowAgg nodes to evaluate then this complicates
the situation as if we were to stop execution on a lower-level WindowAgg
then an upper-level WindowAgg may not receive all of the tuples it should.
This may lead to incorrect query results.  To get around this problem all
non-top-level WindowAggs go into "passthrough" mode when their
runcondition is no longer true.  This means that they continue to pull
tuples from their subnode but no longer evaluate their window functions.
Only the top-level WindowAgg node may stop when the runcondition is no
longer true.
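
The status handling described above can be modelled in isolation. The following is a minimal sketch, assuming a single partition and a run condition of the form row_number() <= limit; ToyWindowAgg, toy_init(), toy_next() and toy_drain() are hypothetical names and partition changes are not modelled.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TOY_RUN, TOY_PASSTHROUGH, TOY_DONE } ToyStatus;

/* Hypothetical stand-in for the WindowAgg executor state. */
typedef struct ToyWindowAgg
{
    ToyStatus   status;
    bool        use_pass_through;
    long        limit;          /* run condition: row_number <= limit */
    long        ninput;         /* rows the subnode can supply */
    long        pulled;         /* rows pulled from the subnode so far */
    long        evaluated;      /* window function evaluations so far */
    long        row_number;     /* last computed window function value */
} ToyWindowAgg;

static void
toy_init(ToyWindowAgg *w, long ninput, long limit, bool use_pass_through)
{
    assert(ninput >= 0 && limit >= 0);
    w->status = TOY_RUN;
    w->use_pass_through = use_pass_through;
    w->limit = limit;
    w->ninput = ninput;
    w->pulled = 0;
    w->evaluated = 0;
    w->row_number = 0;
}

/* Returns true if a tuple was produced, false when there are no more. */
static bool
toy_next(ToyWindowAgg *w)
{
    if (w->status == TOY_DONE)
        return false;

    /* subnode exhausted? */
    if (w->pulled >= w->ninput)
    {
        w->status = TOY_DONE;
        return false;
    }
    w->pulled++;

    /* window functions are only evaluated while in run mode */
    if (w->status == TOY_RUN)
    {
        w->row_number++;
        w->evaluated++;
    }

    /* check the run condition against the value just computed */
    if (w->status == TOY_RUN && !(w->row_number <= w->limit))
    {
        if (w->use_pass_through)
            w->status = TOY_PASSTHROUGH;
        else
        {
            w->status = TOY_DONE;
            return false;
        }
    }
    return true;
}

/* Pull tuples until exhausted; returns how many were produced. */
static long
toy_drain(ToyWindowAgg *w)
{
    long        n = 0;

    while (toy_next(w))
        n++;
    return n;
}
```

With 1000 input rows and a limit of 10, the model without pass-through produces 10 tuples after pulling 11 from its subnode.  With pass-through it produces all 1000 tuples but evaluates the window function only 11 times, leaving the filtering of unwanted rows to the qual kept in the upper query.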

Here we add prosupport functions to allow this to work for row_number(),
rank(), dense_rank(), count(*) and count(expr).  It appears technically
possible to do the same for min() and max(), however, it seems unlikely to
be useful enough, so that's not done here.

Author: David Rowley
Reviewed-by: Andy Fan, Zhihong Yu
Discussion: https://postgr.es/m/CAApHDvqvp3At8++yF8ij06sdcoo1S_b2YoaT9D4Nf+MObzsrLQ@mail.gmail.com
---
 src/backend/commands/explain.c          |   4 +
 src/backend/executor/nodeWindowAgg.c    |  95 +++++--
 src/backend/nodes/copyfuncs.c           |   5 +
 src/backend/nodes/equalfuncs.c          |   2 +
 src/backend/nodes/outfuncs.c            |   6 +
 src/backend/nodes/readfuncs.c           |   5 +
 src/backend/optimizer/path/allpaths.c   | 293 ++++++++++++++++++-
 src/backend/optimizer/plan/createplan.c |  12 +-
 src/backend/optimizer/plan/planner.c    |  18 +-
 src/backend/optimizer/plan/setrefs.c    | 102 +++++++
 src/backend/optimizer/util/pathnode.c   |   4 +-
 src/backend/utils/adt/int8.c            |  44 +++
 src/backend/utils/adt/windowfuncs.c     |  63 +++++
 src/include/catalog/pg_proc.dat         |  35 ++-
 src/include/nodes/execnodes.h           |  20 +-
 src/include/nodes/nodes.h               |   3 +-
 src/include/nodes/parsenodes.h          |   2 +
 src/include/nodes/pathnodes.h           |   2 +
 src/include/nodes/plannodes.h           |  22 ++
 src/include/nodes/supportnodes.h        |  64 ++++-
 src/include/optimizer/pathnode.h        |   3 +-
 src/test/regress/expected/window.out    | 359 ++++++++++++++++++++++++
 src/test/regress/sql/window.sql         | 186 ++++++++++++
 23 files changed, 1306 insertions(+), 43 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 1e5701b8eb..78e6f6a5ed 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1988,6 +1988,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_WindowAgg:
+			show_upper_qual(((WindowAgg *) plan)->runConditionOrig,
+							"Run Condition", planstate, ancestors, es);
+			break;
 		case T_Group:
 			show_group_keys(castNode(GroupState, planstate), ancestors, es);
 			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 08ce05ca5a..869dcd74df 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -2023,13 +2023,14 @@ static TupleTableSlot *
 ExecWindowAgg(PlanState *pstate)
 {
 	WindowAggState *winstate = castNode(WindowAggState, pstate);
+	TupleTableSlot *slot;
 	ExprContext *econtext;
 	int			i;
 	int			numfuncs;
 
 	CHECK_FOR_INTERRUPTS();
 
-	if (winstate->all_done)
+	if (winstate->status == WINDOWAGG_DONE)
 		return NULL;
 
 	/*
@@ -2131,10 +2132,14 @@ ExecWindowAgg(PlanState *pstate)
 		{
 			begin_partition(winstate);
 			Assert(winstate->spooled_rows > 0);
+
+			/* Come out of pass-through mode when changing partition */
+			winstate->status = WINDOWAGG_RUN;
 		}
 		else
 		{
-			winstate->all_done = true;
+			/* No further partitions?  We're done */
+			winstate->status = WINDOWAGG_DONE;
 			return NULL;
 		}
 	}
@@ -2185,26 +2190,30 @@ ExecWindowAgg(PlanState *pstate)
 			elog(ERROR, "unexpected end of tuplestore");
 	}
 
-	/*
-	 * Evaluate true window functions
-	 */
-	numfuncs = winstate->numfuncs;
-	for (i = 0; i < numfuncs; i++)
+	/* don't evaluate the window functions when we're in pass-through mode */
+	if (winstate->status == WINDOWAGG_RUN)
 	{
-		WindowStatePerFunc perfuncstate = &(winstate->perfunc[i]);
+		/*
+		 * Evaluate true window functions
+		 */
+		numfuncs = winstate->numfuncs;
+		for (i = 0; i < numfuncs; i++)
+		{
+			WindowStatePerFunc perfuncstate = &(winstate->perfunc[i]);
 
-		if (perfuncstate->plain_agg)
-			continue;
-		eval_windowfunction(winstate, perfuncstate,
-							&(econtext->ecxt_aggvalues[perfuncstate->wfuncstate->wfuncno]),
-							&(econtext->ecxt_aggnulls[perfuncstate->wfuncstate->wfuncno]));
-	}
+			if (perfuncstate->plain_agg)
+				continue;
+			eval_windowfunction(winstate, perfuncstate,
+								&(econtext->ecxt_aggvalues[perfuncstate->wfuncstate->wfuncno]),
+								&(econtext->ecxt_aggnulls[perfuncstate->wfuncstate->wfuncno]));
+		}
 
-	/*
-	 * Evaluate aggregates
-	 */
-	if (winstate->numaggs > 0)
-		eval_windowaggregates(winstate);
+		/*
+		 * Evaluate aggregates
+		 */
+		if (winstate->numaggs > 0)
+			eval_windowaggregates(winstate);
+	}
 
 	/*
 	 * If we have created auxiliary read pointers for the frame or group
@@ -2235,7 +2244,33 @@ ExecWindowAgg(PlanState *pstate)
 	 */
 	econtext->ecxt_outertuple = winstate->ss.ss_ScanTupleSlot;
 
-	return ExecProject(winstate->ss.ps.ps_ProjInfo);
+	slot = ExecProject(winstate->ss.ps.ps_ProjInfo);
+
+	/*
+	 * Now evaluate the run condition to see if we need to continue evaluating
+	 * window function or if we can stop completely.
+	 */
+	econtext->ecxt_scantuple = slot;
+	if (winstate->status == WINDOWAGG_RUN &&
+		!ExecQual(winstate->runcondition, econtext))
+	{
+		/*
+		 * When the runcondition is no longer true we can either abort
+		 * execution or go into pass-through mode so that we continue to pull
+		 * tuples from our subnode but just skip evaluation of the window
+		 * functions.  Which of these we perform depends on the value of the
+		 * use_pass_through field.
+		 */
+		if (winstate->use_pass_through)
+			winstate->status = WINDOWAGG_PASSTHROUGH;
+		else
+		{
+			winstate->status = WINDOWAGG_DONE;
+			return NULL;
+		}
+	}
+
+	return slot;
 }
 
 /* -----------------
@@ -2307,6 +2342,21 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 	Assert(node->plan.qual == NIL);
 	winstate->ss.ps.qual = NULL;
 
+	/*
+	 * Setup the run condition, if we received one from the query planner.
+	 * When set, this may allow us to move into pass-through mode so that we
+	 * don't have to perform any further evaluation of WindowFuncs in the
+	 * current partition or possibly stop returning tuples altogether when all
+	 * tuples are in the same partition.
+	 */
+	if (node->runCondition != NIL)
+		winstate->runcondition = ExecInitQual(node->runCondition,
+											  (PlanState *) winstate);
+	else
+		winstate->runcondition = NULL;
+
+	winstate->use_pass_through = node->usePassThrough;
+
 	/*
 	 * initialize child nodes
 	 */
@@ -2500,6 +2550,9 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 		winstate->agg_winobj = agg_winobj;
 	}
 
+	/* Set the status to running */
+	winstate->status = WINDOWAGG_RUN;
+
 	/* copy frame options to state node for easy access */
 	winstate->frameOptions = frameOptions;
 
@@ -2579,7 +2632,7 @@ ExecReScanWindowAgg(WindowAggState *node)
 	PlanState  *outerPlan = outerPlanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 
-	node->all_done = false;
+	node->status = WINDOWAGG_RUN;
 	node->all_first = true;
 
 	/* release tuplestore et al */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 1a74122f13..061bd05075 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1103,11 +1103,14 @@ _copyWindowAgg(const WindowAgg *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runCondition);
+	COPY_NODE_FIELD(runConditionOrig);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
 	COPY_SCALAR_FIELD(inRangeAsc);
 	COPY_SCALAR_FIELD(inRangeNullsFirst);
+	COPY_SCALAR_FIELD(usePassThrough);
 
 	return newnode;
 }
@@ -3037,6 +3040,8 @@ _copyWindowClause(const WindowClause *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runCondition);
+	COPY_NODE_FIELD(runConditionOrig);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 5c21850c97..8c02d14591 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -3231,6 +3231,8 @@ _equalWindowClause(const WindowClause *a, const WindowClause *b)
 	COMPARE_SCALAR_FIELD(frameOptions);
 	COMPARE_NODE_FIELD(startOffset);
 	COMPARE_NODE_FIELD(endOffset);
+	COMPARE_NODE_FIELD(runCondition);
+	COMPARE_NODE_FIELD(runConditionOrig);
 	COMPARE_SCALAR_FIELD(startInRangeFunc);
 	COMPARE_SCALAR_FIELD(endInRangeFunc);
 	COMPARE_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 213396f999..554b75866c 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -828,11 +828,14 @@ _outWindowAgg(StringInfo str, const WindowAgg *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runCondition);
+	WRITE_NODE_FIELD(runConditionOrig);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
 	WRITE_BOOL_FIELD(inRangeAsc);
 	WRITE_BOOL_FIELD(inRangeNullsFirst);
+	WRITE_BOOL_FIELD(usePassThrough);
 }
 
 static void
@@ -2279,6 +2282,7 @@ _outWindowAggPath(StringInfo str, const WindowAggPath *node)
 
 	WRITE_NODE_FIELD(subpath);
 	WRITE_NODE_FIELD(winclause);
+	WRITE_BOOL_FIELD(usepassthrough);
 }
 
 static void
@@ -3289,6 +3293,8 @@ _outWindowClause(StringInfo str, const WindowClause *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runCondition);
+	WRITE_NODE_FIELD(runConditionOrig);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 19e257684c..ce0c47cba3 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -384,6 +384,8 @@ _readWindowClause(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runCondition);
+	READ_NODE_FIELD(runConditionOrig);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
@@ -2572,11 +2574,14 @@ _readWindowAgg(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runCondition);
+	READ_NODE_FIELD(runConditionOrig);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
 	READ_BOOL_FIELD(inRangeAsc);
 	READ_BOOL_FIELD(inRangeNullsFirst);
+	READ_BOOL_FIELD(usePassThrough);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 169b1d53fc..a8573b6dbf 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/supportnodes.h"
 #ifdef OPTIMIZER_DEBUG
 #include "nodes/print.h"
 #endif
@@ -2157,6 +2158,280 @@ has_multiple_baserels(PlannerInfo *root)
 	return false;
 }
 
+/*
+ * find_window_run_conditions
+ *		Determine if 'wfunc' is really a WindowFunc and call its prosupport
+ *		function to determine the function's monotonic properties.  We then
+ *		see if 'opexpr' can be used to short-circuit execution.
+ *
+ * For example row_number() over (order by ...) always produces a value one
+ * higher than the previous.  If someone has a window function such as that
+ * in a subquery and just wants, say all rows with a row number less than or
+ * equal to 10, then we may as well stop processing the windowagg once the row
+ * number reaches 11.  Here we check if 'opexpr' might help us to stop doing
+ * needless extra processing in WindowAgg nodes.
+ *
+ * '*keep_original' is set to true if the caller should also use 'opexpr' for
+ * its original purpose.  This is set to false if the caller can assume that
+ * the run condition will handle all of the required filtering.
+ *
+ * Returns true if 'opexpr' was found to be useful and was added to the
+ * WindowClause's runCondition. We also set *keep_original accordingly.
+ * If the 'opexpr' cannot be used then we set *keep_original to true and
+ * return false.
+ */
+static bool
+find_window_run_conditions(Query *subquery, RangeTblEntry *rte, Index rti,
+						   AttrNumber attno, WindowFunc *wfunc, OpExpr *opexpr,
+						   bool wfunc_left, bool *keep_original)
+{
+	Oid			prosupport;
+	Expr	   *otherexpr;
+	SupportRequestWFuncMonotonic req;
+	SupportRequestWFuncMonotonic *res;
+	WindowClause *wclause;
+	List	   *opinfos;
+	OpExpr	   *runopexpr;
+	Oid			runoperator;
+	ListCell   *lc;
+
+	*keep_original = true;
+
+	while (IsA(wfunc, RelabelType))
+		wfunc = (WindowFunc *) ((RelabelType *) wfunc)->arg;
+
+	/* we can only work with window functions */
+	if (!IsA(wfunc, WindowFunc))
+		return false;
+
+	prosupport = get_func_support(wfunc->winfnoid);
+
+	/* Check if there's a support function for 'wfunc' */
+	if (!OidIsValid(prosupport))
+		return false;
+
+	/* get the Expr from the other side of the OpExpr */
+	if (wfunc_left)
+		otherexpr = lsecond(opexpr->args);
+	else
+		otherexpr = linitial(opexpr->args);
+
+	/*
+	 * The value being compared must not change during the evaluation of the
+	 * window partition.
+	 */
+	if (!is_pseudo_constant_clause((Node *) otherexpr))
+		return false;
+
+	/* find the window clause belonging to the window function */
+	wclause = (WindowClause *) list_nth(subquery->windowClause,
+										wfunc->winref - 1);
+
+	req.type = T_SupportRequestWFuncMonotonic;
+	req.window_func = wfunc;
+	req.window_clause = wclause;
+
+	/* call the support function */
+	res = (SupportRequestWFuncMonotonic *)
+		DatumGetPointer(OidFunctionCall1(prosupport,
+										 PointerGetDatum(&req)));
+
+	/*
+	 * Nothing to do if the function is neither monotonically increasing nor
+	 * monotonically decreasing.
+	 */
+	if (res == NULL || res->monotonic == MONOTONICFUNC_NONE)
+		return false;
+
+	runopexpr = NULL;
+	runoperator = InvalidOid;
+	opinfos = get_op_btree_interpretation(opexpr->opno);
+
+	foreach(lc, opinfos)
+	{
+		OpBtreeInterpretation *opinfo = (OpBtreeInterpretation *) lfirst(lc);
+		int			strategy = opinfo->strategy;
+
+		/* handle < / <= */
+		if (strategy == BTLessStrategyNumber ||
+			strategy == BTLessEqualStrategyNumber)
+		{
+			/*
+			 * < / <= is supported for monotonically increasing functions in
+			 * the form <wfunc> op <pseudoconst> and <pseudoconst> op <wfunc>
+			 * for monotonically decreasing functions.
+			 */
+			if ((wfunc_left && (res->monotonic & MONOTONICFUNC_INCREASING)) ||
+				(!wfunc_left && (res->monotonic & MONOTONICFUNC_DECREASING)))
+			{
+				/*
+				 * We must keep the original qual in place if there is a
+				 * PARTITION BY clause as the top-level WindowAgg remains in
+				 * pass-through mode and does nothing to filter out unwanted
+				 * tuples.
+				 */
+				*keep_original = (wclause->partitionClause != NIL);
+				runopexpr = opexpr;
+				runoperator = opexpr->opno;
+			}
+			break;
+		}
+		/* handle > / >= */
+		else if (strategy == BTGreaterStrategyNumber ||
+				 strategy == BTGreaterEqualStrategyNumber)
+		{
+			/*
+			 * > / >= is supported for monotonically decreasing functions in
+			 * the form <wfunc> op <pseudoconst> and <pseudoconst> op <wfunc>
+			 * for monotonically increasing functions.
+			 */
+			if ((wfunc_left && (res->monotonic & MONOTONICFUNC_DECREASING)) ||
+				(!wfunc_left && (res->monotonic & MONOTONICFUNC_INCREASING)))
+			{
+				*keep_original = (wclause->partitionClause != NIL);
+				runopexpr = opexpr;
+				runoperator = opexpr->opno;
+			}
+			break;
+		}
+		/* handle = */
+		else if (strategy == BTEqualStrategyNumber)
+		{
+			int16		newstrategy;
+
+			/*
+			 * When both monotonically increasing and decreasing then the
+			 * return value of the window function will be the same each time.
+			 * We can simply use 'opexpr' as the run condition without
+			 * modifying it.
+			 */
+			if ((res->monotonic & MONOTONICFUNC_BOTH) == MONOTONICFUNC_BOTH)
+			{
+				*keep_original = (wclause->partitionClause != NIL);
+				runopexpr = opexpr;
+				break;
+			}
+
+			/*
+			 * When monotonically increasing we make a qual with <wfunc> <=
+			 * <value> or <value> >= <wfunc> in order to filter out values
+			 * which are above the value in the equality condition.  For
+			 * monotonically decreasing functions we want to filter values
+			 * below the value in the equality condition.
+			 */
+			if (res->monotonic & MONOTONICFUNC_INCREASING)
+				newstrategy = wfunc_left ? BTLessEqualStrategyNumber : BTGreaterEqualStrategyNumber;
+			else
+				newstrategy = wfunc_left ? BTGreaterEqualStrategyNumber : BTLessEqualStrategyNumber;
+
+			/* We must keep the original equality qual */
+			*keep_original = true;
+			runopexpr = opexpr;
+
+			/* determine the operator to use for the runCondition qual */
+			runoperator = get_opfamily_member(opinfo->opfamily_id,
+											  opinfo->oplefttype,
+											  opinfo->oprighttype,
+											  newstrategy);
+			break;
+		}
+	}
+
+	if (runopexpr != NULL)
+	{
+		Expr	   *newexpr;
+
+		/*
+		 * Build the qual required for the run condition keeping the
+		 * WindowFunc on the same side as it was originally.
+		 */
+		if (wfunc_left)
+			newexpr = make_opclause(runoperator,
+									runopexpr->opresulttype,
+									runopexpr->opretset, (Expr *) wfunc,
+									otherexpr, runopexpr->opcollid,
+									runopexpr->inputcollid);
+		else
+			newexpr = make_opclause(runoperator,
+									runopexpr->opresulttype,
+									runopexpr->opretset,
+									otherexpr, (Expr *) wfunc,
+									runopexpr->opcollid,
+									runopexpr->inputcollid);
+
+		wclause->runCondition = lappend(wclause->runCondition, newexpr);
+
+		/*
+		 * Store the qual twice.  This one will be unmodified for use in
+		 * EXPLAIN.
+		 */
+		wclause->runConditionOrig = lappend(wclause->runConditionOrig, newexpr);
+		return true;
+	}
+
+	/* unsupported OpExpr */
+	return false;
+}
+
+/*
+ * check_and_push_window_quals
+ *		Check if 'clause' is a qual that can be pushed into a WindowFunc's
+ *		WindowClause as a 'runCondition' qual.  These, when present, allow
+ *		some unnecessary work to be skipped during execution.
+ *
+ * Returns true if the caller still must keep the original qual or false if
+ * the caller can safely ignore the original qual because the WindowAgg node
+ * will use the runCondition to stop returning tuples.
+ */
+static bool
+check_and_push_window_quals(Query *subquery, RangeTblEntry *rte, Index rti,
+							Node *clause)
+{
+	OpExpr	   *opexpr = (OpExpr *) clause;
+	bool		keep_original = true;
+	Var		   *var1;
+	Var		   *var2;
+
+	/* We're only able to use OpExprs with 2 operands */
+	if (!IsA(opexpr, OpExpr))
+		return true;
+
+	if (list_length(opexpr->args) != 2)
+		return true;
+
+	/*
+	 * Check for plain Vars that reference window functions in the subquery.
+	 * If we find any, we'll ask find_window_run_conditions() if 'opexpr' can
+	 * be used as a run condition.
+	 */
+
+	/* Check the left side of the OpExpr */
+	var1 = linitial(opexpr->args);
+	if (IsA(var1, Var) && var1->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var1->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		if (find_window_run_conditions(subquery, rte, rti, tle->resno, wfunc,
+									   opexpr, true, &keep_original))
+			return keep_original;
+	}
+
+	/* and check the right side */
+	var2 = lsecond(opexpr->args);
+	if (IsA(var2, Var) && var2->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var2->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		if (find_window_run_conditions(subquery, rte, rti, tle->resno, wfunc,
+									   opexpr, false, &keep_original))
+			return keep_original;
+	}
+
+	return true;
+}
+
 /*
  * set_subquery_pathlist
  *		Generate SubqueryScan access paths for a subquery RTE
@@ -2245,19 +2520,29 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		foreach(l, rel->baserestrictinfo)
 		{
 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+			Node	   *clause = (Node *) rinfo->clause;
 
 			if (!rinfo->pseudoconstant &&
 				qual_is_pushdown_safe(subquery, rti, rinfo, &safetyInfo))
 			{
-				Node	   *clause = (Node *) rinfo->clause;
-
 				/* Push it down */
 				subquery_push_qual(subquery, rte, rti, clause);
 			}
 			else
 			{
-				/* Keep it in the upper query */
-				upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				/*
+				 * Since we can't push the qual down into the subquery, check
+				 * if it happens to reference a window function.  If so then
+				 * it might be useful to use for the WindowAgg's runCondition.
+				 */
+				if (check_and_push_window_quals(subquery, rte, rti, clause))
+				{
+					/*
+					 * It's not a suitable window run condition qual or it is,
+					 * but the original must also be kept in the upper query.
+					 */
+					upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				}
 			}
 		}
 		rel->baserestrictinfo = upperrestrictlist;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 179c87c671..707fc888e8 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -288,7 +288,8 @@ static WindowAgg *make_windowagg(List *tlist, Index winref,
 								 int frameOptions, Node *startOffset, Node *endOffset,
 								 Oid startInRangeFunc, Oid endInRangeFunc,
 								 Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
-								 Plan *lefttree);
+								 List *runCondition, List *runConditionOrig,
+								 bool usePassThrough, Plan *lefttree);
 static Group *make_group(List *tlist, List *qual, int numGroupCols,
 						 AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
 						 Plan *lefttree);
@@ -2642,6 +2643,9 @@ create_windowagg_plan(PlannerInfo *root, WindowAggPath *best_path)
 						  wc->inRangeColl,
 						  wc->inRangeAsc,
 						  wc->inRangeNullsFirst,
+						  wc->runCondition,
+						  wc->runConditionOrig,
+						  best_path->usepassthrough,
 						  subplan);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6527,7 +6531,8 @@ make_windowagg(List *tlist, Index winref,
 			   int frameOptions, Node *startOffset, Node *endOffset,
 			   Oid startInRangeFunc, Oid endInRangeFunc,
 			   Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
-			   Plan *lefttree)
+			   List *runCondition, List *runConditionOrig,
+			   bool usePassThrough, Plan *lefttree)
 {
 	WindowAgg  *node = makeNode(WindowAgg);
 	Plan	   *plan = &node->plan;
@@ -6544,11 +6549,14 @@ make_windowagg(List *tlist, Index winref,
 	node->frameOptions = frameOptions;
 	node->startOffset = startOffset;
 	node->endOffset = endOffset;
+	node->runCondition = runCondition;
+	node->runConditionOrig = runConditionOrig;
 	node->startInRangeFunc = startInRangeFunc;
 	node->endInRangeFunc = endInRangeFunc;
 	node->inRangeColl = inRangeColl;
 	node->inRangeAsc = inRangeAsc;
 	node->inRangeNullsFirst = inRangeNullsFirst;
+	node->usePassThrough = usePassThrough;
 
 	plan->targetlist = tlist;
 	plan->lefttree = lefttree;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b2569c5d0c..ce0e26d517 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4214,6 +4214,7 @@ create_one_window_path(PlannerInfo *root,
 		List	   *window_pathkeys;
 		int			presorted_keys;
 		bool		is_sorted;
+		bool		usepassthrough;
 
 		window_pathkeys = make_pathkeys_for_window(root,
 												   wc,
@@ -4277,10 +4278,25 @@ create_one_window_path(PlannerInfo *root,
 			window_target = output_target;
 		}
 
+		/*
+		 * Determine the pass-through mode for the WindowAgg.  This only has
+		 * an effect when the WindowAgg has a runCondition.  In all cases, all
+		 * apart from the top-level WindowAgg must run in pass-through mode.
+		 * This means it will continue to pull tuples from the subplan even
+		 * when the runCondition is no longer true.  When there is no
+		 * PARTITION BY clause, the top-level WindowAgg can operate without
+		 * pass-through mode.  This allows the WindowAgg to return NULL to
+		 * indicate there are no more tuples.  However, when there is a
+		 * PARTITION BY clause we must always run the top-level WindowAgg in
+		 * pass-through mode as we need to process all partitions.
+		 */
+		usepassthrough = wc->partitionClause != NIL ||
+			foreach_current_index(l) < list_length(activeWindows) - 1;
+
 		path = (Path *)
 			create_windowagg_path(root, window_rel, path, window_target,
 								  wflists->windowFuncs[wc->winref],
-								  wc);
+								  wc, usepassthrough);
 	}
 
 	add_path(window_rel, path);
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index bf4c722c02..6ffd208a11 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -71,6 +71,13 @@ typedef struct
 	double		num_exec;
 } fix_upper_expr_context;
 
+typedef struct
+{
+	PlannerInfo *root;
+	indexed_tlist *subplan_itlist;
+	int			newvarno;
+} fix_windowagg_cond_context;
+
 /*
  * Selecting the best alternative in an AlternativeSubPlan expression requires
  * estimating how many times that expression will be evaluated.  For an
@@ -172,6 +179,9 @@ static List *set_returning_clause_references(PlannerInfo *root,
 											 Plan *topplan,
 											 Index resultRelation,
 											 int rtoffset);
+static List *set_windowagg_runcondition_references(PlannerInfo *root,
+												   List *runcondition,
+												   Plan *plan);
 
 
 /*****************************************************************************
@@ -886,6 +896,18 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 			{
 				WindowAgg  *wplan = (WindowAgg *) plan;
 
+				/*
+				 * Adjust the WindowAgg's run conditions by swapping the
+				 * WindowFuncs references out to instead reference the Var in
+				 * the scan slot so that when the executor evaluates the
+				 * runCondition, it receives the WindowFunc's value from the
+				 * slot that the result has just been stored into rather than
+				 * evaluating the WindowFunc all over again.
+				 */
+				wplan->runCondition = set_windowagg_runcondition_references(root,
+																			wplan->runCondition,
+																			(Plan *) wplan);
+
 				set_upper_references(root, plan, rtoffset);
 
 				/*
@@ -897,6 +919,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
 				wplan->endOffset =
 					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+				wplan->runCondition = fix_scan_list(root,
+													wplan->runCondition,
+													rtoffset,
+													NUM_EXEC_TLIST(plan));
+				wplan->runConditionOrig = fix_scan_list(root,
+														wplan->runConditionOrig,
+														rtoffset,
+														NUM_EXEC_TLIST(plan));
 			}
 			break;
 		case T_Result:
@@ -3050,6 +3080,78 @@ set_returning_clause_references(PlannerInfo *root,
 	return rlist;
 }
 
+/*
+ * fix_windowagg_condition_expr_mutator
+ *		Mutator function for replacing WindowFuncs with the corresponding Var
+ *		in the targetlist which references that WindowFunc.
+ */
+static Node *
+fix_windowagg_condition_expr_mutator(Node *node,
+									 fix_windowagg_cond_context *context)
+{
+	if (node == NULL)
+		return NULL;
+
+	if (IsA(node, WindowFunc))
+	{
+		Var		   *newvar;
+
+		newvar = search_indexed_tlist_for_non_var((Expr *) node,
+												  context->subplan_itlist,
+												  context->newvarno);
+		if (newvar)
+			return (Node *) newvar;
+		elog(ERROR, "WindowFunc not found in subplan target lists");
+	}
+
+	return expression_tree_mutator(node,
+								   fix_windowagg_condition_expr_mutator,
+								   (void *) context);
+}
+
+/*
+ * fix_windowagg_condition_expr
+ *		Converts references in 'runcondition' so that any WindowFunc
+ *		references are swapped out for a Var which references the matching
+ *		WindowFunc in 'subplan_itlist'.
+ */
+static List *
+fix_windowagg_condition_expr(PlannerInfo *root,
+							 List *runcondition,
+							 indexed_tlist *subplan_itlist)
+{
+	fix_windowagg_cond_context context;
+
+	context.root = root;
+	context.subplan_itlist = subplan_itlist;
+	context.newvarno = 0;
+
+	return (List *) fix_windowagg_condition_expr_mutator((Node *) runcondition,
+														 &context);
+}
+
+/*
+ * set_windowagg_runcondition_references
+ *		Converts references in 'runcondition' so that any WindowFunc
+ *		references are swapped out for a Var which references the matching
+ *		WindowFunc in 'plan' targetlist.
+ */
+static List *
+set_windowagg_runcondition_references(PlannerInfo *root,
+									  List *runcondition,
+									  Plan *plan)
+{
+	List	   *newlist;
+	indexed_tlist *itlist;
+
+	itlist = build_tlist_index(plan->targetlist);
+
+	newlist = fix_windowagg_condition_expr(root, runcondition, itlist);
+
+	pfree(itlist);
+
+	return newlist;
+}
 
 /*****************************************************************************
  *					QUERY DEPENDENCY MANAGEMENT
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1670e54644..34d107a03b 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3398,7 +3398,8 @@ create_windowagg_path(PlannerInfo *root,
 					  Path *subpath,
 					  PathTarget *target,
 					  List *windowFuncs,
-					  WindowClause *winclause)
+					  WindowClause *winclause,
+					  bool usepassthrough)
 {
 	WindowAggPath *pathnode = makeNode(WindowAggPath);
 
@@ -3416,6 +3417,7 @@ create_windowagg_path(PlannerInfo *root,
 
 	pathnode->subpath = subpath;
 	pathnode->winclause = winclause;
+	pathnode->usepassthrough = usepassthrough;
 
 	/*
 	 * For costing purposes, assume that there are no redundant partitioning
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index 4a87114a4f..98d4323755 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -24,6 +24,7 @@
 #include "nodes/supportnodes.h"
 #include "optimizer/optimizer.h"
 #include "utils/builtins.h"
+#include "utils/lsyscache.h"
 
 
 typedef struct
@@ -818,6 +819,49 @@ int8dec_any(PG_FUNCTION_ARGS)
 	return int8dec(fcinfo);
 }
 
+/*
+ * int8inc_support
+ *		prosupport function for int8inc() and int8inc_any()
+ */
+Datum
+int8inc_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+		MonotonicFunction monotonic = MONOTONICFUNC_NONE;
+		int			frameOptions = req->window_clause->frameOptions;
+
+		/* No ORDER BY clause then all rows are peers */
+		if (req->window_clause->orderClause == NIL)
+			monotonic = MONOTONICFUNC_BOTH;
+		else
+		{
+			/*
+			 * Otherwise take into account the frame options.  When the frame
+			 * bound is the start of the window then the resulting value can
+			 * never decrease, so the function is monotonically increasing.
+			 */
+			if (frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING)
+				monotonic |= MONOTONICFUNC_INCREASING;
+
+			/*
+			 * Likewise, if the frame bound is the end of the window then the
+			 * resulting value can never increase.
+			 */
+			if (frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING)
+				monotonic |= MONOTONICFUNC_DECREASING;
+		}
+
+		req->monotonic = monotonic;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 
 Datum
 int8larger(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/adt/windowfuncs.c b/src/backend/utils/adt/windowfuncs.c
index 3e0cc9be1a..596564fa15 100644
--- a/src/backend/utils/adt/windowfuncs.c
+++ b/src/backend/utils/adt/windowfuncs.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "nodes/supportnodes.h"
 #include "utils/builtins.h"
 #include "windowapi.h"
 
@@ -88,6 +89,26 @@ window_row_number(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(curpos + 1);
 }
 
+/*
+ * window_row_number_support
+ *		prosupport function for window_row_number()
+ */
+Datum
+window_row_number_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* row_number() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
 
 /*
  * rank
@@ -110,6 +131,27 @@ window_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_rank_support
+ *		prosupport function for window_rank()
+ */
+Datum
+window_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* rank() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * dense_rank
  * Rank increases by 1 when key columns change.
@@ -130,6 +172,27 @@ window_dense_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_dense_rank_support
+ *		prosupport function for window_dense_rank()
+ */
+Datum
+window_dense_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* dense_rank() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * percent_rank
  * return fraction between 0 and 1 inclusive,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 25304430f4..8076a90895 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6647,11 +6647,16 @@
 # count has two forms: count(any) and count(*)
 { oid => '2147',
   descr => 'number of input rows for which the input expression is not null',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => 'any', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => 'any',
+  prosrc => 'aggregate_dummy' },
 { oid => '2803', descr => 'number of input rows',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => '', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'aggregate_dummy' },
+{ oid => '8802', descr => 'planner support for count run condition',
+  proname => 'int8inc_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'int8inc_support' },
 
 { oid => '2718',
   descr => 'population variance of bigint input values (square of the population standard deviation)',
@@ -10155,14 +10160,26 @@
 
 # SQL-spec window functions
 { oid => '3100', descr => 'row number within partition',
-  proname => 'row_number', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_row_number' },
+  proname => 'row_number', prosupport => 'window_row_number_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8',
+  proargtypes => '', prosrc => 'window_row_number' },
+{ oid => '8799', descr => 'planner support for row_number run condition',
+  proname => 'window_row_number_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_row_number_support' },
 { oid => '3101', descr => 'integer rank with gaps',
-  proname => 'rank', prokind => 'w', proisstrict => 'f', prorettype => 'int8',
+  proname => 'rank', prosupport => 'window_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8',
   proargtypes => '', prosrc => 'window_rank' },
+{ oid => '8800', descr => 'planner support for rank run condition',
+  proname => 'window_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_rank_support' },
 { oid => '3102', descr => 'integer rank without gaps',
-  proname => 'dense_rank', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_dense_rank' },
+  proname => 'dense_rank', prosupport => 'window_dense_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'window_dense_rank' },
+{ oid => '8801', descr => 'planner support for dense rank run condition',
+  proname => 'window_dense_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_dense_rank_support' },
 { oid => '3103', descr => 'fractional rank within partition',
   proname => 'percent_rank', prokind => 'w', proisstrict => 'f',
   prorettype => 'float8', proargtypes => '', prosrc => 'window_percent_rank' },
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cbbcff81d2..8224a91f77 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2406,6 +2406,16 @@ typedef struct AggState
 typedef struct WindowStatePerFuncData *WindowStatePerFunc;
 typedef struct WindowStatePerAggData *WindowStatePerAgg;
 
+/*
+ * WindowAggStatus -- Used to track the status of WindowAggState
+ */
+typedef enum WindowAggStatus
+{
+	WINDOWAGG_DONE,
+	WINDOWAGG_RUN,
+	WINDOWAGG_PASSTHROUGH
+} WindowAggStatus;
+
 typedef struct WindowAggState
 {
 	ScanState	ss;				/* its first field is NodeTag */
@@ -2432,6 +2442,7 @@ typedef struct WindowAggState
 	struct WindowObjectData *agg_winobj;	/* winobj for aggregate fetches */
 	int64		aggregatedbase; /* start row for current aggregates */
 	int64		aggregatedupto; /* rows before this one are aggregated */
+	WindowAggStatus status;		/* run status of WindowAggState */
 
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	ExprState  *startOffset;	/* expression for starting bound offset */
@@ -2458,8 +2469,15 @@ typedef struct WindowAggState
 	MemoryContext curaggcontext;	/* current aggregate's working data */
 	ExprContext *tmpcontext;	/* short-term evaluation context */
 
+	ExprState  *runcondition;	/* Condition which must remain true otherwise
+								 * execution of the WindowAgg will finish or
+								 * go into pass-through mode.  NULL when there
+								 * is no such condition. */
+
+	bool		use_pass_through;	/* When false, stop execution when
+									 * runcondition is no longer true.  Else
+									 * just stop evaluating window funcs. */
 	bool		all_first;		/* true if the scan is starting */
-	bool		all_done;		/* true if the scan is finished */
 	bool		partition_spooled;	/* true if all tuples in current partition
 									 * have been spooled into tuplestore */
 	bool		more_partitions;	/* true if there's more partitions after
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index aefce33e28..ddac2ae0b6 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -559,7 +559,8 @@ typedef enum NodeTag
 	T_SupportRequestSelectivity,	/* in nodes/supportnodes.h */
 	T_SupportRequestCost,		/* in nodes/supportnodes.h */
 	T_SupportRequestRows,		/* in nodes/supportnodes.h */
-	T_SupportRequestIndexCondition	/* in nodes/supportnodes.h */
+	T_SupportRequestIndexCondition, /* in nodes/supportnodes.h */
+	T_SupportRequestWFuncMonotonic	/* in nodes/supportnodes.h */
 } NodeTag;
 
 /*
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index e58211eac1..54551bd6bd 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1402,6 +1402,8 @@ typedef struct WindowClause
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runCondition;	/* qual to help short-circuit execution */
+	List	   *runConditionOrig;	/* EXPLAIN compatible version of above */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
 	Oid			inRangeColl;	/* collation for in_range tests */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 6cbcb67bdf..048a1a64e6 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1843,6 +1843,8 @@ typedef struct WindowAggPath
 	Path		path;
 	Path	   *subpath;		/* path representing input source */
 	WindowClause *winclause;	/* WindowClause we'll be using */
+	bool		usepassthrough; /* use "pass-through" mode when winclause's
+								 * runCondition becomes false */
 } WindowAggPath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 50ef3dda05..c450972018 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -914,12 +914,17 @@ typedef struct WindowAgg
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runCondition;	/* qual to help short-circuit execution */
+	List	   *runConditionOrig;	/* runCondition for display in EXPLAIN */
 	/* these fields are used with RANGE offset PRECEDING/FOLLOWING: */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
 	Oid			inRangeColl;	/* collation for in_range tests */
 	bool		inRangeAsc;		/* use ASC sort order for in_range tests? */
 	bool		inRangeNullsFirst;	/* nulls sort first for in_range tests? */
+	bool		usePassThrough; /* Go into pass-through mode when runCondition
+								 * is no longer true.  When false, we just
+								 * stop execution */
 } WindowAgg;
 
 /* ----------------
@@ -1312,4 +1317,21 @@ typedef struct PlanInvalItem
 	uint32		hashValue;		/* hash value of object's cache lookup key */
 } PlanInvalItem;
 
+/*
+ * MonotonicFunction
+ *
+ * Allows the planner to track monotonic properties of functions.  A function
+ * is monotonically increasing if a subsequent call cannot yield a lower value
+ * than the previous call.  A monotonically decreasing function cannot yield a
+ * higher value on subsequent calls, and a function which is both must return
+ * the same value on each call.
+ */
+typedef enum MonotonicFunction
+{
+	MONOTONICFUNC_NONE = 0,
+	MONOTONICFUNC_INCREASING = (1 << 0),
+	MONOTONICFUNC_DECREASING = (1 << 1),
+	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING
+} MonotonicFunction;
+
 #endif							/* PLANNODES_H */
diff --git a/src/include/nodes/supportnodes.h b/src/include/nodes/supportnodes.h
index 88b61b3ab3..bdd43fc614 100644
--- a/src/include/nodes/supportnodes.h
+++ b/src/include/nodes/supportnodes.h
@@ -33,12 +33,12 @@
 #ifndef SUPPORTNODES_H
 #define SUPPORTNODES_H
 
-#include "nodes/primnodes.h"
+#include "nodes/plannodes.h"
 
 struct PlannerInfo;				/* avoid including pathnodes.h here */
 struct IndexOptInfo;
 struct SpecialJoinInfo;
-
+struct WindowClause;
 
 /*
  * The Simplify request allows the support function to perform plan-time
@@ -239,4 +239,64 @@ typedef struct SupportRequestIndexCondition
 								 * equivalent of the function call */
 } SupportRequestIndexCondition;
 
+/* ----------
+ * To allow more efficient execution of any monotonically increasing and/or
+ * monotonically decreasing window functions, we support calling the window
+ * function's prosupport function passing along this struct whenever the
+ * planner sees an OpExpr qual directly reference a window function in a
+ * subquery.  When the planner encounters this, we populate this struct and
+ * pass it along to the window function's prosupport function so that it can
+ * evaluate whether the given WindowFunc is:
+ *
+ * a) monotonically increasing, or
+ * b) monotonically decreasing, or
+ * c) both monotonically increasing and decreasing, or
+ * d) none of the above.
+ *
+ * A function that is monotonically increasing can never return a value that
+ * is lower than a value returned in a "previous call".  A monotonically
+ * decreasing function can never return a value higher than a value returned
+ * in a previous call.  A function that is both must return the same value
+ * each time.
+ *
+ * We define "previous call" to mean a previous call to the same WindowFunc
+ * struct in the same window partition.
+ *
+ * row_number() is an example of a monotonically increasing function.  The
+ * return value will be reset back to 1 in each new partition.  An example of
+ * a monotonically increasing and decreasing function is COUNT(*) OVER ().
+ * Since there is no ORDER BY clause in this example, all rows in the
+ * partition are peers and all rows within the partition will be within the
+ * frame bound.  Likewise for COUNT(*) OVER(ORDER BY a ROWS BETWEEN UNBOUNDED
+ * PRECEDING AND UNBOUNDED FOLLOWING).
+ *
+ * COUNT(*) OVER (ORDER BY a ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
+ * is an example of a monotonically decreasing function.
+ *
+ * Implementations must only concern themselves with the given WindowFunc
+ * being monotonic in a single partition.
+ *
+ * Inputs:
+ *	'window_func' is the pointer to the window function being called.
+ *
+ *	'window_clause' is the pointer to the WindowClause data.  Support
+ *	functions can use this to check frame bounds, etc.
+ *
+ * Outputs:
+ *	'monotonic' the resulting MonotonicFunction value for the given input
+ *	window function and window clause.
+ * ----------
+ */
+typedef struct SupportRequestWFuncMonotonic
+{
+	NodeTag		type;
+
+	/* Input fields: */
+	WindowFunc *window_func;	/* Pointer to the window function data */
+	struct WindowClause *window_clause; /* Pointer to the window clause data */
+
+	/* Output fields: */
+	MonotonicFunction monotonic;
+} SupportRequestWFuncMonotonic;
+
 #endif							/* SUPPORTNODES_H */
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 6eca547af8..c449b8ea88 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -245,7 +245,8 @@ extern WindowAggPath *create_windowagg_path(PlannerInfo *root,
 											Path *subpath,
 											PathTarget *target,
 											List *windowFuncs,
-											WindowClause *winclause);
+											WindowClause *winclause,
+											bool usepassthrough);
 extern SetOpPath *create_setop_path(PlannerInfo *root,
 									RelOptInfo *rel,
 									Path *subpath,
diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out
index bb9ff7f07b..4b22080d13 100644
--- a/src/test/regress/expected/window.out
+++ b/src/test/regress/expected/window.out
@@ -3336,6 +3336,365 @@ WHERE depname = 'sales';
                            ->  Seq Scan on empsalary
 (9 rows)
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                  QUERY PLAN                  
+----------------------------------------------
+ WindowAgg
+   Run Condition: (row_number() OVER (?) < 3)
+   ->  Sort
+         Sort Key: empsalary.empno
+         ->  Seq Scan on empsalary
+(5 rows)
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+               QUERY PLAN                
+-----------------------------------------
+ WindowAgg
+   Run Condition: (rank() OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+ empno | salary | r 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 2
+    11 |   5200 | 2
+(3 rows)
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Run Condition: (dense_rank() OVER (?) <= 1)
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+ empno | salary | dr 
+-------+--------+----
+     8 |   6000 |  1
+(1 row)
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                       QUERY PLAN                        
+---------------------------------------------------------
+ WindowAgg
+   Run Condition: (count(empsalary.empno) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) >= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER () c
+   FROM empsalary) emp
+WHERE 11 <= c;
+                 QUERY PLAN                 
+--------------------------------------------
+ WindowAgg
+   Run Condition: (11 <= count(*) OVER (?))
+   ->  Seq Scan on empsalary
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Run Condition: (dense_rank() OVER (?) <= 1)
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+-- Ensure we get a run condition when there's a PARTITION BY clause
+-- also ensure that the original qual remains as a filter.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.rn < 3)
+   ->  WindowAgg
+         Run Condition: (row_number() OVER (?) < 3)
+         ->  Sort
+               Sort Key: empsalary.depname, empsalary.empno
+               ->  Seq Scan on empsalary
+(7 rows)
+
+-- and ensure we get the correct results from the above plan
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+ empno |  depname  | rn 
+-------+-----------+----
+     7 | develop   |  1
+     8 | develop   |  2
+     2 | personnel |  1
+     5 | personnel |  2
+     1 | sales     |  1
+     3 | sales     |  2
+(6 rows)
+
+-- likewise with count(empno) instead of row_number()
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.c <= 3)
+   ->  WindowAgg
+         Run Condition: (count(empsalary.empno) OVER (?) <= 3)
+         ->  Sort
+               Sort Key: empsalary.depname, empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+-- and again, check the results are what we expect.
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno |  depname  | salary | c 
+-------+-----------+--------+---
+     8 | develop   |   6000 | 1
+    10 | develop   |   5200 | 3
+    11 | develop   |   5200 | 3
+     2 | personnel |   3900 | 1
+     5 | personnel |   3500 | 2
+     1 | sales     |   5000 | 1
+     4 | sales     |   4800 | 3
+     3 | sales     |   4800 | 3
+(8 rows)
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+-- Ensure we don't push down when the frame options show that the window
+-- function is not monotonically increasing
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                  QUERY PLAN                   
+-----------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.c <= 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(6 rows)
+
+-- Ensure we don't push down when the window function's monotonic properties
+-- don't match that of the clauses.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+                QUERY PLAN                
+------------------------------------------
+ Subquery Scan on emp
+   Filter: (3 <= emp.c)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary
+               ->  Seq Scan on empsalary
+(6 rows)
+
+-- Ensure we don't pushdown when there are multiple window clauses to evaluate
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY empno DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.empno DESC
+               ->  WindowAgg
+                     Run Condition: (dense_rank() OVER (?) <= 1)
+                     ->  Sort
+                           Sort Key: empsalary.salary DESC
+                           ->  Seq Scan on empsalary
+(10 rows)
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
diff --git a/src/test/regress/sql/window.sql b/src/test/regress/sql/window.sql
index 41a8e0d152..83db8a4201 100644
--- a/src/test/regress/sql/window.sql
+++ b/src/test/regress/sql/window.sql
@@ -988,6 +988,192 @@ SELECT * FROM
    FROM empsalary) emp
 WHERE depname = 'sales';
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER () c
+   FROM empsalary) emp
+WHERE 11 <= c;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+-- Ensure we get a run condition when there's a PARTITION BY clause
+-- also ensure that the original qual remains as a filter.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- and ensure we get the correct results from the above plan
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- likewise with count(empno) instead of row_number()
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+-- and again, check the results are what we expect.
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+
+-- Ensure we don't push down when the frame options show that the window
+-- function is not monotonically increasing
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+-- Ensure we don't push down when the window function's monotonic properties
+-- don't match that of the clauses.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+
+-- Ensure we don't pushdown when there are multiple window clauses to evaluate
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY empno DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
-- 
2.35.1.windows.2

#22

Andy Fan

zhihui.fan1213@gmail.com

almost 4 years ago

In reply to: David Rowley (#21)

Re: Window Function "Run Conditions"

I'm pretty happy with this now. If anyone wants to have a look at
this, can they do so or let me know they're going to within the next
24 hours. Otherwise I plan to move into commit mode with it.

I just came to the office today to double check this patch. I probably can
finish it very soon. But if you are willing to commit it sooner, I am
totally fine with it.

--
Best Regards
Andy Fan

#23

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: David Rowley (#21)

Re: Window Function "Run Conditions"

On 2022-04-05 12:04:18 +1200, David Rowley wrote:

This is afaics slightly cheaper than referencing a variable in a slot.

I guess you must mean cheaper because it means there will be no
EEOP_*_FETCHSOME step? Otherwise it seems a fairly similar amount of
work.

That, and slightly fewer indirections for accessing values IIRC.

#24

Andy Fan

zhihui.fan1213@gmail.com

almost 4 years ago

In reply to: Andy Fan (#22)

3 attachment(s)

Re: Window Function "Run Conditions"

Hi David:

I just came to the office today to double check this patch. I probably can
finish it very soon.

I'll share my current review results first; more review is still in
progress. There is a lot of amazing stuff in there, but I'll skip the
simple +1 and just share the things I don't fully understand yet. So far
I have focused only on the execution part, and only on the case with a
single WindowAgg node.

1. We can do more on PASSTHROUGH, we just bypass the window function
currently, but IIUC we can ignore all of the following tuples in current partition
once we go into this mode. patch 0001 shows what I mean.

--- Without patch 0001 we need 1653 ms for the query below; with patch 0001
--- we need 629 ms.  This is not a serious performance comparison since I
--- built the software with -O0 and --enable-cassert, but it does show some
--- improvement.
postgres=# explain analyze select * from (select x,y,row_number() over (partition by x order by y) rn from xy) as xy where rn < 2;
                                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on xy  (cost=0.42..55980.43 rows=5000 width=16) (actual time=0.072..1653.631 rows=1000 loops=1)
   Filter: (xy.rn = 1)
   Rows Removed by Filter: 999000
   ->  WindowAgg  (cost=0.42..43480.43 rows=1000000 width=16) (actual time=0.069..1494.553 rows=1000000 loops=1)
         Run Condition: (row_number() OVER (?) < 2)
         ->  Index Only Scan using xy_x_y_idx on xy xy_1  (cost=0.42..25980.42 rows=1000000 width=8) (actual time=0.047..330.283 rows=1000000 loops=1)
               Heap Fetches: 0
 Planning Time: 0.240 ms
 Execution Time: 1653.913 ms
(9 rows)

postgres=# explain analyze select * from (select x,y,row_number() over (partition by x order by y) rn from xy) as xy where rn < 2;
                                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on xy  (cost=0.42..55980.43 rows=5000 width=16) (actual time=0.103..629.428 rows=1000 loops=1)
   Filter: (xy.rn < 2)
   Rows Removed by Filter: 1000
   ->  WindowAgg  (cost=0.42..43480.43 rows=1000000 width=16) (actual time=0.101..628.821 rows=2000 loops=1)
         Run Condition: (row_number() OVER (?) < 2)
         ->  Index Only Scan using xy_x_y_idx on xy xy_1  (cost=0.42..25980.42 rows=1000000 width=8) (actual time=0.063..281.715 rows=1000000 loops=1)
               Heap Fetches: 0
 Planning Time: 1.119 ms
 Execution Time: 629.781 ms
(9 rows)

Time: 633.241 ms

2. The "Rows Removed by Filter: 1000" is strange to me for the above example.

 Subquery Scan on xy  (cost=0.42..55980.43 rows=5000 width=16) (actual time=0.103..629.428 rows=1000 loops=1)
   Filter: (xy.rn < 2)
   Rows Removed by Filter: 1000

The root cause is that even when ExecQual(winstate->runcondition, econtext)
returns false, we still return the slot to the upper node. A simple hack
can avoid it.

3. With the changes in 2, I think we can avoid the subquery node totally
for the above query.

4. If all of the above is correct, it looks like adding the enum
WindowAggStatus is not a must, since we can do what WINDOWAGG_PASSTHROUGH
does at the point where we detect that the run condition is false, as
patch 0003 shows. (I left only WINDOWAGG_DONE, but it can be replaced with
the previous all_done field.)

Finally, thanks for the patch. It is good material for studying this area.

--
Best Regards
Andy Fan

Attachments:

v1-0003-Try-to-remove-enum-WindowAggStatus.patch (application/octet-stream)

From 5d9b00f0b33542360410ca098274ee42377c17c9 Mon Sep 17 00:00:00 2001
From: Andy Fan <yizhi.fzh@alibaba-inc.com>
Date: Tue, 5 Apr 2022 15:08:33 +0800
Subject: [PATCH v1 3/3] Try to remove enum WindowAggStatus.

WINDOWAGG_DONE is left only, but it can be replaced with
previous all_done field.
---
 src/backend/executor/nodeWindowAgg.c | 80 +++++++++++-----------------
 1 file changed, 31 insertions(+), 49 deletions(-)

diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index ad592e9283a..05e2944e12a 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -2116,29 +2116,9 @@ ExecWindowAgg(PlanState *pstate)
 		/* we don't need to invalidate grouptail here; see below */
 	}
 
-retry:
-	/*
-	 * Spool all tuples up to and including the current row, if we haven't
-	 * already
-	 */
-	if (winstate->status == WINDOWAGG_PASSTHROUGH )
-	{
-		/* All the next tuples in this partition is not interest. */
-
-		/* let's read all of them in once. */
-		spool_tuples(winstate, -1);
-
-		/*
-		 * And stop handling them one by one. we just act as we have
-		 * read up to the last tuple.
-		 */
-		winstate->currentpos = winstate->spooled_rows;
-	}
-	else
-	{
-		spool_tuples(winstate, winstate->currentpos);
-	}
+	spool_tuples(winstate, winstate->currentpos);
 
+retry:
 	/* Move to the next partition if we reached the end of this partition */
 	if (winstate->partition_spooled &&
 		winstate->currentpos >= winstate->spooled_rows)
@@ -2149,9 +2129,6 @@ retry:
 		{
 			begin_partition(winstate);
 			Assert(winstate->spooled_rows > 0);
-
-			/* Come out of pass-through mode when changing partition */
-			winstate->status = WINDOWAGG_RUN;
 		}
 		else
 		{
@@ -2207,31 +2184,27 @@ retry:
 			elog(ERROR, "unexpected end of tuplestore");
 	}
 
-	/* don't evaluate the window functions when we're in pass-through mode */
-	if (winstate->status == WINDOWAGG_RUN)
+	/*
+	 * Evaluate true window functions
+	 */
+	numfuncs = winstate->numfuncs;
+	for (i = 0; i < numfuncs; i++)
 	{
-		/*
-		 * Evaluate true window functions
-		 */
-		numfuncs = winstate->numfuncs;
-		for (i = 0; i < numfuncs; i++)
-		{
-			WindowStatePerFunc perfuncstate = &(winstate->perfunc[i]);
-
-			if (perfuncstate->plain_agg)
-				continue;
-			eval_windowfunction(winstate, perfuncstate,
-								&(econtext->ecxt_aggvalues[perfuncstate->wfuncstate->wfuncno]),
-								&(econtext->ecxt_aggnulls[perfuncstate->wfuncstate->wfuncno]));
-		}
+		WindowStatePerFunc perfuncstate = &(winstate->perfunc[i]);
 
-		/*
-		 * Evaluate aggregates
-		 */
-		if (winstate->numaggs > 0)
-			eval_windowaggregates(winstate);
+		if (perfuncstate->plain_agg)
+			continue;
+		eval_windowfunction(winstate, perfuncstate,
+							&(econtext->ecxt_aggvalues[perfuncstate->wfuncstate->wfuncno]),
+							&(econtext->ecxt_aggnulls[perfuncstate->wfuncstate->wfuncno]));
 	}
 
+	/*
+	 * Evaluate aggregates
+	 */
+	if (winstate->numaggs > 0)
+		eval_windowaggregates(winstate);
+
 	/*
 	 * If we have created auxiliary read pointers for the frame or group
 	 * boundaries, force them to be kept up-to-date, because we don't know
@@ -2268,8 +2241,7 @@ retry:
 	 * window function or if we can stop completely.
 	 */
 	econtext->ecxt_scantuple = slot;
-	if (winstate->status == WINDOWAGG_RUN &&
-		!ExecQual(winstate->runcondition, econtext))
+	if (!ExecQual(winstate->runcondition, econtext))
 	{
 		/*
 		 * When the runcondition is no longer true we can either abort
@@ -2280,7 +2252,17 @@ retry:
 		 */
 		if (winstate->use_pass_through)
 		{
-			winstate->status = WINDOWAGG_PASSTHROUGH;
+			/* All the next tuples in this partition is not interest. */
+
+			/* let's read all of them in once. */
+			spool_tuples(winstate, -1);
+
+			/*
+			 * And stop handling them one by one. we just act as we have
+			 * read up to the last tuple.
+			 */
+			winstate->currentpos = winstate->spooled_rows;
+
 			goto retry;
 		}
 		else
-- 
2.21.0

v1-0001-When-we-are-in-PASSTHROUGH-mode-all-the-following.patch (application/octet-stream)

From 211e1502dbe2ab7d9aeb2c785c79123eb135f3b4 Mon Sep 17 00:00:00 2001
From: Andy Fan <yizhi.fzh@alibaba-inc.com>
Date: Tue, 5 Apr 2022 14:41:48 +0800
Subject: [PATCH v1 1/3] When we are in PASSTHROUGH mode, all the following
 tuples in this

partition is not interesting, so we quickly ignore all of them.
---
 src/backend/executor/nodeWindowAgg.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 869dcd74dff..f8792407773 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -2120,7 +2120,23 @@ ExecWindowAgg(PlanState *pstate)
 	 * Spool all tuples up to and including the current row, if we haven't
 	 * already
 	 */
-	spool_tuples(winstate, winstate->currentpos);
+	if (winstate->status == WINDOWAGG_PASSTHROUGH )
+	{
+		/* All the next tuples in this partition is not interest. */
+
+		/* let's read all of them in once. */
+		spool_tuples(winstate, -1);
+
+		/*
+		 * And stop handling them one by one. we just act as we have
+		 * read up to the last tuple.
+		 */
+		winstate->currentpos = winstate->spooled_rows;
+	}
+	else
+	{
+		spool_tuples(winstate, winstate->currentpos);
+	}
 
 	/* Move to the next partition if we reached the end of this partition */
 	if (winstate->partition_spooled &&
-- 
2.21.0

v1-0002-not-return-run-condition-false-tuple-to-upper-nod.patch (application/octet-stream)

From 5f85c2a263d067a5e6a3fb6a1870049f3e25e482 Mon Sep 17 00:00:00 2001
From: Andy Fan <yizhi.fzh@alibaba-inc.com>
Date: Tue, 5 Apr 2022 14:58:11 +0800
Subject: [PATCH v1 2/3] not return "run condition" = false tuple to upper
 node.

---
 src/backend/executor/nodeWindowAgg.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index f8792407773..ad592e9283a 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -2116,6 +2116,7 @@ ExecWindowAgg(PlanState *pstate)
 		/* we don't need to invalidate grouptail here; see below */
 	}
 
+retry:
 	/*
 	 * Spool all tuples up to and including the current row, if we haven't
 	 * already
@@ -2278,7 +2279,10 @@ ExecWindowAgg(PlanState *pstate)
 		 * use_pass_through field.
 		 */
 		if (winstate->use_pass_through)
+		{
 			winstate->status = WINDOWAGG_PASSTHROUGH;
+			goto retry;
+		}
 		else
 		{
 			winstate->status = WINDOWAGG_DONE;
-- 
2.21.0

#25

Andy Fan

zhihui.fan1213@gmail.com

almost 4 years ago

In reply to: Andy Fan (#24)

Re: Window Function "Run Conditions"

The root cause is that even when ExecQual(winstate->runcondition, econtext)
returns false, we still return the slot to the upper node. A simple hack
can avoid it.

I forgot to say that patch 0002 shows what I mean.

--
Best Regards
Andy Fan

#26

David Rowley

dgrowleyml@gmail.com

almost 4 years ago

In reply to: Andy Fan (#24)

1 attachment(s)

Re: Window Function "Run Conditions"

On Tue, 5 Apr 2022 at 19:38, Andy Fan <zhihui.fan1213@gmail.com> wrote:

1. We can do more on PASSTHROUGH, we just bypass the window function
currently, but IIUC we can ignore all of the following tuples in current partition
once we go into this mode. patch 0001 shows what I mean.

Yeah, there is more performance to be had than even what you've done
there. There's no reason really for spool_tuples() to do
tuplestore_puttupleslot() when we're not in run mode.

The attached should give slightly more performance. I'm unsure if
there's more that can be done for window aggregates, i.e.
eval_windowaggregates()

I'll consider the idea about doing all the filtering in
nodeWindowAgg.c. For now I made find_window_run_conditions() keep the
qual so that it's still filtered in the subquery level when there is a
PARTITION BY clause. Probably the best way would be to make
nodeWindowAgg.c just loop with a for(;;) loop. I'll need to give it
more thought. I'll do that in the morning.

David

Attachments:

windowagg_hacks.patch (text/plain; charset=US-ASCII)

diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 869dcd74df..19fe720d9c 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1248,6 +1248,9 @@ spool_tuples(WindowAggState *winstate, int64 pos)
 	if (winstate->partition_spooled)
 		return;					/* whole partition done already */
 
+	if (winstate->status == WINDOWAGG_PASSTHROUGH)
+		pos = -1;
+
 	/*
 	 * If the tuplestore has spilled to disk, alternate reading and writing
 	 * becomes quite expensive due to frequent buffer flushes.  It's cheaper
@@ -1256,7 +1259,7 @@ spool_tuples(WindowAggState *winstate, int64 pos)
 	 * XXX this is a horrid kluge --- it'd be better to fix the performance
 	 * problem inside tuplestore.  FIXME
 	 */
-	if (!tuplestore_in_memory(winstate->buffer))
+	else if (!tuplestore_in_memory(winstate->buffer))
 		pos = -1;
 
 	outerPlan = outerPlanState(winstate);
@@ -1295,9 +1298,12 @@ spool_tuples(WindowAggState *winstate, int64 pos)
 			}
 		}
 
-		/* Still in partition, so save it into the tuplestore */
-		tuplestore_puttupleslot(winstate->buffer, outerslot);
-		winstate->spooled_rows++;
+		if (winstate->status == WINDOWAGG_RUN)
+		{
+			/* Still in partition, so save it into the tuplestore */
+			tuplestore_puttupleslot(winstate->buffer, outerslot);
+			winstate->spooled_rows++;
+		}
 	}
 
 	MemoryContextSwitchTo(oldcontext);

#27

Andy Fan

zhihui.fan1213@gmail.com

almost 4 years ago

In reply to: David Rowley (#26)

Re: Window Function "Run Conditions"

On Tue, Apr 5, 2022 at 7:49 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Tue, 5 Apr 2022 at 19:38, Andy Fan <zhihui.fan1213@gmail.com> wrote:

1. We can do more on PASSTHROUGH, we just bypass the window function
currently, but IIUC we can ignore all of the following tuples in current partition
once we go into this mode. patch 0001 shows what I mean.

Yeah, there is more performance to be had than even what you've done
there. There's no reason really for spool_tuples() to do
tuplestore_puttupleslot() when we're not in run mode.

Yeah, this is a great idea.

The attached should give slightly more performance. I'm unsure if
there's more that can be done for window aggregates, i.e.
eval_windowaggregates()

I'll consider the idea about doing all the filtering in
nodeWindowAgg.c. For now I made find_window_run_conditions() keep the
qual so that it's still filtered in the subquery level when there is a
PARTITION BY clause. Probably the best way would be to make
nodeWindowAgg.c just loop with a for(;;) loop. I'll need to give it
more thought. I'll do that in the morning.

I just finished reviewing the planner part and thought about the case with
multiple activeWindows. I think passthrough mode is still needed, but only
for the multiple-activeWindows case: in passthrough mode we cannot discard
the tuples in the same partition. However, a PARTITION BY clause should not
be a requirement for passthrough mode, so we can still do that
optimization. We can discuss this more after your final decision.

And I would suggest the below fastpath for this feature.

@@ -2535,7 +2535,7 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
                                 * if it happens to reference a window function.  If so then
                                 * it might be useful to use for the WindowAgg's runCondition.
                                 */
-                               if (check_and_push_window_quals(subquery, rte, rti, clause))
+                               if (!subquery->hasWindowFuncs || check_and_push_window_quals(subquery, rte, rti, clause))
                                 {
                                         /*
                                          * It's not a suitable window run condition qual or it is,
--
Best Regards
Andy Fan

#28

David Rowley

dgrowleyml@gmail.com

almost 4 years ago

In reply to: Andy Fan (#27)

1 attachment(s)

Re: Window Function "Run Conditions"

On Wed, 6 Apr 2022 at 00:59, Andy Fan <zhihui.fan1213@gmail.com> wrote:

On Tue, Apr 5, 2022 at 7:49 PM David Rowley <dgrowleyml@gmail.com> wrote:

Yeah, there is more performance to be had than even what you've done
there. There's no reason really for spool_tuples() to do
tuplestore_puttupleslot() when we're not in run mode.

Yeah, this is a great idea.

I've attached an updated patch that does most of what you mentioned.
To make this work I had to add another state to the WindowAggStatus.
This new state is what the top-level WindowAgg will move into when
there's a PARTITION BY clause and the run condition becomes false.
The new state is named WINDOWAGG_PASSTHROUGH_STRICT, which does all
that WINDOWAGG_PASSTHROUGH does plus skips tuplestoring tuples during
the spool. We must still spool those tuples when we're not the
top-level WindowAgg so that we can send those out to any calling
WindowAgg nodes. They'll need those so they return the correct result.

This means that for intermediate WindowAgg nodes, when the
runcondition becomes false, we only skip evaluation of WindowFuncs.
WindowAgg nodes above us cannot reference these, so there's no need to
evaluate them, plus, if there's a run condition then these tuples will
be filtered out in the final WindowAgg node.

For the top-level WindowAgg node, when the run condition becomes false
we can save quite a bit more work. If there's no PARTITION BY clause,
then we're done. Just return NULL. When there is a PARTITION BY
clause we move into WINDOWAGG_PASSTHROUGH_STRICT which allows us to
skip both the evaluation of WindowFuncs and also allows us to consume
tuples from our outer plan until we get a tuple belonging to another
partition. No need to tuplestore these tuples as they're being
filtered out.
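
To make the top-level case above concrete, here is a minimal standalone sketch in C. It is not code from the patch: Row, simulate_top_level() and the hard-coded run condition "rn < limit" are hypothetical names made up for illustration, and only the WindowAggStatus values are taken from this thread. It assumes the input is already sorted by partition key.

```c
#include <assert.h>
#include <stddef.h>

/* State names as used in this thread; everything else is illustrative. */
typedef enum
{
	WINDOWAGG_DONE,
	WINDOWAGG_RUN,
	WINDOWAGG_PASSTHROUGH,		/* intermediate nodes only; unused here */
	WINDOWAGG_PASSTHROUGH_STRICT
} WindowAggStatus;

/* Hypothetical input row: only the partition key matters for this sketch. */
typedef struct Row
{
	int			partkey;
} Row;

/*
 * Simulate a top-level WindowAgg computing row_number() with the run
 * condition "rn < limit".  Returns the number of rows emitted and stores
 * the number of row_number() evaluations in *nevals.
 */
static size_t
simulate_top_level(const Row *rows, size_t nrows, int limit,
				   int has_partition_by, size_t *nevals)
{
	WindowAggStatus status = WINDOWAGG_RUN;
	size_t		emitted = 0;
	int			rn = 0;

	*nevals = 0;

	for (size_t i = 0; i < nrows; i++)
	{
		/* A new partition resets row_number() and ends pass-through mode */
		if (has_partition_by && i > 0 &&
			rows[i].partkey != rows[i - 1].partkey)
		{
			rn = 0;
			status = WINDOWAGG_RUN;
		}

		/* Strict pass-through: consume the row without storing it */
		if (status == WINDOWAGG_PASSTHROUGH_STRICT)
			continue;

		rn++;
		(*nevals)++;

		if (rn < limit)
		{
			emitted++;
			continue;
		}

		/* The run condition is now false for the rest of this partition */
		if (!has_partition_by)
		{
			status = WINDOWAGG_DONE;	/* nothing further can match */
			break;
		}
		status = WINDOWAGG_PASSTHROUGH_STRICT;
	}

	assert(status != WINDOWAGG_PASSTHROUGH);	/* never used at top level */
	return emitted;
}
```

With six rows whose partition keys are 1,1,1,2,2,3 and limit = 2, this sketch emits 3 rows and evaluates row_number() 5 times when has_partition_by is set. Without PARTITION BY it emits 1 row and stops after 2 evaluations.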

Since intermediate WindowAggs cannot filter tuples, all the filtering
must occur in the top-level WindowAgg. This cannot be done by way of
the run condition as the run condition is special as when it becomes
false, we don't check again to see if it's become true. A sort node
between the WindowAggs can change the tuple order (i.e. previously
monotonic values may no longer be monotonic) so it's only valid to
evaluate the run condition that's meant for the WindowAgg node it was
intended for. To filter out the tuples that don't match the run
condition from intermediate WindowAggs in the top-level WindowAgg,
what I've done is introduced quals for WindowAgg nodes. This means
that we can now see Filter in EXPLAIN for WindowAgg and "Rows Removed
by Filter".
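
The difference between a run condition and an ordinary filter can be shown with a small standalone C sketch. This is only an illustration under assumed names (count_with_filter() and count_with_run_condition() are hypothetical and do not exist in the patch): a filter tests every row, while a run-condition-style check gives up at the first row that fails, which is only safe while the values arrive in monotonic order.

```c
#include <assert.h>
#include <stddef.h>

/* Ordinary filter: test every row against "val <= bound". */
static size_t
count_with_filter(const int *vals, size_t n, int bound)
{
	size_t		matched = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (vals[i] <= bound)
			matched++;
	}
	return matched;
}

/*
 * Run-condition style: stop at the first row where "val <= bound" is
 * false and never check again.  Only correct if vals[] never decreases.
 */
static size_t
count_with_run_condition(const int *vals, size_t n, int bound)
{
	size_t		matched = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (vals[i] > bound)
			break;
		matched++;
	}
	return matched;
}
```

For the monotonic input 1,2,3,4,5 with bound 3 both functions count 3 rows. If the same values arrive as 4,1,5,2,3, for example after a Sort for a different window clause, the filter still counts 3 but the run-condition version counts 0.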

Why didn't I just do the filtering in the outer query, as was
happening before? The problem is that when we push the quals down
into the subquery, we don't yet know the order in which the
WindowAggs will be evaluated. Only run conditions from
intermediate WindowAggs will ever make it into the Filter, and we
don't know which one the top-level WindowAgg will be until later in
planning. To do the filtering in the outer query we'd need to push
quals back out the subquery again. It seems to me to be easier and
better to filter them out lower down in the plan.

Since the top-level WindowAgg node can now filter tuples, the executor
node had to be given a for(;;) loop so that it goes around again for
another tuple after it filters a tuple out.
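
The shape of such a loop in a pull-based node can be sketched in standalone C as follows. This is an assumption-laden illustration rather than the actual ExecWindowAgg() code: Source and next_matching() are hypothetical, and the qual is hard-coded as "val <= bound".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pull-based input: an array plus a read position. */
typedef struct Source
{
	const int  *vals;
	size_t		n;
	size_t		pos;
} Source;

/*
 * Return the next value that passes the qual "val <= bound", or NULL once
 * the input is exhausted.  A filtered-out value does not end the call; we
 * simply go around the loop again and fetch another one.
 */
static const int *
next_matching(Source *src, int bound)
{
	for (;;)
	{
		const int  *val;

		if (src->pos >= src->n)
			return NULL;		/* no more input */

		val = &src->vals[src->pos++];

		if (*val <= bound)
			return val;			/* passes the qual, hand it to the caller */

		/* filtered out: loop for another value */
	}
}
```

Each call returns at most one row, so the caller sees the same interface as before. The loop only hides the rows that the qual rejected.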

I've also updated the commit message which I think I've made quite
clear about what we optimise and how it's done.

And I would suggest the below fastpath for this feature.
-                               if (check_and_push_window_quals(subquery, rte, rti, clause))
+                               if (!subquery->hasWindowFuncs || check_and_push_window_quals(subquery, rte, rti, clause))

Good idea. Thanks!

David

Attachments:

v6-0001-Teach-planner-and-executor-about-monotonic-window.patch (text/plain; charset=US-ASCII)

From ffdff181761a0299b2d38e8380f6c67067081211 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Tue, 5 Apr 2022 12:00:40 +1200
Subject: [PATCH v6] Teach planner and executor about monotonic window funcs

Window functions such as row_number() always return a value higher than
the previously returned value for tuples in any given window partition.

Traditionally queries such as;

SELECT * FROM (
   SELECT *, row_number() over (order by c) rn
   FROM t
) t WHERE rn <= 10;

were executed fairly inefficiently.  Neither the query planner nor the
executor knew that once rn made it to 11 that nothing further would match
the outer query's WHERE clause.  It would blindly continue until all
tuples were exhausted from the subquery.

Here we implement means to make the above execute more efficiently.

This is done by way of adding a pg_proc.prosupport function to various of
the built-in window functions and adding supporting code to allow the
support function to inform the planner if the window function is
monotonically increasing, monotonically decreasing, both or neither.  The
planner is then able to make use of that information and possibly allow
the executor to short-circuit execution by way of adding a "run condition"
to the WindowAgg to allow it to determine if some of its execution work
can be skipped.

This "run condition" is not like a normal filter.  These run conditions
are only built using quals comparing values to monotonic window functions.
For monotonic increasing functions, quals making use of the btree
operators for <, <= and = can be used (assuming the window function column
is on the left). You can see here that once such a condition becomes false
that a monotonic increasing function could never make it subsequently true
again.  For monotonically decreasing functions the >, >= and = btree
operators for the given type can be used for run conditions.

The best-case situation for this is when there is a single WindowAgg node
without a PARTITION BY clause.  Here when the run condition becomes false
the WindowAgg node can simply return NULL.  No more tuples will ever match
the run condition.  It's a little more complex when there is a PARTITION
BY clause.  In this case, we cannot return NULL as we must still process
other partitions.  To speed this case up we pull tuples from the outer
plan to check if they're from the same partition and simply discard them
if they are.  When we find a tuple belonging to another partition we start
processing as normal again until the run condition becomes false or we run
out of tuples to process.

When there are multiple WindowAgg nodes to evaluate then this complicates
the situation.  For non-top-level WindowAggs we must ensure we always
return all tuples to the calling node.  Any filtering done could lead to
incorrect results in WindowAgg nodes above.  For all non-top-level nodes,
we can still save some work when the run condition becomes false.  We've
no need to evaluate the WindowFuncs anymore.  Other WindowAgg nodes cannot
reference the value of these and these tuples will not appear in the final
result anyway.  The savings here are small in comparison to what can be
saved in the top-level WindowAgg, but still worthwhile.

Intermediate WindowAgg nodes never filter out tuples, but here we change
WindowAgg so that the top-level WindowAgg filters out tuples that don't
match the intermediate WindowAgg node's run condition.  Such filters
appear in the Filter clause in EXPLAIN for the top-level WindowAgg node.

Here we add prosupport functions to allow the above to work for;
row_number(), rank(), dense_rank(), count(*) and count(expr).  It appears
technically possible to do the same for min() and max(), however, it seems
unlikely to be useful enough, so that's not done here.

Author: David Rowley
Reviewed-by: Andy Fan, Zhihong Yu
Discussion: https://postgr.es/m/CAApHDvqvp3At8++yF8ij06sdcoo1S_b2YoaT9D4Nf+MObzsrLQ@mail.gmail.com
---
 src/backend/commands/explain.c          |   8 +
 src/backend/executor/nodeWindowAgg.c    | 383 +++++++++++++++--------
 src/backend/nodes/copyfuncs.c           |   4 +
 src/backend/nodes/equalfuncs.c          |   1 +
 src/backend/nodes/outfuncs.c            |   6 +
 src/backend/nodes/readfuncs.c           |   4 +
 src/backend/optimizer/path/allpaths.c   | 290 ++++++++++++++++-
 src/backend/optimizer/plan/createplan.c |  13 +-
 src/backend/optimizer/plan/planner.c    |  15 +-
 src/backend/optimizer/plan/setrefs.c    | 102 ++++++
 src/backend/optimizer/util/pathnode.c   |  11 +-
 src/backend/utils/adt/int8.c            |  44 +++
 src/backend/utils/adt/windowfuncs.c     |  63 ++++
 src/include/catalog/pg_proc.dat         |  35 ++-
 src/include/nodes/execnodes.h           |  24 +-
 src/include/nodes/nodes.h               |   3 +-
 src/include/nodes/parsenodes.h          |   1 +
 src/include/nodes/pathnodes.h           |   3 +
 src/include/nodes/plannodes.h           |  21 ++
 src/include/nodes/supportnodes.h        |  64 +++-
 src/include/optimizer/pathnode.h        |   4 +-
 src/test/regress/expected/window.out    | 398 ++++++++++++++++++++++++
 src/test/regress/sql/window.sql         | 206 ++++++++++++
 23 files changed, 1553 insertions(+), 150 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 1e5701b8eb..33d8bf87fb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1988,6 +1988,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_WindowAgg:
+			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+					planstate, es);
+			show_upper_qual(((WindowAgg *) plan)->runConditionOrig,
+							"Run Condition", planstate, ancestors, es);
+			break;
 		case T_Group:
 			show_group_keys(castNode(GroupState, planstate), ancestors, es);
 			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 08ce05ca5a..d1fa54729c 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1248,6 +1248,20 @@ spool_tuples(WindowAggState *winstate, int64 pos)
 	if (winstate->partition_spooled)
 		return;					/* whole partition done already */
 
+	/*
+	 * When in pass-through mode we can just exhaust all tuples in the current
+	 * partition.  We don't need these tuples for any further window function
+	 * evaluation, however, we do need to keep them around if we're not the
+	 * top-level window as another WindowAgg node above must see these.
+	 */
+	if (winstate->status != WINDOWAGG_RUN)
+	{
+		Assert(winstate->status == WINDOWAGG_PASSTHROUGH ||
+			   winstate->status == WINDOWAGG_PASSTHROUGH_STRICT);
+
+		pos = -1;
+	}
+
 	/*
 	 * If the tuplestore has spilled to disk, alternate reading and writing
 	 * becomes quite expensive due to frequent buffer flushes.  It's cheaper
@@ -1256,7 +1270,7 @@ spool_tuples(WindowAggState *winstate, int64 pos)
 	 * XXX this is a horrid kluge --- it'd be better to fix the performance
 	 * problem inside tuplestore.  FIXME
 	 */
-	if (!tuplestore_in_memory(winstate->buffer))
+	else if (!tuplestore_in_memory(winstate->buffer))
 		pos = -1;
 
 	outerPlan = outerPlanState(winstate);
@@ -1295,9 +1309,16 @@ spool_tuples(WindowAggState *winstate, int64 pos)
 			}
 		}
 
-		/* Still in partition, so save it into the tuplestore */
-		tuplestore_puttupleslot(winstate->buffer, outerslot);
-		winstate->spooled_rows++;
+		/*
+		 * Remember the tuple unless we're the top-level window and we're in
+		 * pass-through mode.
+		 */
+		if (winstate->status != WINDOWAGG_PASSTHROUGH_STRICT)
+		{
+			/* Still in partition, so save it into the tuplestore */
+			tuplestore_puttupleslot(winstate->buffer, outerslot);
+			winstate->spooled_rows++;
+		}
 	}
 
 	MemoryContextSwitchTo(oldcontext);
@@ -2023,13 +2044,14 @@ static TupleTableSlot *
 ExecWindowAgg(PlanState *pstate)
 {
 	WindowAggState *winstate = castNode(WindowAggState, pstate);
+	TupleTableSlot *slot;
 	ExprContext *econtext;
 	int			i;
 	int			numfuncs;
 
 	CHECK_FOR_INTERRUPTS();
 
-	if (winstate->all_done)
+	if (winstate->status == WINDOWAGG_DONE)
 		return NULL;
 
 	/*
@@ -2099,143 +2121,224 @@ ExecWindowAgg(PlanState *pstate)
 		winstate->all_first = false;
 	}
 
-	if (winstate->buffer == NULL)
-	{
-		/* Initialize for first partition and set current row = 0 */
-		begin_partition(winstate);
-		/* If there are no input rows, we'll detect that and exit below */
-	}
-	else
+	/* We need to loop as the runCondition may filter out tuples */
+	for (;;)
 	{
-		/* Advance current row within partition */
-		winstate->currentpos++;
-		/* This might mean that the frame moves, too */
-		winstate->framehead_valid = false;
-		winstate->frametail_valid = false;
-		/* we don't need to invalidate grouptail here; see below */
-	}
+		if (winstate->buffer == NULL)
+		{
+			/* Initialize for first partition and set current row = 0 */
+			begin_partition(winstate);
+			/* If there are no input rows, we'll detect that and exit below */
+		}
+		else
+		{
+			/* Advance current row within partition */
+			winstate->currentpos++;
+			/* This might mean that the frame moves, too */
+			winstate->framehead_valid = false;
+			winstate->frametail_valid = false;
+			/* we don't need to invalidate grouptail here; see below */
+		}
 
-	/*
-	 * Spool all tuples up to and including the current row, if we haven't
-	 * already
-	 */
-	spool_tuples(winstate, winstate->currentpos);
+		/*
+		 * Spool all tuples up to and including the current row, if we haven't
+		 * already
+		 */
+		spool_tuples(winstate, winstate->currentpos);
 
-	/* Move to the next partition if we reached the end of this partition */
-	if (winstate->partition_spooled &&
-		winstate->currentpos >= winstate->spooled_rows)
-	{
-		release_partition(winstate);
+		/* Move to the next partition if we reached the end of this partition */
+		if (winstate->partition_spooled &&
+			winstate->currentpos >= winstate->spooled_rows)
+		{
+			release_partition(winstate);
+
+			if (winstate->more_partitions)
+			{
+				begin_partition(winstate);
+				Assert(winstate->spooled_rows > 0);
+
+				/* Come out of pass-through mode when changing partition */
+				winstate->status = WINDOWAGG_RUN;
+			}
+			else
+			{
+				/* No further partitions?  We're done */
+				winstate->status = WINDOWAGG_DONE;
+				return NULL;
+			}
+		}
+
+		/* final output execution is in ps_ExprContext */
+		econtext = winstate->ss.ps.ps_ExprContext;
+
+		/* Clear the per-output-tuple context for current row */
+		ResetExprContext(econtext);
 
-		if (winstate->more_partitions)
+		/*
+		 * Read the current row from the tuplestore, and save in
+		 * ScanTupleSlot. (We can't rely on the outerplan's output slot
+		 * because we may have to read beyond the current row.  Also, we have
+		 * to actually copy the row out of the tuplestore, since window
+		 * function evaluation might cause the tuplestore to dump its state to
+		 * disk.)
+		 *
+		 * In GROUPS mode, or when tracking a group-oriented exclusion clause,
+		 * we must also detect entering a new peer group and update associated
+		 * state when that happens.  We use temp_slot_2 to temporarily hold
+		 * the previous row for this purpose.
+		 *
+		 * Current row must be in the tuplestore, since we spooled it above.
+		 */
+		tuplestore_select_read_pointer(winstate->buffer, winstate->current_ptr);
+		if ((winstate->frameOptions & (FRAMEOPTION_GROUPS |
+									   FRAMEOPTION_EXCLUDE_GROUP |
+									   FRAMEOPTION_EXCLUDE_TIES)) &&
+			winstate->currentpos > 0)
 		{
-			begin_partition(winstate);
-			Assert(winstate->spooled_rows > 0);
+			ExecCopySlot(winstate->temp_slot_2, winstate->ss.ss_ScanTupleSlot);
+			if (!tuplestore_gettupleslot(winstate->buffer, true, true,
+										 winstate->ss.ss_ScanTupleSlot))
+				elog(ERROR, "unexpected end of tuplestore");
+			if (!are_peers(winstate, winstate->temp_slot_2,
+						   winstate->ss.ss_ScanTupleSlot))
+			{
+				winstate->currentgroup++;
+				winstate->groupheadpos = winstate->currentpos;
+				winstate->grouptail_valid = false;
+			}
+			ExecClearTuple(winstate->temp_slot_2);
 		}
 		else
 		{
-			winstate->all_done = true;
-			return NULL;
+			if (!tuplestore_gettupleslot(winstate->buffer, true, true,
+										 winstate->ss.ss_ScanTupleSlot))
+				elog(ERROR, "unexpected end of tuplestore");
 		}
-	}
 
-	/* final output execution is in ps_ExprContext */
-	econtext = winstate->ss.ps.ps_ExprContext;
+		/* don't evaluate the window functions when we're in pass-through mode */
+		if (winstate->status == WINDOWAGG_RUN)
+		{
+			/*
+			 * Evaluate true window functions
+			 */
+			numfuncs = winstate->numfuncs;
+			for (i = 0; i < numfuncs; i++)
+			{
+				WindowStatePerFunc perfuncstate = &(winstate->perfunc[i]);
 
-	/* Clear the per-output-tuple context for current row */
-	ResetExprContext(econtext);
+				if (perfuncstate->plain_agg)
+					continue;
+				eval_windowfunction(winstate, perfuncstate,
+									&(econtext->ecxt_aggvalues[perfuncstate->wfuncstate->wfuncno]),
+									&(econtext->ecxt_aggnulls[perfuncstate->wfuncstate->wfuncno]));
+			}
 
-	/*
-	 * Read the current row from the tuplestore, and save in ScanTupleSlot.
-	 * (We can't rely on the outerplan's output slot because we may have to
-	 * read beyond the current row.  Also, we have to actually copy the row
-	 * out of the tuplestore, since window function evaluation might cause the
-	 * tuplestore to dump its state to disk.)
-	 *
-	 * In GROUPS mode, or when tracking a group-oriented exclusion clause, we
-	 * must also detect entering a new peer group and update associated state
-	 * when that happens.  We use temp_slot_2 to temporarily hold the previous
-	 * row for this purpose.
-	 *
-	 * Current row must be in the tuplestore, since we spooled it above.
-	 */
-	tuplestore_select_read_pointer(winstate->buffer, winstate->current_ptr);
-	if ((winstate->frameOptions & (FRAMEOPTION_GROUPS |
-								   FRAMEOPTION_EXCLUDE_GROUP |
-								   FRAMEOPTION_EXCLUDE_TIES)) &&
-		winstate->currentpos > 0)
-	{
-		ExecCopySlot(winstate->temp_slot_2, winstate->ss.ss_ScanTupleSlot);
-		if (!tuplestore_gettupleslot(winstate->buffer, true, true,
-									 winstate->ss.ss_ScanTupleSlot))
-			elog(ERROR, "unexpected end of tuplestore");
-		if (!are_peers(winstate, winstate->temp_slot_2,
-					   winstate->ss.ss_ScanTupleSlot))
-		{
-			winstate->currentgroup++;
-			winstate->groupheadpos = winstate->currentpos;
-			winstate->grouptail_valid = false;
+			/*
+			 * Evaluate aggregates
+			 */
+			if (winstate->numaggs > 0)
+				eval_windowaggregates(winstate);
 		}
-		ExecClearTuple(winstate->temp_slot_2);
-	}
-	else
-	{
-		if (!tuplestore_gettupleslot(winstate->buffer, true, true,
-									 winstate->ss.ss_ScanTupleSlot))
-			elog(ERROR, "unexpected end of tuplestore");
-	}
 
-	/*
-	 * Evaluate true window functions
-	 */
-	numfuncs = winstate->numfuncs;
-	for (i = 0; i < numfuncs; i++)
-	{
-		WindowStatePerFunc perfuncstate = &(winstate->perfunc[i]);
+		/*
+		 * If we have created auxiliary read pointers for the frame or group
+		 * boundaries, force them to be kept up-to-date, because we don't know
+		 * whether the window function(s) will do anything that requires that.
+		 * Failing to advance the pointers would result in being unable to
+		 * trim data from the tuplestore, which is bad.  (If we could know in
+		 * advance whether the window functions will use frame boundary info,
+		 * we could skip creating these pointers in the first place ... but
+		 * unfortunately the window function API doesn't require that.)
+		 */
+		if (winstate->framehead_ptr >= 0)
+			update_frameheadpos(winstate);
+		if (winstate->frametail_ptr >= 0)
+			update_frametailpos(winstate);
+		if (winstate->grouptail_ptr >= 0)
+			update_grouptailpos(winstate);
 
-		if (perfuncstate->plain_agg)
-			continue;
-		eval_windowfunction(winstate, perfuncstate,
-							&(econtext->ecxt_aggvalues[perfuncstate->wfuncstate->wfuncno]),
-							&(econtext->ecxt_aggnulls[perfuncstate->wfuncstate->wfuncno]));
-	}
+		/*
+		 * Truncate any no-longer-needed rows from the tuplestore.
+		 */
+		tuplestore_trim(winstate->buffer);
 
-	/*
-	 * Evaluate aggregates
-	 */
-	if (winstate->numaggs > 0)
-		eval_windowaggregates(winstate);
+		/*
+		 * Form and return a projection tuple using the windowfunc results and
+		 * the current row.  Setting ecxt_outertuple arranges that any Vars
+		 * will be evaluated with respect to that row.
+		 */
+		econtext->ecxt_outertuple = winstate->ss.ss_ScanTupleSlot;
 
-	/*
-	 * If we have created auxiliary read pointers for the frame or group
-	 * boundaries, force them to be kept up-to-date, because we don't know
-	 * whether the window function(s) will do anything that requires that.
-	 * Failing to advance the pointers would result in being unable to trim
-	 * data from the tuplestore, which is bad.  (If we could know in advance
-	 * whether the window functions will use frame boundary info, we could
-	 * skip creating these pointers in the first place ... but unfortunately
-	 * the window function API doesn't require that.)
-	 */
-	if (winstate->framehead_ptr >= 0)
-		update_frameheadpos(winstate);
-	if (winstate->frametail_ptr >= 0)
-		update_frametailpos(winstate);
-	if (winstate->grouptail_ptr >= 0)
-		update_grouptailpos(winstate);
+		slot = ExecProject(winstate->ss.ps.ps_ProjInfo);
 
-	/*
-	 * Truncate any no-longer-needed rows from the tuplestore.
-	 */
-	tuplestore_trim(winstate->buffer);
+		if (winstate->status == WINDOWAGG_RUN)
+		{
+			econtext->ecxt_scantuple = slot;
 
-	/*
-	 * Form and return a projection tuple using the windowfunc results and the
-	 * current row.  Setting ecxt_outertuple arranges that any Vars will be
-	 * evaluated with respect to that row.
-	 */
-	econtext->ecxt_outertuple = winstate->ss.ss_ScanTupleSlot;
+			/*
+			 * Now evaluate the run condition to see if we need to go into
+			 * pass-through mode, or maybe stop completely.
+			 */
+			if (!ExecQual(winstate->runcondition, econtext))
+			{
+				/*
+				 * Determine which mode to move into.  If there is no
+				 * PARTITION BY clause and we're the top-level WindowAgg then
+				 * we're done.  This tuple and any future tuples cannot
+				 * possibly match the runcondition.  However, when there is a
+				 * PARTITION BY clause or we're not the top-level window we
+				 * can't just stop as we need to either process other
+				 * partitions or ensure WindowAgg nodes above us receive all
+				 * of the tuples they need to process their WindowFuncs.
+				 */
+				if (winstate->use_pass_through)
+				{
+					/*
+					 * STRICT pass-through mode is required for the top window
+					 * when there is a PARTITION BY clause.  Otherwise we must
+					 * ensure we store tuples that don't match the
+					 * runcondition so they're available to WindowAggs above.
+					 */
+					if (winstate->top_window)
+					{
+						winstate->status = WINDOWAGG_PASSTHROUGH_STRICT;
+						continue;
+					}
+					else
+						winstate->status = WINDOWAGG_PASSTHROUGH;
+				}
+				else
+				{
+					/*
+					 * Pass-through not required.  We can just return NULL.
+					 * Nothing else will match the runcondition.
+					 */
+					winstate->status = WINDOWAGG_DONE;
+					return NULL;
+				}
+			}
+
+			/*
+			 * Filter out any tuples we don't need in the top-level WindowAgg.
+			 */
+			if (!ExecQual(winstate->ss.ps.qual, econtext))
+			{
+				InstrCountFiltered1(winstate, 1);
+				continue;
+			}
+
+			break;
+		}
+
+		/*
+		 * When not in WINDOWAGG_RUN mode, we must still return this tuple if
+		 * we're anything apart from the top window.
+		 */
+		else if (!winstate->top_window)
+			break;
+	}
 
-	return ExecProject(winstate->ss.ps.ps_ProjInfo);
+	return slot;
 }
 
 /* -----------------
@@ -2300,12 +2403,35 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 							  "WindowAgg Aggregates",
 							  ALLOCSET_DEFAULT_SIZES);
 
+	/* Only the top-level WindowAgg may have a qual */
+	Assert(node->plan.qual == NIL || node->topWindow);
+
+	/* Initialize the qual */
+	winstate->ss.ps.qual = ExecInitQual(node->plan.qual,
+										(PlanState *) winstate);
+
+	/*
+	 * Set up the run condition, if we received one from the query planner.
+	 * When set, this may allow us to move into pass-through mode so that we
+	 * don't have to perform any further evaluation of WindowFuncs in the
+	 * current partition or possibly stop returning tuples altogether when all
+	 * tuples are in the same partition.
+	 */
+	if (node->runCondition != NIL)
+		winstate->runcondition = ExecInitQual(node->runCondition,
+											  (PlanState *) winstate);
+	else
+		winstate->runcondition = NULL;
+
 	/*
-	 * WindowAgg nodes never have quals, since they can only occur at the
-	 * logical top level of a query (ie, after any WHERE or HAVING filters)
+	 * When we're not the top-level WindowAgg node or we are but have a
+	 * PARTITION BY clause we must move into one of the WINDOWAGG_PASSTHROUGH*
+	 * modes when the runCondition becomes false.
 	 */
-	Assert(node->plan.qual == NIL);
-	winstate->ss.ps.qual = NULL;
+	winstate->use_pass_through = !node->topWindow || node->partNumCols > 0;
+
+	/* remember whether we're the top-level window or are below it */
+	winstate->top_window = node->topWindow;
 
 	/*
 	 * initialize child nodes
@@ -2500,6 +2626,9 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 		winstate->agg_winobj = agg_winobj;
 	}
 
+	/* Set the status to running */
+	winstate->status = WINDOWAGG_RUN;
+
 	/* copy frame options to state node for easy access */
 	winstate->frameOptions = frameOptions;
 
@@ -2579,7 +2708,7 @@ ExecReScanWindowAgg(WindowAggState *node)
 	PlanState  *outerPlan = outerPlanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 
-	node->all_done = false;
+	node->status = WINDOWAGG_RUN;
 	node->all_first = true;
 
 	/* release tuplestore et al */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 46a1943d97..6f56b269ce 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1104,11 +1104,14 @@ _copyWindowAgg(const WindowAgg *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runCondition);
+	COPY_NODE_FIELD(runConditionOrig);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
 	COPY_SCALAR_FIELD(inRangeAsc);
 	COPY_SCALAR_FIELD(inRangeNullsFirst);
+	COPY_SCALAR_FIELD(topWindow);
 
 	return newnode;
 }
@@ -3061,6 +3064,7 @@ _copyWindowClause(const WindowClause *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runCondition);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 1f765f42c9..4b4f380bba 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -3234,6 +3234,7 @@ _equalWindowClause(const WindowClause *a, const WindowClause *b)
 	COMPARE_SCALAR_FIELD(frameOptions);
 	COMPARE_NODE_FIELD(startOffset);
 	COMPARE_NODE_FIELD(endOffset);
+	COMPARE_NODE_FIELD(runCondition);
 	COMPARE_SCALAR_FIELD(startInRangeFunc);
 	COMPARE_SCALAR_FIELD(endInRangeFunc);
 	COMPARE_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 13e1643530..d5f5e76c55 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -829,11 +829,14 @@ _outWindowAgg(StringInfo str, const WindowAgg *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runCondition);
+	WRITE_NODE_FIELD(runConditionOrig);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
 	WRITE_BOOL_FIELD(inRangeAsc);
 	WRITE_BOOL_FIELD(inRangeNullsFirst);
+	WRITE_BOOL_FIELD(topWindow);
 }
 
 static void
@@ -2283,6 +2286,8 @@ _outWindowAggPath(StringInfo str, const WindowAggPath *node)
 
 	WRITE_NODE_FIELD(subpath);
 	WRITE_NODE_FIELD(winclause);
+	WRITE_NODE_FIELD(qual);
+	WRITE_BOOL_FIELD(topwindow);
 }
 
 static void
@@ -3293,6 +3298,7 @@ _outWindowClause(StringInfo str, const WindowClause *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runCondition);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 48f7216c9e..3d150cb25d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -384,6 +384,7 @@ _readWindowClause(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runCondition);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
@@ -2576,11 +2577,14 @@ _readWindowAgg(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runCondition);
+	READ_NODE_FIELD(runConditionOrig);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
 	READ_BOOL_FIELD(inRangeAsc);
 	READ_BOOL_FIELD(inRangeNullsFirst);
+	READ_BOOL_FIELD(topWindow);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 169b1d53fc..91fcb2114c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/supportnodes.h"
 #ifdef OPTIMIZER_DEBUG
 #include "nodes/print.h"
 #endif
@@ -2157,6 +2158,275 @@ has_multiple_baserels(PlannerInfo *root)
 	return false;
 }
 
+/*
+ * find_window_run_conditions
+ *		Determine if 'wfunc' is really a WindowFunc and call its prosupport
+ *		function to determine the function's monotonic properties.  We then
+ *		see if 'opexpr' can be used to short-circuit execution.
+ *
+ * For example, row_number() over (order by ...) always produces a value one
+ * higher than the previous.  If someone has a window function such as that
+ * in a subquery and just wants, say, all rows with a row number less than or
+ * equal to 10, then we may as well stop processing the windowagg once the row
+ * number reaches 11.  Here we check if 'opexpr' might help us to stop doing
+ * needless extra processing in WindowAgg nodes.
+ *
+ * '*keep_original' is set to true if the caller should also use 'opexpr' for
+ * its original purpose.  This is set to false if the caller can assume that
+ * the run condition will handle all of the required filtering.
+ *
+ * Returns true if 'opexpr' was found to be useful and was added to the
+ * WindowClause's runCondition.  We also set *keep_original accordingly.
+ * If the 'opexpr' cannot be used then we set *keep_original to true and
+ * return false.
+ */
+static bool
+find_window_run_conditions(Query *subquery, RangeTblEntry *rte, Index rti,
+						   AttrNumber attno, WindowFunc *wfunc, OpExpr *opexpr,
+						   bool wfunc_left, bool *keep_original)
+{
+	Oid			prosupport;
+	Expr	   *otherexpr;
+	SupportRequestWFuncMonotonic req;
+	SupportRequestWFuncMonotonic *res;
+	WindowClause *wclause;
+	List	   *opinfos;
+	OpExpr	   *runopexpr;
+	Oid			runoperator;
+	ListCell   *lc;
+
+	*keep_original = true;
+
+	while (IsA(wfunc, RelabelType))
+		wfunc = (WindowFunc *) ((RelabelType *) wfunc)->arg;
+
+	/* we can only work with window functions */
+	if (!IsA(wfunc, WindowFunc))
+		return false;
+
+	prosupport = get_func_support(wfunc->winfnoid);
+
+	/* Check if there's a support function for 'wfunc' */
+	if (!OidIsValid(prosupport))
+		return false;
+
+	/* get the Expr from the other side of the OpExpr */
+	if (wfunc_left)
+		otherexpr = lsecond(opexpr->args);
+	else
+		otherexpr = linitial(opexpr->args);
+
+	/*
+	 * The value being compared must not change during the evaluation of the
+	 * window partition.
+	 */
+	if (!is_pseudo_constant_clause((Node *) otherexpr))
+		return false;
+
+	/* find the window clause belonging to the window function */
+	wclause = (WindowClause *) list_nth(subquery->windowClause,
+										wfunc->winref - 1);
+
+	req.type = T_SupportRequestWFuncMonotonic;
+	req.window_func = wfunc;
+	req.window_clause = wclause;
+
+	/* call the support function */
+	res = (SupportRequestWFuncMonotonic *)
+		DatumGetPointer(OidFunctionCall1(prosupport,
+										 PointerGetDatum(&req)));
+
+	/*
+	 * Nothing to do if the function is neither monotonically increasing nor
+	 * monotonically decreasing.
+	 */
+	if (res == NULL || res->monotonic == MONOTONICFUNC_NONE)
+		return false;
+
+	runopexpr = NULL;
+	runoperator = InvalidOid;
+	opinfos = get_op_btree_interpretation(opexpr->opno);
+
+	foreach(lc, opinfos)
+	{
+		OpBtreeInterpretation *opinfo = (OpBtreeInterpretation *) lfirst(lc);
+		int			strategy = opinfo->strategy;
+
+		/* handle < / <= */
+		if (strategy == BTLessStrategyNumber ||
+			strategy == BTLessEqualStrategyNumber)
+		{
+			/*
+			 * < / <= is supported for monotonically increasing functions in
+			 * the form <wfunc> op <pseudoconst> and <pseudoconst> op <wfunc>
+			 * for monotonically decreasing functions.
+			 */
+			if ((wfunc_left && (res->monotonic & MONOTONICFUNC_INCREASING)) ||
+				(!wfunc_left && (res->monotonic & MONOTONICFUNC_DECREASING)))
+			{
+				/*
+				 * The run condition handles all of the required filtering
+				 * here, so the caller need not keep the original qual.  With
+				 * a PARTITION BY clause, the top-level WindowAgg skips
+				 * tuples seen while in pass-through mode.
+				 */
+				*keep_original = false;
+				runopexpr = opexpr;
+				runoperator = opexpr->opno;
+			}
+			break;
+		}
+		/* handle > / >= */
+		else if (strategy == BTGreaterStrategyNumber ||
+				 strategy == BTGreaterEqualStrategyNumber)
+		{
+			/*
+			 * > / >= is supported for monotonically decreasing functions in
+			 * the form <wfunc> op <pseudoconst> and <pseudoconst> op <wfunc>
+			 * for monotonically increasing functions.
+			 */
+			if ((wfunc_left && (res->monotonic & MONOTONICFUNC_DECREASING)) ||
+				(!wfunc_left && (res->monotonic & MONOTONICFUNC_INCREASING)))
+			{
+				*keep_original = false;
+				runopexpr = opexpr;
+				runoperator = opexpr->opno;
+			}
+			break;
+		}
+		/* handle = */
+		else if (strategy == BTEqualStrategyNumber)
+		{
+			int16		newstrategy;
+
+			/*
+			 * When both monotonically increasing and decreasing, the window
+			 * function returns the same value each time, so we can use
+			 * 'opexpr' unmodified as the run condition.
+			 */
+			if ((res->monotonic & MONOTONICFUNC_BOTH) == MONOTONICFUNC_BOTH)
+			{
+				*keep_original = false;
+				runopexpr = opexpr;
+				runoperator = opexpr->opno;
+				break;
+			}
+
+			/*
+			 * When monotonically increasing we make a qual with <wfunc> <=
+			 * <value> or <value> >= <wfunc> in order to filter out values
+			 * which are above the value in the equality condition.  For
+			 * monotonically decreasing functions we want to filter values
+			 * below the value in the equality condition.
+			 */
+			if (res->monotonic & MONOTONICFUNC_INCREASING)
+				newstrategy = wfunc_left ? BTLessEqualStrategyNumber : BTGreaterEqualStrategyNumber;
+			else
+				newstrategy = wfunc_left ? BTGreaterEqualStrategyNumber : BTLessEqualStrategyNumber;
+
+			/* We must keep the original equality qual */
+			*keep_original = true;
+			runopexpr = opexpr;
+
+			/* determine the operator to use for the runCondition qual */
+			runoperator = get_opfamily_member(opinfo->opfamily_id,
+											  opinfo->oplefttype,
+											  opinfo->oprighttype,
+											  newstrategy);
+			break;
+		}
+	}
+
+	if (runopexpr != NULL)
+	{
+		Expr	   *newexpr;
+
+		/*
+		 * Build the qual required for the run condition keeping the
+		 * WindowFunc on the same side as it was originally.
+		 */
+		if (wfunc_left)
+			newexpr = make_opclause(runoperator,
+									runopexpr->opresulttype,
+									runopexpr->opretset, (Expr *) wfunc,
+									otherexpr, runopexpr->opcollid,
+									runopexpr->inputcollid);
+		else
+			newexpr = make_opclause(runoperator,
+									runopexpr->opresulttype,
+									runopexpr->opretset,
+									otherexpr, (Expr *) wfunc,
+									runopexpr->opcollid,
+									runopexpr->inputcollid);
+
+		wclause->runCondition = lappend(wclause->runCondition, newexpr);
+
+		return true;
+	}
+
+	/* unsupported OpExpr */
+	return false;
+}
+
+/*
+ * check_and_push_window_quals
+ *		Check if 'clause' is a qual that can be pushed into a WindowFunc's
+ *		WindowClause as a 'runCondition' qual.  These, when present, allow
+ *		some unnecessary work to be skipped during execution.
+ *
+ * Returns true if the caller still must keep the original qual or false if
+ * the caller can safely ignore the original qual because the WindowAgg node
+ * will use the runCondition to stop returning tuples.
+ */
+static bool
+check_and_push_window_quals(Query *subquery, RangeTblEntry *rte, Index rti,
+							Node *clause)
+{
+	OpExpr	   *opexpr = (OpExpr *) clause;
+	bool		keep_original = true;
+	Var		   *var1;
+	Var		   *var2;
+
+	/* We're only able to use OpExprs with 2 operands */
+	if (!IsA(opexpr, OpExpr))
+		return true;
+
+	if (list_length(opexpr->args) != 2)
+		return true;
+
+	/*
+	 * Check for plain Vars that reference window functions in the subquery.
+	 * If we find any, we'll ask find_window_run_conditions() if 'opexpr' can
+	 * be used as a run condition.
+	 */
+
+	/* Check the left side of the OpExpr */
+	var1 = linitial(opexpr->args);
+	if (IsA(var1, Var) && var1->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var1->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		if (find_window_run_conditions(subquery, rte, rti, tle->resno, wfunc,
+									   opexpr, true, &keep_original))
+			return keep_original;
+	}
+
+	/* and check the right side */
+	var2 = lsecond(opexpr->args);
+	if (IsA(var2, Var) && var2->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var2->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		if (find_window_run_conditions(subquery, rte, rti, tle->resno, wfunc,
+									   opexpr, false, &keep_original))
+			return keep_original;
+	}
+
+	return true;
+}
+
 /*
  * set_subquery_pathlist
  *		Generate SubqueryScan access paths for a subquery RTE
@@ -2245,19 +2515,31 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		foreach(l, rel->baserestrictinfo)
 		{
 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+			Node	   *clause = (Node *) rinfo->clause;
 
 			if (!rinfo->pseudoconstant &&
 				qual_is_pushdown_safe(subquery, rti, rinfo, &safetyInfo))
 			{
-				Node	   *clause = (Node *) rinfo->clause;
-
 				/* Push it down */
 				subquery_push_qual(subquery, rte, rti, clause);
 			}
 			else
 			{
-				/* Keep it in the upper query */
-				upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				/*
+				 * Since we can't push the qual down into the subquery, check
+				 * if it happens to reference a window function.  If so then
+				 * it might be useful to use for the WindowAgg's runCondition.
+				 */
+				if (!subquery->hasWindowFuncs ||
+					check_and_push_window_quals(subquery, rte, rti, clause))
+				{
+					/*
+					 * subquery has no window funcs or the clause is not a
+					 * suitable window run condition qual or it is, but the
+					 * original must also be kept in the upper query.
+					 */
+					upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				}
 			}
 		}
 		rel->baserestrictinfo = upperrestrictlist;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 51591bb812..95476ada0b 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -288,6 +288,7 @@ static WindowAgg *make_windowagg(List *tlist, Index winref,
 								 int frameOptions, Node *startOffset, Node *endOffset,
 								 Oid startInRangeFunc, Oid endInRangeFunc,
 								 Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
+								 List *runCondition, List *qual, bool topWindow,
 								 Plan *lefttree);
 static Group *make_group(List *tlist, List *qual, int numGroupCols,
 						 AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
@@ -2672,6 +2673,9 @@ create_windowagg_plan(PlannerInfo *root, WindowAggPath *best_path)
 						  wc->inRangeColl,
 						  wc->inRangeAsc,
 						  wc->inRangeNullsFirst,
+						  wc->runCondition,
+						  best_path->qual,
+						  best_path->topwindow,
 						  subplan);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6558,7 +6562,7 @@ make_windowagg(List *tlist, Index winref,
 			   int frameOptions, Node *startOffset, Node *endOffset,
 			   Oid startInRangeFunc, Oid endInRangeFunc,
 			   Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
-			   Plan *lefttree)
+			   List *runCondition, List *qual, bool topWindow, Plan *lefttree)
 {
 	WindowAgg  *node = makeNode(WindowAgg);
 	Plan	   *plan = &node->plan;
@@ -6575,17 +6579,20 @@ make_windowagg(List *tlist, Index winref,
 	node->frameOptions = frameOptions;
 	node->startOffset = startOffset;
 	node->endOffset = endOffset;
+	node->runCondition = runCondition;
+	/* a duplicate of the above for EXPLAIN */
+	node->runConditionOrig = runCondition;
 	node->startInRangeFunc = startInRangeFunc;
 	node->endInRangeFunc = endInRangeFunc;
 	node->inRangeColl = inRangeColl;
 	node->inRangeAsc = inRangeAsc;
 	node->inRangeNullsFirst = inRangeNullsFirst;
+	node->topWindow = topWindow;
 
 	plan->targetlist = tlist;
 	plan->lefttree = lefttree;
 	plan->righttree = NULL;
-	/* WindowAgg nodes never have a qual clause */
-	plan->qual = NIL;
+	plan->qual = qual;
 
 	return node;
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b2569c5d0c..b090b087e9 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4190,6 +4190,7 @@ create_one_window_path(PlannerInfo *root,
 {
 	PathTarget *window_target;
 	ListCell   *l;
+	List	   *topqual = NIL;
 
 	/*
 	 * Since each window clause could require a different sort order, we stack
@@ -4214,6 +4215,7 @@ create_one_window_path(PlannerInfo *root,
 		List	   *window_pathkeys;
 		int			presorted_keys;
 		bool		is_sorted;
+		bool		topwindow;
 
 		window_pathkeys = make_pathkeys_for_window(root,
 												   wc,
@@ -4277,10 +4279,21 @@ create_one_window_path(PlannerInfo *root,
 			window_target = output_target;
 		}
 
+		/* mark the final item in the list as the top-level window */
+		topwindow = foreach_current_index(l) == list_length(activeWindows) - 1;
+
+		/*
+		 * Accumulate all of the runConditions from each intermediate
+		 * WindowClause.  The top-level WindowAgg must pass these as a qual so
+		 * that it filters out unwanted tuples correctly.
+		 */
+		if (!topwindow)
+			topqual = list_concat(topqual, wc->runCondition);
+
 		path = (Path *)
 			create_windowagg_path(root, window_rel, path, window_target,
 								  wflists->windowFuncs[wc->winref],
-								  wc);
+								  wc, topwindow ? topqual : NIL, topwindow);
 	}
 
 	add_path(window_rel, path);
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 7519723081..6ea3505646 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -71,6 +71,13 @@ typedef struct
 	double		num_exec;
 } fix_upper_expr_context;
 
+typedef struct
+{
+	PlannerInfo *root;
+	indexed_tlist *subplan_itlist;
+	int			newvarno;
+} fix_windowagg_cond_context;
+
 /*
  * Selecting the best alternative in an AlternativeSubPlan expression requires
  * estimating how many times that expression will be evaluated.  For an
@@ -171,6 +178,9 @@ static List *set_returning_clause_references(PlannerInfo *root,
 											 Plan *topplan,
 											 Index resultRelation,
 											 int rtoffset);
+static List *set_windowagg_runcondition_references(PlannerInfo *root,
+												   List *runcondition,
+												   Plan *plan);
 
 
 /*****************************************************************************
@@ -885,6 +895,18 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 			{
 				WindowAgg  *wplan = (WindowAgg *) plan;
 
+				/*
+				 * Adjust the WindowAgg's run conditions by swapping the
+				 * WindowFunc references out to instead reference the Var in
+				 * the scan slot so that when the executor evaluates the
+				 * runCondition, it receives the WindowFunc's value from the
+				 * slot that the result has just been stored into rather than
+				 * evaluating the WindowFunc all over again.
+				 */
+				wplan->runCondition = set_windowagg_runcondition_references(root,
+																			wplan->runCondition,
+																			(Plan *) wplan);
+
 				set_upper_references(root, plan, rtoffset);
 
 				/*
@@ -896,6 +918,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
 				wplan->endOffset =
 					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+				wplan->runCondition = fix_scan_list(root,
+													wplan->runCondition,
+													rtoffset,
+													NUM_EXEC_TLIST(plan));
+				wplan->runConditionOrig = fix_scan_list(root,
+														wplan->runConditionOrig,
+														rtoffset,
+														NUM_EXEC_TLIST(plan));
 			}
 			break;
 		case T_Result:
@@ -3064,6 +3094,78 @@ set_returning_clause_references(PlannerInfo *root,
 	return rlist;
 }
 
+/*
+ * fix_windowagg_condition_expr_mutator
+ *		Mutator function for replacing WindowFuncs with the corresponding Var
+ *		in the targetlist which references that WindowFunc.
+ */
+static Node *
+fix_windowagg_condition_expr_mutator(Node *node,
+									 fix_windowagg_cond_context *context)
+{
+	if (node == NULL)
+		return NULL;
+
+	if (IsA(node, WindowFunc))
+	{
+		Var		   *newvar;
+
+		newvar = search_indexed_tlist_for_non_var((Expr *) node,
+												  context->subplan_itlist,
+												  context->newvarno);
+		if (newvar)
+			return (Node *) newvar;
+		elog(ERROR, "WindowFunc not found in subplan target lists");
+	}
+
+	return expression_tree_mutator(node,
+								   fix_windowagg_condition_expr_mutator,
+								   (void *) context);
+}
+
+/*
+ * fix_windowagg_condition_expr
+ *		Converts references in 'runcondition' so that any WindowFunc
+ *		references are swapped out for a Var which references the matching
+ *		WindowFunc in 'subplan_itlist'.
+ */
+static List *
+fix_windowagg_condition_expr(PlannerInfo *root,
+							 List *runcondition,
+							 indexed_tlist *subplan_itlist)
+{
+	fix_windowagg_cond_context context;
+
+	context.root = root;
+	context.subplan_itlist = subplan_itlist;
+	context.newvarno = 0;
+
+	return (List *) fix_windowagg_condition_expr_mutator((Node *) runcondition,
+														 &context);
+}
+
+/*
+ * set_windowagg_runcondition_references
+ *		Converts references in 'runcondition' so that any WindowFunc
+ *		references are swapped out for a Var which references the matching
+ *		WindowFunc in 'plan' targetlist.
+ */
+static List *
+set_windowagg_runcondition_references(PlannerInfo *root,
+									  List *runcondition,
+									  Plan *plan)
+{
+	List	   *newlist;
+	indexed_tlist *itlist;
+
+	itlist = build_tlist_index(plan->targetlist);
+
+	newlist = fix_windowagg_condition_expr(root, runcondition, itlist);
+
+	pfree(itlist);
+
+	return newlist;
+}
 
 /*****************************************************************************
  *					QUERY DEPENDENCY MANAGEMENT
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1670e54644..11b804093b 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3387,6 +3387,8 @@ create_minmaxagg_path(PlannerInfo *root,
  * 'subpath' is the path representing the source of data
  * 'target' is the PathTarget to be computed
  * 'windowFuncs' is a list of WindowFunc structs
+ * 'qual' is the list of runCondition quals from lower-level WindowAggPaths.
+ *		Must always be NIL when topwindow == false
  * 'winclause' is a WindowClause that is common to all the WindowFuncs
  *
  * The input must be sorted according to the WindowClause's PARTITION keys
@@ -3398,10 +3400,15 @@ create_windowagg_path(PlannerInfo *root,
 					  Path *subpath,
 					  PathTarget *target,
 					  List *windowFuncs,
-					  WindowClause *winclause)
+					  WindowClause *winclause,
+					  List *qual,
+					  bool topwindow)
 {
 	WindowAggPath *pathnode = makeNode(WindowAggPath);
 
+	/* qual can only be set for the topwindow */
+	Assert(qual == NIL || topwindow);
+
 	pathnode->path.pathtype = T_WindowAgg;
 	pathnode->path.parent = rel;
 	pathnode->path.pathtarget = target;
@@ -3416,6 +3423,8 @@ create_windowagg_path(PlannerInfo *root,
 
 	pathnode->subpath = subpath;
 	pathnode->winclause = winclause;
+	pathnode->qual = qual;
+	pathnode->topwindow = topwindow;
 
 	/*
 	 * For costing purposes, assume that there are no redundant partitioning
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index 4a87114a4f..98d4323755 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -24,6 +24,7 @@
 #include "nodes/supportnodes.h"
 #include "optimizer/optimizer.h"
 #include "utils/builtins.h"
+#include "utils/lsyscache.h"
 
 
 typedef struct
@@ -818,6 +819,49 @@ int8dec_any(PG_FUNCTION_ARGS)
 	return int8dec(fcinfo);
 }
 
+/*
+ * int8inc_support
+ *		prosupport function for int8inc() and int8inc_any()
+ */
+Datum
+int8inc_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+		MonotonicFunction monotonic = MONOTONICFUNC_NONE;
+		int			frameOptions = req->window_clause->frameOptions;
+
+		/* With no ORDER BY clause, all rows are peers */
+		if (req->window_clause->orderClause == NIL)
+			monotonic = MONOTONICFUNC_BOTH;
+		else
+		{
+			/*
+			 * Otherwise take into account the frame options.  When the frame
+			 * bound is the start of the window then the resulting value can
+			 * never decrease, therefore is monotonically increasing
+			 */
+			if (frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING)
+				monotonic |= MONOTONICFUNC_INCREASING;
+
+			/*
+			 * Likewise, if the frame bound is the end of the window then the
+			 * resulting value can never increase.
+			 */
+			if (frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING)
+				monotonic |= MONOTONICFUNC_DECREASING;
+		}
+
+		req->monotonic = monotonic;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 
 Datum
 int8larger(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/adt/windowfuncs.c b/src/backend/utils/adt/windowfuncs.c
index 3e0cc9be1a..596564fa15 100644
--- a/src/backend/utils/adt/windowfuncs.c
+++ b/src/backend/utils/adt/windowfuncs.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "nodes/supportnodes.h"
 #include "utils/builtins.h"
 #include "windowapi.h"
 
@@ -88,6 +89,26 @@ window_row_number(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(curpos + 1);
 }
 
+/*
+ * window_row_number_support
+ *		prosupport function for window_row_number()
+ */
+Datum
+window_row_number_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* row_number() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
 
 /*
  * rank
@@ -110,6 +131,27 @@ window_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_rank_support
+ *		prosupport function for window_rank()
+ */
+Datum
+window_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* rank() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * dense_rank
  * Rank increases by 1 when key columns change.
@@ -130,6 +172,27 @@ window_dense_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_dense_rank_support
+ *		prosupport function for window_dense_rank()
+ */
+Datum
+window_dense_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* dense_rank() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * percent_rank
  * return fraction between 0 and 1 inclusive,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index e8f89a7b18..1525a96a5b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6637,11 +6637,16 @@
 # count has two forms: count(any) and count(*)
 { oid => '2147',
   descr => 'number of input rows for which the input expression is not null',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => 'any', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => 'any',
+  prosrc => 'aggregate_dummy' },
 { oid => '2803', descr => 'number of input rows',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => '', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'aggregate_dummy' },
+{ oid => '8802', descr => 'planner support for count run condition',
+  proname => 'int8inc_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'int8inc_support' },
 
 { oid => '2718',
   descr => 'population variance of bigint input values (square of the population standard deviation)',
@@ -10161,14 +10166,26 @@
 
 # SQL-spec window functions
 { oid => '3100', descr => 'row number within partition',
-  proname => 'row_number', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_row_number' },
+  proname => 'row_number', prosupport => 'window_row_number_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8',
+  proargtypes => '', prosrc => 'window_row_number' },
+{ oid => '8799', descr => 'planner support for row_number run condition',
+  proname => 'window_row_number_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_row_number_support' },
 { oid => '3101', descr => 'integer rank with gaps',
-  proname => 'rank', prokind => 'w', proisstrict => 'f', prorettype => 'int8',
+  proname => 'rank', prosupport => 'window_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8',
   proargtypes => '', prosrc => 'window_rank' },
+{ oid => '8800', descr => 'planner support for rank run condition',
+  proname => 'window_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_rank_support' },
 { oid => '3102', descr => 'integer rank without gaps',
-  proname => 'dense_rank', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_dense_rank' },
+  proname => 'dense_rank', prosupport => 'window_dense_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'window_dense_rank' },
+{ oid => '8801', descr => 'planner support for dense rank run condition',
+  proname => 'window_dense_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_dense_rank_support' },
 { oid => '3103', descr => 'fractional rank within partition',
   proname => 'percent_rank', prokind => 'w', proisstrict => 'f',
   prorettype => 'float8', proargtypes => '', prosrc => 'window_percent_rank' },
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cbbcff81d2..94b191f8ae 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2406,6 +2406,18 @@ typedef struct AggState
 typedef struct WindowStatePerFuncData *WindowStatePerFunc;
 typedef struct WindowStatePerAggData *WindowStatePerAgg;
 
+/*
+ * WindowAggStatus -- Used to track the status of WindowAggState
+ */
+typedef enum WindowAggStatus
+{
+	WINDOWAGG_DONE,				/* No more processing to do */
+	WINDOWAGG_RUN,				/* Normal processing of window funcs */
+	WINDOWAGG_PASSTHROUGH,		/* Don't eval window funcs */
+	WINDOWAGG_PASSTHROUGH_STRICT	/* Pass-through plus don't store new
+									 * tuples during spool */
+} WindowAggStatus;
+
 typedef struct WindowAggState
 {
 	ScanState	ss;				/* its first field is NodeTag */
@@ -2432,6 +2444,7 @@ typedef struct WindowAggState
 	struct WindowObjectData *agg_winobj;	/* winobj for aggregate fetches */
 	int64		aggregatedbase; /* start row for current aggregates */
 	int64		aggregatedupto; /* rows before this one are aggregated */
+	WindowAggStatus status;		/* run status of WindowAggState */
 
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	ExprState  *startOffset;	/* expression for starting bound offset */
@@ -2458,8 +2471,17 @@ typedef struct WindowAggState
 	MemoryContext curaggcontext;	/* current aggregate's working data */
 	ExprContext *tmpcontext;	/* short-term evaluation context */
 
+	ExprState  *runcondition;	/* Condition which must remain true otherwise
+								 * execution of the WindowAgg will finish or
+								 * go into pass-through mode.  NULL when there
+								 * is no such condition. */
+
+	bool		use_pass_through;	/* When false, stop execution when
+									 * runcondition is no longer true.  Else
+									 * just stop evaluating window funcs. */
+	bool		top_window;		/* true if this is the top-most WindowAgg or
+								 * the only WindowAgg in this query level */
 	bool		all_first;		/* true if the scan is starting */
-	bool		all_done;		/* true if the scan is finished */
 	bool		partition_spooled;	/* true if all tuples in current partition
 									 * have been spooled into tuplestore */
 	bool		more_partitions;	/* true if there's more partitions after
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 300824258e..340d28f4e1 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -560,7 +560,8 @@ typedef enum NodeTag
 	T_SupportRequestSelectivity,	/* in nodes/supportnodes.h */
 	T_SupportRequestCost,		/* in nodes/supportnodes.h */
 	T_SupportRequestRows,		/* in nodes/supportnodes.h */
-	T_SupportRequestIndexCondition	/* in nodes/supportnodes.h */
+	T_SupportRequestIndexCondition, /* in nodes/supportnodes.h */
+	T_SupportRequestWFuncMonotonic	/* in nodes/supportnodes.h */
 } NodeTag;
 
 /*
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 8998d34560..b2cdf8249f 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1402,6 +1402,7 @@ typedef struct WindowClause
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runCondition;	/* qual to help short-circuit execution */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
 	Oid			inRangeColl;	/* collation for in_range tests */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 6cbcb67bdf..c5ab53e05c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1843,6 +1843,9 @@ typedef struct WindowAggPath
 	Path		path;
 	Path	   *subpath;		/* path representing input source */
 	WindowClause *winclause;	/* WindowClause we'll be using */
+	List	   *qual;			/* lower-level WindowAgg runconditions */
+	bool		topwindow;		/* false for all apart from the WindowAgg
+								 * that's closest to the root of the plan */
 } WindowAggPath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 10dd35f011..e43e360d9b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -926,12 +926,16 @@ typedef struct WindowAgg
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runCondition;	/* qual to help short-circuit execution */
+	List	   *runConditionOrig;	/* runCondition for display in EXPLAIN */
 	/* these fields are used with RANGE offset PRECEDING/FOLLOWING: */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
 	Oid			inRangeColl;	/* collation for in_range tests */
 	bool		inRangeAsc;		/* use ASC sort order for in_range tests? */
 	bool		inRangeNullsFirst;	/* nulls sort first for in_range tests? */
+	bool		topWindow;		/* false for all apart from the WindowAgg
+								 * that's closest to the root of the plan */
 } WindowAgg;
 
 /* ----------------
@@ -1324,4 +1328,21 @@ typedef struct PlanInvalItem
 	uint32		hashValue;		/* hash value of object's cache lookup key */
 } PlanInvalItem;
 
+/*
+ * MonotonicFunction
+ *
+ * Allows the planner to track monotonic properties of functions.  A function
+ * is monotonically increasing if a subsequent call cannot yield a lower value
+ * than the previous call.  A monotonically decreasing function cannot yield a
+ * higher value on subsequent calls, and a function which is both must return
+ * the same value on each call.
+ */
+typedef enum MonotonicFunction
+{
+	MONOTONICFUNC_NONE = 0,
+	MONOTONICFUNC_INCREASING = (1 << 0),
+	MONOTONICFUNC_DECREASING = (1 << 1),
+	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING
+} MonotonicFunction;
+
 #endif							/* PLANNODES_H */
diff --git a/src/include/nodes/supportnodes.h b/src/include/nodes/supportnodes.h
index 88b61b3ab3..bdd43fc614 100644
--- a/src/include/nodes/supportnodes.h
+++ b/src/include/nodes/supportnodes.h
@@ -33,12 +33,12 @@
 #ifndef SUPPORTNODES_H
 #define SUPPORTNODES_H
 
-#include "nodes/primnodes.h"
+#include "nodes/plannodes.h"
 
 struct PlannerInfo;				/* avoid including pathnodes.h here */
 struct IndexOptInfo;
 struct SpecialJoinInfo;
-
+struct WindowClause;
 
 /*
  * The Simplify request allows the support function to perform plan-time
@@ -239,4 +239,64 @@ typedef struct SupportRequestIndexCondition
 								 * equivalent of the function call */
 } SupportRequestIndexCondition;
 
+/* ----------
+ * To allow more efficient execution of any monotonically increasing and/or
+ * monotonically decreasing window functions, we support calling the window
+ * function's prosupport function passing along this struct whenever the
+ * planner sees an OpExpr qual directly reference a window function in a
+ * subquery.  When the planner encounters this, we populate this struct and
+ * pass it along to the window function's prosupport function so that it can
+ * evaluate whether the given WindowFunc is:
+ *
+ * a) monotonically increasing, or
+ * b) monotonically decreasing, or
+ * c) both monotonically increasing and decreasing, or
+ * d) none of the above.
+ *
+ * A function that is monotonically increasing can never return a value that
+ * is lower than a value returned in a "previous call".  A monotonically
+ * decreasing function can never return a value higher than a value returned
+ * in a previous call.  A function that is both must return the same value
+ * each time.
+ *
+ * We define "previous call" to mean a previous call to the same WindowFunc
+ * struct in the same window partition.
+ *
+ * row_number() is an example of a monotonically increasing function.  The
+ * return value will be reset back to 1 in each new partition.  An example of
+ * a monotonically increasing and decreasing function is COUNT(*) OVER ().
+ * Since there is no ORDER BY clause in this example, all rows in the
+ * partition are peers and all rows within the partition will be within the
+ * frame bound.  Likewise for COUNT(*) OVER(ORDER BY a ROWS BETWEEN UNBOUNDED
+ * PRECEDING AND UNBOUNDED FOLLOWING).
+ *
+ * COUNT(*) OVER (ORDER BY a ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
+ * is an example of a monotonically decreasing function.
+ *
+ * Implementations must only concern themselves with the given WindowFunc
+ * being monotonic in a single partition.
+ *
+ * Inputs:
+ *	'window_func' is the pointer to the window function being called.
+ *
+ *	'window_clause' is a pointer to the WindowClause data.  Support functions
+ *	can use this to check frame bounds, etc.
+ *
+ * Outputs:
+ *	'monotonic' the resulting MonotonicFunction value for the given input
+ *	window function and window clause.
+ * ----------
+ */
+typedef struct SupportRequestWFuncMonotonic
+{
+	NodeTag		type;
+
+	/* Input fields: */
+	WindowFunc *window_func;	/* Pointer to the window function data */
+	struct WindowClause *window_clause; /* Pointer to the window clause data */
+
+	/* Output fields: */
+	MonotonicFunction monotonic;
+} SupportRequestWFuncMonotonic;
+
 #endif							/* SUPPORTNODES_H */
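Editorial aside, not part of the patch: the monotonic properties described in the supportnodes.h comment above determine which comparison quals can act as a run condition. The sketch below is a standalone illustration of that reasoning, not the planner code from this patch. CmpOp and runcondition_for() are hypothetical names invented for the example, and the qual is assumed to have already been normalised to the form "wfunc OP constant". The MonotonicFunction enum is copied from the patch so the block compiles on its own.

```c
#include <stdbool.h>

/* Copied from the patch's plannodes.h hunk so this sketch is standalone. */
typedef enum MonotonicFunction
{
	MONOTONICFUNC_NONE = 0,
	MONOTONICFUNC_INCREASING = (1 << 0),
	MONOTONICFUNC_DECREASING = (1 << 1),
	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING
} MonotonicFunction;

/* Hypothetical: the operator of a qual written as "wfunc OP constant". */
typedef enum CmpOp
{
	CMP_LT, CMP_LE, CMP_EQ, CMP_GE, CMP_GT
} CmpOp;

/*
 * Hypothetical helper.  Returns true when "wfunc OP constant" can drive a
 * run condition for a window function with monotonic property 'm'.
 * On success, *runop is the operator the run condition should use and
 * *keep_filter says whether the original qual must also stay as a filter.
 */
static bool
runcondition_for(MonotonicFunction m, CmpOp op, CmpOp *runop,
				 bool *keep_filter)
{
	switch (op)
	{
		case CMP_LT:
		case CMP_LE:
			/* Once an increasing value exceeds the bound it stays above it. */
			if ((m & MONOTONICFUNC_INCREASING) == 0)
				return false;
			*runop = op;
			*keep_filter = false;
			return true;

		case CMP_GT:
		case CMP_GE:
			/* Once a decreasing value drops below the bound it stays below. */
			if ((m & MONOTONICFUNC_DECREASING) == 0)
				return false;
			*runop = op;
			*keep_filter = false;
			return true;

		case CMP_EQ:
			/*
			 * Equality cannot be used directly: rn = 10 is false for rn = 1.
			 * Widen it to <= (or >=) and keep the original qual as a filter.
			 */
			if (m & MONOTONICFUNC_INCREASING)
				*runop = CMP_LE;
			else if (m & MONOTONICFUNC_DECREASING)
				*runop = CMP_GE;
			else
				return false;
			*keep_filter = true;
			return true;
	}
	return false;
}
```

Under these assumptions, row_number() with "rn = 10" yields a "<=" run condition plus the retained filter, which matches the dense_rank() "dr = 1" plan in the regression tests below. An increasing count(*) with "3 <= c" (that is, "c >= 3") yields no run condition, as in the "monotonic properties don't match" test.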
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 6eca547af8..d2d46b15df 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -245,7 +245,9 @@ extern WindowAggPath *create_windowagg_path(PlannerInfo *root,
 											Path *subpath,
 											PathTarget *target,
 											List *windowFuncs,
-											WindowClause *winclause);
+											WindowClause *winclause,
+											List *qual,
+											bool topwindow);
 extern SetOpPath *create_setop_path(PlannerInfo *root,
 									RelOptInfo *rel,
 									Path *subpath,
diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out
index bb9ff7f07b..5ed34b9a48 100644
--- a/src/test/regress/expected/window.out
+++ b/src/test/regress/expected/window.out
@@ -3336,6 +3336,404 @@ WHERE depname = 'sales';
                            ->  Seq Scan on empsalary
 (9 rows)
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                  QUERY PLAN                  
+----------------------------------------------
+ WindowAgg
+   Run Condition: (row_number() OVER (?) < 3)
+   ->  Sort
+         Sort Key: empsalary.empno
+         ->  Seq Scan on empsalary
+(5 rows)
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+               QUERY PLAN                
+-----------------------------------------
+ WindowAgg
+   Run Condition: (rank() OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+ empno | salary | r 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 2
+    11 |   5200 | 2
+(3 rows)
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Run Condition: (dense_rank() OVER (?) <= 1)
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+ empno | salary | dr 
+-------+--------+----
+     8 |   6000 |  1
+(1 row)
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                       QUERY PLAN                        
+---------------------------------------------------------
+ WindowAgg
+   Run Condition: (count(empsalary.empno) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) >= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER () c
+   FROM empsalary) emp
+WHERE 11 <= c;
+                 QUERY PLAN                 
+--------------------------------------------
+ WindowAgg
+   Run Condition: (11 <= count(*) OVER (?))
+   ->  Seq Scan on empsalary
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Run Condition: (dense_rank() OVER (?) <= 1)
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+-- Ensure we get a run condition when there's a PARTITION BY clause
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                      QUERY PLAN                      
+------------------------------------------------------
+ WindowAgg
+   Run Condition: (row_number() OVER (?) < 3)
+   ->  Sort
+         Sort Key: empsalary.depname, empsalary.empno
+         ->  Seq Scan on empsalary
+(5 rows)
+
+-- and ensure we get the correct results from the above plan
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+ empno |  depname  | rn 
+-------+-----------+----
+     7 | develop   |  1
+     8 | develop   |  2
+     2 | personnel |  1
+     5 | personnel |  2
+     1 | sales     |  1
+     3 | sales     |  2
+(6 rows)
+
+-- likewise with count(empno) instead of row_number()
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ WindowAgg
+   Run Condition: (count(empsalary.empno) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.depname, empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+-- and again, check the results are what we expect.
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno |  depname  | salary | c 
+-------+-----------+--------+---
+     8 | develop   |   6000 | 1
+    10 | develop   |   5200 | 3
+    11 | develop   |   5200 | 3
+     2 | personnel |   3900 | 1
+     5 | personnel |   3500 | 2
+     1 | sales     |   5000 | 1
+     4 | sales     |   4800 | 3
+     3 | sales     |   4800 | 3
+(8 rows)
+
+-- Some more complex cases with multiple window clauses
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+	SELECT *,
+		count(salary) OVER (PARTITION BY depname || '') c1, -- w1
+		row_number() OVER (PARTITION BY depname) rn, -- w2
+		count(*) OVER (PARTITION BY depname) c2, -- w2
+		count(*) OVER (PARTITION BY '' || depname) c3 -- w3
+	FROM empsalary
+) e WHERE rn <= 1 AND c1 <= 3;
+                                        QUERY PLAN                                         
+-------------------------------------------------------------------------------------------
+ Subquery Scan on e
+   ->  WindowAgg
+         Filter: ((row_number() OVER (?)) <= 1)
+         Run Condition: (count(empsalary.salary) OVER (?) <= 3)
+         ->  Sort
+               Sort Key: (((empsalary.depname)::text || ''::text))
+               ->  WindowAgg
+                     Run Condition: (row_number() OVER (?) <= 1)
+                     ->  Sort
+                           Sort Key: empsalary.depname
+                           ->  WindowAgg
+                                 ->  Sort
+                                       Sort Key: ((''::text || (empsalary.depname)::text))
+                                       ->  Seq Scan on empsalary
+(14 rows)
+
+-- Ensure we correctly filter out all of the run conditions from each window
+SELECT * FROM (
+	SELECT *,
+		count(salary) OVER (PARTITION BY depname || '') c1, -- w1
+		row_number() OVER (PARTITION BY depname) rn, -- w2
+		count(*) OVER (PARTITION BY depname) c2, -- w2
+		count(*) OVER (PARTITION BY '' || depname) c3 -- w3
+	FROM empsalary
+) e WHERE rn <= 1 AND c1 <= 3;
+  depname  | empno | salary | enroll_date | c1 | rn | c2 | c3 
+-----------+-------+--------+-------------+----+----+----+----
+ personnel |     5 |   3500 | 12-10-2007  |  2 |  1 |  2 |  2
+ sales     |     3 |   4800 | 08-01-2007  |  3 |  1 |  3 |  3
+(2 rows)
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+-- Ensure we don't push down when the frame options show that the window
+-- function is not monotonically increasing
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                  QUERY PLAN                   
+-----------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.c <= 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(6 rows)
+
+-- Ensure we don't push down when the window function's monotonic properties
+-- don't match that of the clauses.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+                QUERY PLAN                
+------------------------------------------
+ Subquery Scan on emp
+   Filter: (3 <= emp.c)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary
+               ->  Seq Scan on empsalary
+(6 rows)
+
+-- Ensure we don't pushdown when there are multiple window clauses to evaluate
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY empno DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Filter: ((dense_rank() OVER (?)) <= 1)
+         ->  Sort
+               Sort Key: empsalary.empno DESC
+               ->  WindowAgg
+                     Run Condition: (dense_rank() OVER (?) <= 1)
+                     ->  Sort
+                           Sort Key: empsalary.salary DESC
+                           ->  Seq Scan on empsalary
+(11 rows)
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
diff --git a/src/test/regress/sql/window.sql b/src/test/regress/sql/window.sql
index 41a8e0d152..ec9ee97a45 100644
--- a/src/test/regress/sql/window.sql
+++ b/src/test/regress/sql/window.sql
@@ -988,6 +988,212 @@ SELECT * FROM
    FROM empsalary) emp
 WHERE depname = 'sales';
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER () c
+   FROM empsalary) emp
+WHERE 11 <= c;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+-- Ensure we get a run condition when there's a PARTITION BY clause
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- and ensure we get the correct results from the above plan
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- likewise with count(empno) instead of row_number()
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+-- and again, check the results are what we expect.
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+-- Some more complex cases with multiple window clauses
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+	SELECT *,
+		count(salary) OVER (PARTITION BY depname || '') c1, -- w1
+		row_number() OVER (PARTITION BY depname) rn, -- w2
+		count(*) OVER (PARTITION BY depname) c2, -- w2
+		count(*) OVER (PARTITION BY '' || depname) c3 -- w3
+	FROM empsalary
+) e WHERE rn <= 1 AND c1 <= 3;
+
+-- Ensure we correctly filter out all of the run conditions from each window
+SELECT * FROM (
+	SELECT *,
+		count(salary) OVER (PARTITION BY depname || '') c1, -- w1
+		row_number() OVER (PARTITION BY depname) rn, -- w2
+		count(*) OVER (PARTITION BY depname) c2, -- w2
+		count(*) OVER (PARTITION BY '' || depname) c3 -- w3
+	FROM empsalary
+) e WHERE rn <= 1 AND c1 <= 3;
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+
+-- Ensure we don't push down when the frame options show that the window
+-- function is not monotonically increasing
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+-- Ensure we don't push down when the window function's monotonic properties
+-- don't match that of the clauses.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+
+-- Ensure we don't pushdown when there are multiple window clauses to evaluate
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY empno DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
-- 
2.32.0

#29

Zhihong Yu

zyu@yugabyte.com

almost 4 years ago

In reply to: David Rowley (#28)

Re: Window Function "Run Conditions"

On Wed, Apr 6, 2022 at 7:36 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Wed, 6 Apr 2022 at 00:59, Andy Fan <zhihui.fan1213@gmail.com> wrote:

On Tue, Apr 5, 2022 at 7:49 PM David Rowley <dgrowleyml@gmail.com> wrote:

Yeah, there is more performance to be had than even what you've done
there. There's no reason really for spool_tuples() to do
tuplestore_puttupleslot() when we're not in run mode.

Yeah, this is a great idea.

I've attached an updated patch that does most of what you mentioned.
To make this work I had to add another state to the WindowAggStatus.
This new state is what the top-level WindowAgg will move into when
there's a PARTITION BY clause and the run condition becomes false.
The new state is named WINDOWAGG_PASSTHROUGH_STRICT, which does all
that WINDOWAGG_PASSTHROUGH does plus skips tuplestoring tuples during
the spool. We must still spool those tuples when we're not the
top-level WindowAgg so that we can send those out to any calling
WindowAgg nodes. They'll need those so they return the correct result.

This means that for intermediate WindowAgg nodes, when the
runcondition becomes false, we only skip evaluation of WindowFuncs.
WindowAgg nodes above us cannot reference these, so there's no need to
evaluate them, plus, if there's a run condition then these tuples will
be filtered out in the final WindowAgg node.

For the top-level WindowAgg node, when the run condition becomes false
we can save quite a bit more work. If there's no PARTITION BY clause,
then we're done. Just return NULL. When there is a PARTITION BY
clause we move into WINDOWAGG_PASSTHROUGH_STRICT which allows us to
skip both the evaluation of WindowFuncs and also allows us to consume
tuples from our outer plan until we get a tuple belonging to another
partition. No need to tuplestore these tuples as they're being
filtered out.
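
The choice of state described above can be sketched in a few lines of self-contained C. This is only an illustration, not the patch's code: the enum reuses the status names from the patch, while next_status() and its two boolean parameters are hypothetical names introduced here.

```c
#include <stdbool.h>

/* Status names mirror those in the patch; the ordering here is illustrative. */
typedef enum
{
	WINDOWAGG_DONE,
	WINDOWAGG_RUN,
	WINDOWAGG_PASSTHROUGH,
	WINDOWAGG_PASSTHROUGH_STRICT
} WindowAggStatus;

/*
 * Hypothetical helper: pick the status to move into once the run condition
 * has become false for the current tuple.
 */
WindowAggStatus
next_status(bool top_window, bool has_partition_by)
{
	/* Intermediate nodes must keep handing every tuple to the node above */
	if (!top_window)
		return WINDOWAGG_PASSTHROUGH;

	/* Top-level with partitions: discard tuples until the next partition */
	if (has_partition_by)
		return WINDOWAGG_PASSTHROUGH_STRICT;

	/* Top-level without partitions: no later tuple can match */
	return WINDOWAGG_DONE;
}
```

In this sketch only the top-level node without a PARTITION BY clause can stop outright; every other combination has to keep consuming input.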

Since intermediate WindowAggs cannot filter tuples, all the filtering
must occur in the top-level WindowAgg. This cannot be done by way of
the run condition, because the run condition is special: once it
becomes false, we don't check again to see if it has become true. A
sort node between the WindowAggs can change the tuple order (i.e.
previously monotonic values may no longer be monotonic), so it's only
valid to evaluate a run condition in the WindowAgg node it was
intended for. To filter out, in the top-level WindowAgg, the tuples
that don't match the run conditions of intermediate WindowAggs, I've
introduced quals for WindowAgg nodes. This means that we can now see
"Filter" and "Rows Removed by Filter" in EXPLAIN for WindowAgg.

Why didn't I just do the filtering in the outer query, as was
happening before? The problem is that when we push the quals down
into the subquery, we don't yet know the order in which the
WindowAggs will be evaluated. Only run conditions from
intermediate WindowAggs will ever make it into the Filter, and we
don't know which one the top-level WindowAgg will be until later in
planning. To do the filtering in the outer query we'd need to push
quals back out of the subquery again. It seems to me to be easier and
better to filter them out lower down in the plan.

Since the top-level WindowAgg node can now filter tuples, the executor
node had to be given a for(;;) loop so that it goes around again for
another tuple after it filters a tuple out.
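
To make the "go around again" behaviour concrete, here is a small self-contained C simulation of a top-level WindowAgg computing row_number() OVER (PARTITION BY key) with a run condition of rn <= limit. It is a sketch under stated assumptions rather than the executor code: RunStats and simulate_row_number() are hypothetical names, and the input is assumed to be already sorted by partition key.

```c
#include <stddef.h>

/* Hypothetical counters for the simulation */
typedef struct
{
	size_t		emitted;	/* tuples returned to the caller */
	size_t		evaluated;	/* times row_number() was computed */
} RunStats;

/*
 * Simulate a top-level WindowAgg over keys[] (sorted by partition key) with
 * the run condition "rn <= limit".
 */
RunStats
simulate_row_number(const int *keys, size_t nkeys, long limit)
{
	RunStats	stats = {0, 0};
	int			running = 1;
	long		rn = 0;

	for (size_t i = 0; i < nkeys; i++)
	{
		/* A new partition resets row_number() and leaves pass-through mode */
		if (i == 0 || keys[i] != keys[i - 1])
		{
			rn = 0;
			running = 1;
		}

		/* Pass-through: discard the tuple without evaluating anything */
		if (!running)
			continue;

		rn++;
		stats.evaluated++;

		/* Run condition became false: skip the rest of this partition */
		if (rn > limit)
		{
			running = 0;
			continue;
		}

		stats.emitted++;
	}

	return stats;
}
```

For the sorted keys {1, 1, 1, 1, 2, 2, 3} and limit 2, the simulation emits five tuples and evaluates row_number() six times: the third tuple of the first partition is what flips the node into pass-through mode, and the fourth is discarded without any evaluation.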

I've also updated the commit message, which I think now makes it quite
clear what we optimise and how it's done.

And I would suggest the below fastpath for this feature.
-   if (check_and_push_window_quals(subquery, rte, rti, clause))
+   if (!subquery->hasWindowFuncs ||
+       check_and_push_window_quals(subquery, rte, rti, clause))

Good idea. Thanks!

David

Hi,

+                * We must keep the original qual in place if there is a
+                * PARTITION BY clause as the top-level WindowAgg remains in
+                * pass-through mode and does nothing to filter out unwanted
+                * tuples.
+                */
+               *keep_original = false;

The comment talks about keeping original qual but the assignment uses the
value false.
Maybe the comment can be rephrased so that it matches the assignment.

Cheers

#30

David Rowley

dgrowleyml@gmail.com

almost 4 years ago

In reply to: Zhihong Yu (#29)

Re: Window Function "Run Conditions"

On Thu, 7 Apr 2022 at 15:41, Zhihong Yu <zyu@yugabyte.com> wrote:

+                * We must keep the original qual in place if there is a
+                * PARTITION BY clause as the top-level WindowAgg remains in
+                * pass-through mode and does nothing to filter out unwanted
+                * tuples.
+                */
+               *keep_original = false;
The comment talks about keeping original qual but the assignment uses the value false.
Maybe the comment can be rephrased so that it matches the assignment.

Thanks. I've just removed that comment locally now. You're right, it
was out of date.

David

#31

David Rowley

dgrowleyml@gmail.com

almost 4 years ago

In reply to: David Rowley (#30)

1 attachment(s)

Re: Window Function "Run Conditions"

On Thu, 7 Apr 2022 at 19:01, David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 7 Apr 2022 at 15:41, Zhihong Yu <zyu@yugabyte.com> wrote:
+                * We must keep the original qual in place if there is a
+                * PARTITION BY clause as the top-level WindowAgg remains in
+                * pass-through mode and does nothing to filter out unwanted
+                * tuples.
+                */
+               *keep_original = false;
The comment talks about keeping original qual but the assignment uses the value false.
Maybe the comment can be rephrased so that it matches the assignment.
Thanks. I've just removed that comment locally now. You're right, it
was out of date.

I've attached the updated patch with the fixed comment and a few other
comments reworded slightly.

I've also pgindented the patch.

Barring any objection, I'm planning to push this one in around 10 hours' time.

David

Attachments:

v7-0001-Teach-planner-and-executor-about-monotonic-window.patch (text/plain; charset=US-ASCII)

From 8714dea4815c41f0b4ee4714785f97787e57ac4c Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Tue, 5 Apr 2022 12:00:40 +1200
Subject: [PATCH v7] Teach planner and executor about monotonic window funcs

Window functions such as row_number() always return a value higher than
the previously returned value for tuples in any given window partition.

Traditionally queries such as;

SELECT * FROM (
   SELECT *, row_number() over (order by c) rn
   FROM t
) t WHERE rn <= 10;

were executed fairly inefficiently.  Neither the query planner nor the
executor knew that once rn made it to 11 that nothing further would match
the outer query's WHERE clause.  It would blindly continue until all
tuples were exhausted from the subquery.

Here we implement means to make the above execute more efficiently.

This is done by way of adding a pg_proc.prosupport function to various of
the built-in window functions and adding supporting code to allow the
support function to inform the planner if the window function is
monotonically increasing, monotonically decreasing, both or neither.  The
planner is then able to make use of that information and possibly allow
the executor to short-circuit execution by way of adding a "run condition"
to the WindowAgg to allow it to determine if some of its execution work
can be skipped.

This "run condition" is not like a normal filter.  These run conditions
are only built using quals comparing values to monotonic window functions.
For monotonically increasing functions, quals making use of the btree
operators for <, <= and = can be used (assuming the window function column
is on the left). You can see here that once such a condition becomes false,
a monotonically increasing function could never subsequently make it true
again.  For monotonically decreasing functions the >, >= and = btree
operators for the given type can be used for run conditions.

The best-case situation for this is when there is a single WindowAgg node
without a PARTITION BY clause.  Here when the run condition becomes false
the WindowAgg node can simply return NULL.  No more tuples will ever match
the run condition.  It's a little more complex when there is a PARTITION
BY clause.  In this case, we cannot return NULL as we must still process
other partitions.  To speed this case up we pull tuples from the outer
plan to check if they're from the same partition and simply discard them
if they are.  When we find a tuple belonging to another partition we start
processing as normal again until the run condition becomes false or we run
out of tuples to process.

When there are multiple WindowAgg nodes to evaluate then this complicates
the situation.  For intermediate WindowAggs we must ensure we always
return all tuples to the calling node.  Any filtering done could lead to
incorrect results in WindowAgg nodes above.  For all intermediate nodes,
we can still save some work when the run condition becomes false.  We've
no need to evaluate the WindowFuncs anymore.  Other WindowAgg nodes cannot
reference the value of these and these tuples will not appear in the final
result anyway.  The savings here are small in comparison to what can be
saved in the top-level WindowAgg, but still worthwhile.

Intermediate WindowAgg nodes never filter out tuples, but here we change
WindowAgg so that the top-level WindowAgg filters out tuples that don't
match the intermediate WindowAgg node's run condition.  Such filters
appear in the "Filter" clause in EXPLAIN for the top-level WindowAgg node.

Here we add prosupport functions to allow the above to work for;
row_number(), rank(), dense_rank(), count(*) and count(expr).  It appears
technically possible to do the same for min() and max(), however, it seems
unlikely to be useful enough, so that's not done here.

Author: David Rowley
Reviewed-by: Andy Fan, Zhihong Yu
Discussion: https://postgr.es/m/CAApHDvqvp3At8++yF8ij06sdcoo1S_b2YoaT9D4Nf+MObzsrLQ@mail.gmail.com
---
 src/backend/commands/explain.c          |   8 +
 src/backend/executor/nodeWindowAgg.c    | 380 ++++++++++++++--------
 src/backend/nodes/copyfuncs.c           |   4 +
 src/backend/nodes/equalfuncs.c          |   1 +
 src/backend/nodes/outfuncs.c            |   6 +
 src/backend/nodes/readfuncs.c           |   4 +
 src/backend/optimizer/path/allpaths.c   | 284 ++++++++++++++++-
 src/backend/optimizer/plan/createplan.c |  13 +-
 src/backend/optimizer/plan/planner.c    |  15 +-
 src/backend/optimizer/plan/setrefs.c    | 102 ++++++
 src/backend/optimizer/util/pathnode.c   |  13 +-
 src/backend/utils/adt/int8.c            |  44 +++
 src/backend/utils/adt/windowfuncs.c     |  63 ++++
 src/include/catalog/pg_proc.dat         |  35 ++-
 src/include/nodes/execnodes.h           |  24 +-
 src/include/nodes/nodes.h               |   3 +-
 src/include/nodes/parsenodes.h          |   1 +
 src/include/nodes/pathnodes.h           |   3 +
 src/include/nodes/plannodes.h           |  21 ++
 src/include/nodes/supportnodes.h        |  64 +++-
 src/include/optimizer/pathnode.h        |   4 +-
 src/test/regress/expected/window.out    | 398 ++++++++++++++++++++++++
 src/test/regress/sql/window.sql         | 206 ++++++++++++
 23 files changed, 1546 insertions(+), 150 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 1e5701b8eb..33d8bf87fb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1988,6 +1988,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_WindowAgg:
+			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+					planstate, es);
+			show_upper_qual(((WindowAgg *) plan)->runConditionOrig,
+							"Run Condition", planstate, ancestors, es);
+			break;
 		case T_Group:
 			show_group_keys(castNode(GroupState, planstate), ancestors, es);
 			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 08ce05ca5a..c7f6a1a2f1 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1248,6 +1248,20 @@ spool_tuples(WindowAggState *winstate, int64 pos)
 	if (winstate->partition_spooled)
 		return;					/* whole partition done already */
 
+	/*
+	 * When in pass-through mode we can just exhaust all tuples in the current
+	 * partition.  We don't need these tuples for any further window function
+	 * evaluation, however, we do need to keep them around if we're not the
+	 * top-level window as another WindowAgg node above must see these.
+	 */
+	if (winstate->status != WINDOWAGG_RUN)
+	{
+		Assert(winstate->status == WINDOWAGG_PASSTHROUGH ||
+			   winstate->status == WINDOWAGG_PASSTHROUGH_STRICT);
+
+		pos = -1;
+	}
+
 	/*
 	 * If the tuplestore has spilled to disk, alternate reading and writing
 	 * becomes quite expensive due to frequent buffer flushes.  It's cheaper
@@ -1256,7 +1270,7 @@ spool_tuples(WindowAggState *winstate, int64 pos)
 	 * XXX this is a horrid kluge --- it'd be better to fix the performance
 	 * problem inside tuplestore.  FIXME
 	 */
-	if (!tuplestore_in_memory(winstate->buffer))
+	else if (!tuplestore_in_memory(winstate->buffer))
 		pos = -1;
 
 	outerPlan = outerPlanState(winstate);
@@ -1295,9 +1309,16 @@ spool_tuples(WindowAggState *winstate, int64 pos)
 			}
 		}
 
-		/* Still in partition, so save it into the tuplestore */
-		tuplestore_puttupleslot(winstate->buffer, outerslot);
-		winstate->spooled_rows++;
+		/*
+		 * Remember the tuple unless we're the top-level window and we're in
+		 * pass-through mode.
+		 */
+		if (winstate->status != WINDOWAGG_PASSTHROUGH_STRICT)
+		{
+			/* Still in partition, so save it into the tuplestore */
+			tuplestore_puttupleslot(winstate->buffer, outerslot);
+			winstate->spooled_rows++;
+		}
 	}
 
 	MemoryContextSwitchTo(oldcontext);
@@ -2023,13 +2044,14 @@ static TupleTableSlot *
 ExecWindowAgg(PlanState *pstate)
 {
 	WindowAggState *winstate = castNode(WindowAggState, pstate);
+	TupleTableSlot *slot;
 	ExprContext *econtext;
 	int			i;
 	int			numfuncs;
 
 	CHECK_FOR_INTERRUPTS();
 
-	if (winstate->all_done)
+	if (winstate->status == WINDOWAGG_DONE)
 		return NULL;
 
 	/*
@@ -2099,143 +2121,224 @@ ExecWindowAgg(PlanState *pstate)
 		winstate->all_first = false;
 	}
 
-	if (winstate->buffer == NULL)
-	{
-		/* Initialize for first partition and set current row = 0 */
-		begin_partition(winstate);
-		/* If there are no input rows, we'll detect that and exit below */
-	}
-	else
+	/* We need to loop as the runCondition may filter out tuples */
+	for (;;)
 	{
-		/* Advance current row within partition */
-		winstate->currentpos++;
-		/* This might mean that the frame moves, too */
-		winstate->framehead_valid = false;
-		winstate->frametail_valid = false;
-		/* we don't need to invalidate grouptail here; see below */
-	}
+		if (winstate->buffer == NULL)
+		{
+			/* Initialize for first partition and set current row = 0 */
+			begin_partition(winstate);
+			/* If there are no input rows, we'll detect that and exit below */
+		}
+		else
+		{
+			/* Advance current row within partition */
+			winstate->currentpos++;
+			/* This might mean that the frame moves, too */
+			winstate->framehead_valid = false;
+			winstate->frametail_valid = false;
+			/* we don't need to invalidate grouptail here; see below */
+		}
 
-	/*
-	 * Spool all tuples up to and including the current row, if we haven't
-	 * already
-	 */
-	spool_tuples(winstate, winstate->currentpos);
+		/*
+		 * Spool all tuples up to and including the current row, if we haven't
+		 * already
+		 */
+		spool_tuples(winstate, winstate->currentpos);
 
-	/* Move to the next partition if we reached the end of this partition */
-	if (winstate->partition_spooled &&
-		winstate->currentpos >= winstate->spooled_rows)
-	{
-		release_partition(winstate);
+		/* Move to the next partition if we reached the end of this partition */
+		if (winstate->partition_spooled &&
+			winstate->currentpos >= winstate->spooled_rows)
+		{
+			release_partition(winstate);
+
+			if (winstate->more_partitions)
+			{
+				begin_partition(winstate);
+				Assert(winstate->spooled_rows > 0);
+
+				/* Come out of pass-through mode when changing partition */
+				winstate->status = WINDOWAGG_RUN;
+			}
+			else
+			{
+				/* No further partitions?  We're done */
+				winstate->status = WINDOWAGG_DONE;
+				return NULL;
+			}
+		}
+
+		/* final output execution is in ps_ExprContext */
+		econtext = winstate->ss.ps.ps_ExprContext;
+
+		/* Clear the per-output-tuple context for current row */
+		ResetExprContext(econtext);
 
-		if (winstate->more_partitions)
+		/*
+		 * Read the current row from the tuplestore, and save in
+		 * ScanTupleSlot. (We can't rely on the outerplan's output slot
+		 * because we may have to read beyond the current row.  Also, we have
+		 * to actually copy the row out of the tuplestore, since window
+		 * function evaluation might cause the tuplestore to dump its state to
+		 * disk.)
+		 *
+		 * In GROUPS mode, or when tracking a group-oriented exclusion clause,
+		 * we must also detect entering a new peer group and update associated
+		 * state when that happens.  We use temp_slot_2 to temporarily hold
+		 * the previous row for this purpose.
+		 *
+		 * Current row must be in the tuplestore, since we spooled it above.
+		 */
+		tuplestore_select_read_pointer(winstate->buffer, winstate->current_ptr);
+		if ((winstate->frameOptions & (FRAMEOPTION_GROUPS |
+									   FRAMEOPTION_EXCLUDE_GROUP |
+									   FRAMEOPTION_EXCLUDE_TIES)) &&
+			winstate->currentpos > 0)
 		{
-			begin_partition(winstate);
-			Assert(winstate->spooled_rows > 0);
+			ExecCopySlot(winstate->temp_slot_2, winstate->ss.ss_ScanTupleSlot);
+			if (!tuplestore_gettupleslot(winstate->buffer, true, true,
+										 winstate->ss.ss_ScanTupleSlot))
+				elog(ERROR, "unexpected end of tuplestore");
+			if (!are_peers(winstate, winstate->temp_slot_2,
+						   winstate->ss.ss_ScanTupleSlot))
+			{
+				winstate->currentgroup++;
+				winstate->groupheadpos = winstate->currentpos;
+				winstate->grouptail_valid = false;
+			}
+			ExecClearTuple(winstate->temp_slot_2);
 		}
 		else
 		{
-			winstate->all_done = true;
-			return NULL;
+			if (!tuplestore_gettupleslot(winstate->buffer, true, true,
+										 winstate->ss.ss_ScanTupleSlot))
+				elog(ERROR, "unexpected end of tuplestore");
 		}
-	}
 
-	/* final output execution is in ps_ExprContext */
-	econtext = winstate->ss.ps.ps_ExprContext;
+		/* don't evaluate the window functions when we're in pass-through mode */
+		if (winstate->status == WINDOWAGG_RUN)
+		{
+			/*
+			 * Evaluate true window functions
+			 */
+			numfuncs = winstate->numfuncs;
+			for (i = 0; i < numfuncs; i++)
+			{
+				WindowStatePerFunc perfuncstate = &(winstate->perfunc[i]);
 
-	/* Clear the per-output-tuple context for current row */
-	ResetExprContext(econtext);
+				if (perfuncstate->plain_agg)
+					continue;
+				eval_windowfunction(winstate, perfuncstate,
+									&(econtext->ecxt_aggvalues[perfuncstate->wfuncstate->wfuncno]),
+									&(econtext->ecxt_aggnulls[perfuncstate->wfuncstate->wfuncno]));
+			}
 
-	/*
-	 * Read the current row from the tuplestore, and save in ScanTupleSlot.
-	 * (We can't rely on the outerplan's output slot because we may have to
-	 * read beyond the current row.  Also, we have to actually copy the row
-	 * out of the tuplestore, since window function evaluation might cause the
-	 * tuplestore to dump its state to disk.)
-	 *
-	 * In GROUPS mode, or when tracking a group-oriented exclusion clause, we
-	 * must also detect entering a new peer group and update associated state
-	 * when that happens.  We use temp_slot_2 to temporarily hold the previous
-	 * row for this purpose.
-	 *
-	 * Current row must be in the tuplestore, since we spooled it above.
-	 */
-	tuplestore_select_read_pointer(winstate->buffer, winstate->current_ptr);
-	if ((winstate->frameOptions & (FRAMEOPTION_GROUPS |
-								   FRAMEOPTION_EXCLUDE_GROUP |
-								   FRAMEOPTION_EXCLUDE_TIES)) &&
-		winstate->currentpos > 0)
-	{
-		ExecCopySlot(winstate->temp_slot_2, winstate->ss.ss_ScanTupleSlot);
-		if (!tuplestore_gettupleslot(winstate->buffer, true, true,
-									 winstate->ss.ss_ScanTupleSlot))
-			elog(ERROR, "unexpected end of tuplestore");
-		if (!are_peers(winstate, winstate->temp_slot_2,
-					   winstate->ss.ss_ScanTupleSlot))
-		{
-			winstate->currentgroup++;
-			winstate->groupheadpos = winstate->currentpos;
-			winstate->grouptail_valid = false;
+			/*
+			 * Evaluate aggregates
+			 */
+			if (winstate->numaggs > 0)
+				eval_windowaggregates(winstate);
 		}
-		ExecClearTuple(winstate->temp_slot_2);
-	}
-	else
-	{
-		if (!tuplestore_gettupleslot(winstate->buffer, true, true,
-									 winstate->ss.ss_ScanTupleSlot))
-			elog(ERROR, "unexpected end of tuplestore");
-	}
 
-	/*
-	 * Evaluate true window functions
-	 */
-	numfuncs = winstate->numfuncs;
-	for (i = 0; i < numfuncs; i++)
-	{
-		WindowStatePerFunc perfuncstate = &(winstate->perfunc[i]);
+		/*
+		 * If we have created auxiliary read pointers for the frame or group
+		 * boundaries, force them to be kept up-to-date, because we don't know
+		 * whether the window function(s) will do anything that requires that.
+		 * Failing to advance the pointers would result in being unable to
+		 * trim data from the tuplestore, which is bad.  (If we could know in
+		 * advance whether the window functions will use frame boundary info,
+		 * we could skip creating these pointers in the first place ... but
+		 * unfortunately the window function API doesn't require that.)
+		 */
+		if (winstate->framehead_ptr >= 0)
+			update_frameheadpos(winstate);
+		if (winstate->frametail_ptr >= 0)
+			update_frametailpos(winstate);
+		if (winstate->grouptail_ptr >= 0)
+			update_grouptailpos(winstate);
 
-		if (perfuncstate->plain_agg)
-			continue;
-		eval_windowfunction(winstate, perfuncstate,
-							&(econtext->ecxt_aggvalues[perfuncstate->wfuncstate->wfuncno]),
-							&(econtext->ecxt_aggnulls[perfuncstate->wfuncstate->wfuncno]));
-	}
+		/*
+		 * Truncate any no-longer-needed rows from the tuplestore.
+		 */
+		tuplestore_trim(winstate->buffer);
 
-	/*
-	 * Evaluate aggregates
-	 */
-	if (winstate->numaggs > 0)
-		eval_windowaggregates(winstate);
+		/*
+		 * Form and return a projection tuple using the windowfunc results and
+		 * the current row.  Setting ecxt_outertuple arranges that any Vars
+		 * will be evaluated with respect to that row.
+		 */
+		econtext->ecxt_outertuple = winstate->ss.ss_ScanTupleSlot;
 
-	/*
-	 * If we have created auxiliary read pointers for the frame or group
-	 * boundaries, force them to be kept up-to-date, because we don't know
-	 * whether the window function(s) will do anything that requires that.
-	 * Failing to advance the pointers would result in being unable to trim
-	 * data from the tuplestore, which is bad.  (If we could know in advance
-	 * whether the window functions will use frame boundary info, we could
-	 * skip creating these pointers in the first place ... but unfortunately
-	 * the window function API doesn't require that.)
-	 */
-	if (winstate->framehead_ptr >= 0)
-		update_frameheadpos(winstate);
-	if (winstate->frametail_ptr >= 0)
-		update_frametailpos(winstate);
-	if (winstate->grouptail_ptr >= 0)
-		update_grouptailpos(winstate);
+		slot = ExecProject(winstate->ss.ps.ps_ProjInfo);
 
-	/*
-	 * Truncate any no-longer-needed rows from the tuplestore.
-	 */
-	tuplestore_trim(winstate->buffer);
+		if (winstate->status == WINDOWAGG_RUN)
+		{
+			econtext->ecxt_scantuple = slot;
 
-	/*
-	 * Form and return a projection tuple using the windowfunc results and the
-	 * current row.  Setting ecxt_outertuple arranges that any Vars will be
-	 * evaluated with respect to that row.
-	 */
-	econtext->ecxt_outertuple = winstate->ss.ss_ScanTupleSlot;
+			/*
+			 * Now evaluate the run condition to see if we need to go into
+			 * pass-through mode, or maybe stop completely.
+			 */
+			if (!ExecQual(winstate->runcondition, econtext))
+			{
+				/*
+				 * Determine which mode to move into.  If there is no
+				 * PARTITION BY clause and we're the top-level WindowAgg then
+				 * we're done.  This tuple and any future tuples cannot
+				 * possibly match the runcondition.  However, when there is a
+				 * PARTITION BY clause or we're not the top-level window we
+				 * can't just stop as we need to either process other
+				 * partitions or ensure WindowAgg nodes above us receive all
+				 * of the tuples they need to process their WindowFuncs.
+				 */
+				if (winstate->use_pass_through)
+				{
+					/*
+					 * STRICT pass-through mode is required for the top window
+					 * when there is a PARTITION BY clause.  Otherwise we must
+					 * ensure we store tuples that don't match the
+					 * runcondition so they're available to WindowAggs above.
+					 */
+					if (winstate->top_window)
+					{
+						winstate->status = WINDOWAGG_PASSTHROUGH_STRICT;
+						continue;
+					}
+					else
+						winstate->status = WINDOWAGG_PASSTHROUGH;
+				}
+				else
+				{
+					/*
+					 * Pass-through not required.  We can just return NULL.
+					 * Nothing else will match the runcondition.
+					 */
+					winstate->status = WINDOWAGG_DONE;
+					return NULL;
+				}
+			}
 
-	return ExecProject(winstate->ss.ps.ps_ProjInfo);
+			/*
+			 * Filter out any tuples we don't need in the top-level WindowAgg.
+			 */
+			if (!ExecQual(winstate->ss.ps.qual, econtext))
+			{
+				InstrCountFiltered1(winstate, 1);
+				continue;
+			}
+
+			break;
+		}
+
+		/*
+		 * When not in WINDOWAGG_RUN mode, we must still return this tuple if
+		 * we're anything apart from the top window.
+		 */
+		else if (!winstate->top_window)
+			break;
+	}
+
+	return slot;
 }
 
 /* -----------------
@@ -2300,12 +2403,32 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 							  "WindowAgg Aggregates",
 							  ALLOCSET_DEFAULT_SIZES);
 
+	/* Only the top-level WindowAgg may have a qual */
+	Assert(node->plan.qual == NIL || node->topWindow);
+
+	/* Initialize the qual */
+	winstate->ss.ps.qual = ExecInitQual(node->plan.qual,
+										(PlanState *) winstate);
+
+	/*
+	 * Set up the run condition, if we received one from the query planner.
+	 * When set, this may allow us to move into pass-through mode so that we
+	 * don't have to perform any further evaluation of WindowFuncs in the
+	 * current partition or possibly stop returning tuples altogether when all
+	 * tuples are in the same partition.
+	 */
+	winstate->runcondition = ExecInitQual(node->runCondition,
+										  (PlanState *) winstate);
+
 	/*
-	 * WindowAgg nodes never have quals, since they can only occur at the
-	 * logical top level of a query (ie, after any WHERE or HAVING filters)
+	 * When we're not the top-level WindowAgg node, or we are but have a
+	 * PARTITION BY clause, we must move into one of the
+	 * WINDOWAGG_PASSTHROUGH* modes when the runCondition becomes false.
 	 */
-	Assert(node->plan.qual == NIL);
-	winstate->ss.ps.qual = NULL;
+	winstate->use_pass_through = !node->topWindow || node->partNumCols > 0;
+
+	/* remember whether we're the top-level WindowAgg or one below it */
+	winstate->top_window = node->topWindow;
 
 	/*
 	 * initialize child nodes
@@ -2500,6 +2623,9 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 		winstate->agg_winobj = agg_winobj;
 	}
 
+	/* Set the status to running */
+	winstate->status = WINDOWAGG_RUN;
+
 	/* copy frame options to state node for easy access */
 	winstate->frameOptions = frameOptions;
 
@@ -2579,7 +2705,7 @@ ExecReScanWindowAgg(WindowAggState *node)
 	PlanState  *outerPlan = outerPlanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 
-	node->all_done = false;
+	node->status = WINDOWAGG_RUN;
 	node->all_first = true;
 
 	/* release tuplestore et al */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 46a1943d97..6f56b269ce 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1104,11 +1104,14 @@ _copyWindowAgg(const WindowAgg *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runCondition);
+	COPY_NODE_FIELD(runConditionOrig);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
 	COPY_SCALAR_FIELD(inRangeAsc);
 	COPY_SCALAR_FIELD(inRangeNullsFirst);
+	COPY_SCALAR_FIELD(topWindow);
 
 	return newnode;
 }
@@ -3061,6 +3064,7 @@ _copyWindowClause(const WindowClause *from)
 	COPY_SCALAR_FIELD(frameOptions);
 	COPY_NODE_FIELD(startOffset);
 	COPY_NODE_FIELD(endOffset);
+	COPY_NODE_FIELD(runCondition);
 	COPY_SCALAR_FIELD(startInRangeFunc);
 	COPY_SCALAR_FIELD(endInRangeFunc);
 	COPY_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 1f765f42c9..4b4f380bba 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -3234,6 +3234,7 @@ _equalWindowClause(const WindowClause *a, const WindowClause *b)
 	COMPARE_SCALAR_FIELD(frameOptions);
 	COMPARE_NODE_FIELD(startOffset);
 	COMPARE_NODE_FIELD(endOffset);
+	COMPARE_NODE_FIELD(runCondition);
 	COMPARE_SCALAR_FIELD(startInRangeFunc);
 	COMPARE_SCALAR_FIELD(endInRangeFunc);
 	COMPARE_SCALAR_FIELD(inRangeColl);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 13e1643530..d5f5e76c55 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -829,11 +829,14 @@ _outWindowAgg(StringInfo str, const WindowAgg *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runCondition);
+	WRITE_NODE_FIELD(runConditionOrig);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
 	WRITE_BOOL_FIELD(inRangeAsc);
 	WRITE_BOOL_FIELD(inRangeNullsFirst);
+	WRITE_BOOL_FIELD(topWindow);
 }
 
 static void
@@ -2283,6 +2286,8 @@ _outWindowAggPath(StringInfo str, const WindowAggPath *node)
 
 	WRITE_NODE_FIELD(subpath);
 	WRITE_NODE_FIELD(winclause);
+	WRITE_NODE_FIELD(qual);
+	WRITE_BOOL_FIELD(topwindow);
 }
 
 static void
@@ -3293,6 +3298,7 @@ _outWindowClause(StringInfo str, const WindowClause *node)
 	WRITE_INT_FIELD(frameOptions);
 	WRITE_NODE_FIELD(startOffset);
 	WRITE_NODE_FIELD(endOffset);
+	WRITE_NODE_FIELD(runCondition);
 	WRITE_OID_FIELD(startInRangeFunc);
 	WRITE_OID_FIELD(endInRangeFunc);
 	WRITE_OID_FIELD(inRangeColl);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 48f7216c9e..3d150cb25d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -384,6 +384,7 @@ _readWindowClause(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runCondition);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
@@ -2576,11 +2577,14 @@ _readWindowAgg(void)
 	READ_INT_FIELD(frameOptions);
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
+	READ_NODE_FIELD(runCondition);
+	READ_NODE_FIELD(runConditionOrig);
 	READ_OID_FIELD(startInRangeFunc);
 	READ_OID_FIELD(endInRangeFunc);
 	READ_OID_FIELD(inRangeColl);
 	READ_BOOL_FIELD(inRangeAsc);
 	READ_BOOL_FIELD(inRangeNullsFirst);
+	READ_BOOL_FIELD(topWindow);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 169b1d53fc..998212dda8 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/supportnodes.h"
 #ifdef OPTIMIZER_DEBUG
 #include "nodes/print.h"
 #endif
@@ -2157,6 +2158,269 @@ has_multiple_baserels(PlannerInfo *root)
 	return false;
 }
 
+/*
+ * find_window_run_conditions
+ *		Determine if 'wfunc' is really a WindowFunc and call its prosupport
+ *		function to determine the function's monotonic properties.  We then
+ *		see if 'opexpr' can be used to short-circuit execution.
+ *
+ * For example row_number() over (order by ...) always produces a value one
+ * higher than the previous.  If someone has a window function in a subquery
+ * and has a WHERE clause in the outer query to filter rows <= 10, then we may
+ * as well stop processing the windowagg once the row number reaches 11.  Here
+ * we check if 'opexpr' might help us to stop doing needless extra processing
+ * in WindowAgg nodes.
+ *
+ * '*keep_original' is set to true if the caller should also use 'opexpr' for
+ * its original purpose.  This is set to false if the caller can assume that
+ * the run condition will handle all of the required filtering.
+ *
+ * Returns true if 'opexpr' was found to be useful and was added to the
+ * WindowClause's runCondition.  We also set *keep_original accordingly.
+ * If the 'opexpr' cannot be used then we set *keep_original to true and
+ * return false.
+ */
+static bool
+find_window_run_conditions(Query *subquery, RangeTblEntry *rte, Index rti,
+						   AttrNumber attno, WindowFunc *wfunc, OpExpr *opexpr,
+						   bool wfunc_left, bool *keep_original)
+{
+	Oid			prosupport;
+	Expr	   *otherexpr;
+	SupportRequestWFuncMonotonic req;
+	SupportRequestWFuncMonotonic *res;
+	WindowClause *wclause;
+	List	   *opinfos;
+	OpExpr	   *runopexpr;
+	Oid			runoperator;
+	ListCell   *lc;
+
+	*keep_original = true;
+
+	while (IsA(wfunc, RelabelType))
+		wfunc = (WindowFunc *) ((RelabelType *) wfunc)->arg;
+
+	/* we can only work with window functions */
+	if (!IsA(wfunc, WindowFunc))
+		return false;
+
+	prosupport = get_func_support(wfunc->winfnoid);
+
+	/* Check if there's a support function for 'wfunc' */
+	if (!OidIsValid(prosupport))
+		return false;
+
+	/* get the Expr from the other side of the OpExpr */
+	if (wfunc_left)
+		otherexpr = lsecond(opexpr->args);
+	else
+		otherexpr = linitial(opexpr->args);
+
+	/*
+	 * The value being compared must not change during the evaluation of the
+	 * window partition.
+	 */
+	if (!is_pseudo_constant_clause((Node *) otherexpr))
+		return false;
+
+	/* find the window clause belonging to the window function */
+	wclause = (WindowClause *) list_nth(subquery->windowClause,
+										wfunc->winref - 1);
+
+	req.type = T_SupportRequestWFuncMonotonic;
+	req.window_func = wfunc;
+	req.window_clause = wclause;
+
+	/* call the support function */
+	res = (SupportRequestWFuncMonotonic *)
+		DatumGetPointer(OidFunctionCall1(prosupport,
+										 PointerGetDatum(&req)));
+
+	/*
+	 * Nothing to do if the function is neither monotonically increasing nor
+	 * monotonically decreasing.
+	 */
+	if (res == NULL || res->monotonic == MONOTONICFUNC_NONE)
+		return false;
+
+	runopexpr = NULL;
+	runoperator = opexpr->opno;	/* default: reuse the original operator */
+	opinfos = get_op_btree_interpretation(opexpr->opno);
+
+	foreach(lc, opinfos)
+	{
+		OpBtreeInterpretation *opinfo = (OpBtreeInterpretation *) lfirst(lc);
+		int			strategy = opinfo->strategy;
+
+		/* handle < / <= */
+		if (strategy == BTLessStrategyNumber ||
+			strategy == BTLessEqualStrategyNumber)
+		{
+			/*
+			 * < / <= is supported in the form <wfunc> op <pseudoconst> for
+			 * monotonically increasing functions and in the form
+			 * <pseudoconst> op <wfunc> for monotonically decreasing functions.
+			 */
+			if ((wfunc_left && (res->monotonic & MONOTONICFUNC_INCREASING)) ||
+				(!wfunc_left && (res->monotonic & MONOTONICFUNC_DECREASING)))
+			{
+				*keep_original = false;
+				runopexpr = opexpr;
+				runoperator = opexpr->opno;
+			}
+			break;
+		}
+		/* handle > / >= */
+		else if (strategy == BTGreaterStrategyNumber ||
+				 strategy == BTGreaterEqualStrategyNumber)
+		{
+			/*
+			 * > / >= is supported in the form <wfunc> op <pseudoconst> for
+			 * monotonically decreasing functions and in the form
+			 * <pseudoconst> op <wfunc> for monotonically increasing functions.
+			 */
+			if ((wfunc_left && (res->monotonic & MONOTONICFUNC_DECREASING)) ||
+				(!wfunc_left && (res->monotonic & MONOTONICFUNC_INCREASING)))
+			{
+				*keep_original = false;
+				runopexpr = opexpr;
+				runoperator = opexpr->opno;
+			}
+			break;
+		}
+		/* handle = */
+		else if (strategy == BTEqualStrategyNumber)
+		{
+			int16		newstrategy;
+
+			/*
+			 * When both monotonically increasing and decreasing then the
+			 * return value of the window function will be the same each time.
+			 * We can simply use 'opexpr' as the run condition without
+			 * modifying it.
+			 */
+			if ((res->monotonic & MONOTONICFUNC_BOTH) == MONOTONICFUNC_BOTH)
+			{
+				*keep_original = false;
+				runopexpr = opexpr;
+				break;
+			}
+
+			/*
+			 * When monotonically increasing we make a qual with <wfunc> <=
+			 * <value> or <value> >= <wfunc> in order to filter out values
+			 * which are above the value in the equality condition.  For
+			 * monotonically decreasing functions we want to filter values
+			 * below the value in the equality condition.
+			 */
+			if (res->monotonic & MONOTONICFUNC_INCREASING)
+				newstrategy = wfunc_left ? BTLessEqualStrategyNumber : BTGreaterEqualStrategyNumber;
+			else
+				newstrategy = wfunc_left ? BTGreaterEqualStrategyNumber : BTLessEqualStrategyNumber;
+
+			/* We must keep the original equality qual */
+			*keep_original = true;
+			runopexpr = opexpr;
+
+			/* determine the operator to use for the runCondition qual */
+			runoperator = get_opfamily_member(opinfo->opfamily_id,
+											  opinfo->oplefttype,
+											  opinfo->oprighttype,
+											  newstrategy);
+			break;
+		}
+	}
+
+	if (runopexpr != NULL)
+	{
+		Expr	   *newexpr;
+
+		/*
+		 * Build the qual required for the run condition keeping the
+		 * WindowFunc on the same side as it was originally.
+		 */
+		if (wfunc_left)
+			newexpr = make_opclause(runoperator,
+									runopexpr->opresulttype,
+									runopexpr->opretset, (Expr *) wfunc,
+									otherexpr, runopexpr->opcollid,
+									runopexpr->inputcollid);
+		else
+			newexpr = make_opclause(runoperator,
+									runopexpr->opresulttype,
+									runopexpr->opretset,
+									otherexpr, (Expr *) wfunc,
+									runopexpr->opcollid,
+									runopexpr->inputcollid);
+
+		wclause->runCondition = lappend(wclause->runCondition, newexpr);
+
+		return true;
+	}
+
+	/* unsupported OpExpr */
+	return false;
+}
+
+/*
+ * check_and_push_window_quals
+ *		Check if 'clause' is a qual that can be pushed into a WindowFunc's
+ *		WindowClause as a 'runCondition' qual.  These, when present, allow
+ *		some unnecessary work to be skipped during execution.
+ *
+ * Returns true if the caller still must keep the original qual or false if
+ * the caller can safely ignore the original qual because the WindowAgg node
+ * will use the runCondition to stop returning tuples.
+ */
+static bool
+check_and_push_window_quals(Query *subquery, RangeTblEntry *rte, Index rti,
+							Node *clause)
+{
+	OpExpr	   *opexpr = (OpExpr *) clause;
+	bool		keep_original = true;
+	Var		   *var1;
+	Var		   *var2;
+
+	/* We're only able to use OpExprs with 2 operands */
+	if (!IsA(opexpr, OpExpr))
+		return true;
+
+	if (list_length(opexpr->args) != 2)
+		return true;
+
+	/*
+	 * Check for plain Vars that reference window functions in the subquery.
+	 * If we find any, we'll ask find_window_run_conditions() if 'opexpr' can
+	 * be used as part of the run condition.
+	 */
+
+	/* Check the left side of the OpExpr */
+	var1 = linitial(opexpr->args);
+	if (IsA(var1, Var) && var1->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var1->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		if (find_window_run_conditions(subquery, rte, rti, tle->resno, wfunc,
+									   opexpr, true, &keep_original))
+			return keep_original;
+	}
+
+	/* and check the right side */
+	var2 = lsecond(opexpr->args);
+	if (IsA(var2, Var) && var2->varattno > 0)
+	{
+		TargetEntry *tle = list_nth(subquery->targetList, var2->varattno - 1);
+		WindowFunc *wfunc = (WindowFunc *) tle->expr;
+
+		if (find_window_run_conditions(subquery, rte, rti, tle->resno, wfunc,
+									   opexpr, false, &keep_original))
+			return keep_original;
+	}
+
+	return true;
+}
+
 /*
  * set_subquery_pathlist
  *		Generate SubqueryScan access paths for a subquery RTE
@@ -2245,19 +2509,31 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		foreach(l, rel->baserestrictinfo)
 		{
 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+			Node	   *clause = (Node *) rinfo->clause;
 
 			if (!rinfo->pseudoconstant &&
 				qual_is_pushdown_safe(subquery, rti, rinfo, &safetyInfo))
 			{
-				Node	   *clause = (Node *) rinfo->clause;
-
 				/* Push it down */
 				subquery_push_qual(subquery, rte, rti, clause);
 			}
 			else
 			{
-				/* Keep it in the upper query */
-				upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				/*
+				 * Since we can't push the qual down into the subquery, check
+				 * if it happens to reference a window function.  If so then
+				 * it might be useful to use for the WindowAgg's runCondition.
+				 */
+				if (!subquery->hasWindowFuncs ||
+					check_and_push_window_quals(subquery, rte, rti, clause))
+				{
+					/*
+					 * subquery has no window funcs or the clause is not a
+					 * suitable window run condition qual or it is, but the
+					 * original must also be kept in the upper query.
+					 */
+					upperrestrictlist = lappend(upperrestrictlist, rinfo);
+				}
 			}
 		}
 		rel->baserestrictinfo = upperrestrictlist;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 51591bb812..95476ada0b 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -288,6 +288,7 @@ static WindowAgg *make_windowagg(List *tlist, Index winref,
 								 int frameOptions, Node *startOffset, Node *endOffset,
 								 Oid startInRangeFunc, Oid endInRangeFunc,
 								 Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
+								 List *runCondition, List *qual, bool topWindow,
 								 Plan *lefttree);
 static Group *make_group(List *tlist, List *qual, int numGroupCols,
 						 AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
@@ -2672,6 +2673,9 @@ create_windowagg_plan(PlannerInfo *root, WindowAggPath *best_path)
 						  wc->inRangeColl,
 						  wc->inRangeAsc,
 						  wc->inRangeNullsFirst,
+						  wc->runCondition,
+						  best_path->qual,
+						  best_path->topwindow,
 						  subplan);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -6558,7 +6562,7 @@ make_windowagg(List *tlist, Index winref,
 			   int frameOptions, Node *startOffset, Node *endOffset,
 			   Oid startInRangeFunc, Oid endInRangeFunc,
 			   Oid inRangeColl, bool inRangeAsc, bool inRangeNullsFirst,
-			   Plan *lefttree)
+			   List *runCondition, List *qual, bool topWindow, Plan *lefttree)
 {
 	WindowAgg  *node = makeNode(WindowAgg);
 	Plan	   *plan = &node->plan;
@@ -6575,17 +6579,20 @@ make_windowagg(List *tlist, Index winref,
 	node->frameOptions = frameOptions;
 	node->startOffset = startOffset;
 	node->endOffset = endOffset;
+	node->runCondition = runCondition;
+	/* a duplicate of the above for EXPLAIN */
+	node->runConditionOrig = runCondition;
 	node->startInRangeFunc = startInRangeFunc;
 	node->endInRangeFunc = endInRangeFunc;
 	node->inRangeColl = inRangeColl;
 	node->inRangeAsc = inRangeAsc;
 	node->inRangeNullsFirst = inRangeNullsFirst;
+	node->topWindow = topWindow;
 
 	plan->targetlist = tlist;
 	plan->lefttree = lefttree;
 	plan->righttree = NULL;
-	/* WindowAgg nodes never have a qual clause */
-	plan->qual = NIL;
+	plan->qual = qual;
 
 	return node;
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b2569c5d0c..b090b087e9 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4190,6 +4190,7 @@ create_one_window_path(PlannerInfo *root,
 {
 	PathTarget *window_target;
 	ListCell   *l;
+	List	   *topqual = NIL;
 
 	/*
 	 * Since each window clause could require a different sort order, we stack
@@ -4214,6 +4215,7 @@ create_one_window_path(PlannerInfo *root,
 		List	   *window_pathkeys;
 		int			presorted_keys;
 		bool		is_sorted;
+		bool		topwindow;
 
 		window_pathkeys = make_pathkeys_for_window(root,
 												   wc,
@@ -4277,10 +4279,21 @@ create_one_window_path(PlannerInfo *root,
 			window_target = output_target;
 		}
 
+		/* mark the final item in the list as the top-level window */
+		topwindow = foreach_current_index(l) == list_length(activeWindows) - 1;
+
+		/*
+		 * Accumulate all of the runConditions from each intermediate
+		 * WindowClause.  The top-level WindowAgg must pass these as a qual so
+		 * that it filters out unwanted tuples correctly.
+		 */
+		if (!topwindow)
+			topqual = list_concat(topqual, wc->runCondition);
+
 		path = (Path *)
 			create_windowagg_path(root, window_rel, path, window_target,
 								  wflists->windowFuncs[wc->winref],
-								  wc);
+								  wc, topwindow ? topqual : NIL, topwindow);
 	}
 
 	add_path(window_rel, path);
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 7519723081..6ea3505646 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -71,6 +71,13 @@ typedef struct
 	double		num_exec;
 } fix_upper_expr_context;
 
+typedef struct
+{
+	PlannerInfo *root;
+	indexed_tlist *subplan_itlist;
+	int			newvarno;
+} fix_windowagg_cond_context;
+
 /*
  * Selecting the best alternative in an AlternativeSubPlan expression requires
  * estimating how many times that expression will be evaluated.  For an
@@ -171,6 +178,9 @@ static List *set_returning_clause_references(PlannerInfo *root,
 											 Plan *topplan,
 											 Index resultRelation,
 											 int rtoffset);
+static List *set_windowagg_runcondition_references(PlannerInfo *root,
+												   List *runcondition,
+												   Plan *plan);
 
 
 /*****************************************************************************
@@ -885,6 +895,18 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 			{
 				WindowAgg  *wplan = (WindowAgg *) plan;
 
+				/*
+				 * Adjust the WindowAgg's run conditions by swapping the
+				 * WindowFuncs references out to instead reference the Var in
+				 * the scan slot so that when the executor evaluates the
+				 * runCondition, it receives the WindowFunc's value from the
+				 * slot that the result has just been stored into rather than
+				 * evaluating the WindowFunc all over again.
+				 */
+				wplan->runCondition = set_windowagg_runcondition_references(root,
+																			wplan->runCondition,
+																			(Plan *) wplan);
+
 				set_upper_references(root, plan, rtoffset);
 
 				/*
@@ -896,6 +918,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
 				wplan->endOffset =
 					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+				wplan->runCondition = fix_scan_list(root,
+													wplan->runCondition,
+													rtoffset,
+													NUM_EXEC_TLIST(plan));
+				wplan->runConditionOrig = fix_scan_list(root,
+														wplan->runConditionOrig,
+														rtoffset,
+														NUM_EXEC_TLIST(plan));
 			}
 			break;
 		case T_Result:
@@ -3064,6 +3094,78 @@ set_returning_clause_references(PlannerInfo *root,
 	return rlist;
 }
 
+/*
+ * fix_windowagg_condition_expr_mutator
+ *		Mutator function for replacing WindowFuncs with the corresponding Var
+ *		in the targetlist which references that WindowFunc.
+ */
+static Node *
+fix_windowagg_condition_expr_mutator(Node *node,
+									 fix_windowagg_cond_context *context)
+{
+	if (node == NULL)
+		return NULL;
+
+	if (IsA(node, WindowFunc))
+	{
+		Var		   *newvar;
+
+		newvar = search_indexed_tlist_for_non_var((Expr *) node,
+												  context->subplan_itlist,
+												  context->newvarno);
+		if (newvar)
+			return (Node *) newvar;
+		elog(ERROR, "WindowFunc not found in subplan target lists");
+	}
+
+	return expression_tree_mutator(node,
+								   fix_windowagg_condition_expr_mutator,
+								   (void *) context);
+}
+
+/*
+ * fix_windowagg_condition_expr
+ *		Converts references in 'runcondition' so that any WindowFunc
+ *		references are swapped out for a Var which references the matching
+ *		WindowFunc in 'subplan_itlist'.
+ */
+static List *
+fix_windowagg_condition_expr(PlannerInfo *root,
+							 List *runcondition,
+							 indexed_tlist *subplan_itlist)
+{
+	fix_windowagg_cond_context context;
+
+	context.root = root;
+	context.subplan_itlist = subplan_itlist;
+	context.newvarno = 0;
+
+	return (List *) fix_windowagg_condition_expr_mutator((Node *) runcondition,
+														 &context);
+}
+
+/*
+ * set_windowagg_runcondition_references
+ *		Converts references in 'runcondition' so that any WindowFunc
+ *		references are swapped out for a Var which references the matching
+ *		WindowFunc in 'plan' targetlist.
+ */
+static List *
+set_windowagg_runcondition_references(PlannerInfo *root,
+									  List *runcondition,
+									  Plan *plan)
+{
+	List	   *newlist;
+	indexed_tlist *itlist;
+
+	itlist = build_tlist_index(plan->targetlist);
+
+	newlist = fix_windowagg_condition_expr(root, runcondition, itlist);
+
+	pfree(itlist);
+
+	return newlist;
+}
 
 /*****************************************************************************
  *					QUERY DEPENDENCY MANAGEMENT
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1670e54644..e2a3c110ce 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3388,6 +3388,10 @@ create_minmaxagg_path(PlannerInfo *root,
  * 'target' is the PathTarget to be computed
  * 'windowFuncs' is a list of WindowFunc structs
  * 'winclause' is a WindowClause that is common to all the WindowFuncs
+ * 'qual' is the list of WindowClause.runCondition quals from lower-level
+ *		WindowAggPaths; it must be NIL when topwindow is false
+ * 'topwindow' is true only for the top-level WindowAgg and false for all
+ *		intermediate WindowAggs
  *
  * The input must be sorted according to the WindowClause's PARTITION keys
  * plus ORDER BY keys.
@@ -3398,10 +3402,15 @@ create_windowagg_path(PlannerInfo *root,
 					  Path *subpath,
 					  PathTarget *target,
 					  List *windowFuncs,
-					  WindowClause *winclause)
+					  WindowClause *winclause,
+					  List *qual,
+					  bool topwindow)
 {
 	WindowAggPath *pathnode = makeNode(WindowAggPath);
 
+	/* qual can only be set for the topwindow */
+	Assert(qual == NIL || topwindow);
+
 	pathnode->path.pathtype = T_WindowAgg;
 	pathnode->path.parent = rel;
 	pathnode->path.pathtarget = target;
@@ -3416,6 +3425,8 @@ create_windowagg_path(PlannerInfo *root,
 
 	pathnode->subpath = subpath;
 	pathnode->winclause = winclause;
+	pathnode->qual = qual;
+	pathnode->topwindow = topwindow;
 
 	/*
 	 * For costing purposes, assume that there are no redundant partitioning
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index 4a87114a4f..98d4323755 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -24,6 +24,7 @@
 #include "nodes/supportnodes.h"
 #include "optimizer/optimizer.h"
 #include "utils/builtins.h"
+#include "utils/lsyscache.h"
 
 
 typedef struct
@@ -818,6 +819,49 @@ int8dec_any(PG_FUNCTION_ARGS)
 	return int8dec(fcinfo);
 }
 
+/*
+ * int8inc_support
+ *		prosupport function for int8inc() and int8inc_any()
+ */
+Datum
+int8inc_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+		MonotonicFunction monotonic = MONOTONICFUNC_NONE;
+		int			frameOptions = req->window_clause->frameOptions;
+
+		/* With no ORDER BY clause, all rows are peers */
+		if (req->window_clause->orderClause == NIL)
+			monotonic = MONOTONICFUNC_BOTH;
+		else
+		{
+			/*
+			 * Otherwise take into account the frame options.  When the frame
+			 * bound is the start of the window then the resulting value can
+			 * never decrease, and is therefore monotonically increasing.
+			 */
+			if (frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING)
+				monotonic |= MONOTONICFUNC_INCREASING;
+
+			/*
+			 * Likewise, if the frame bound is the end of the window then the
+			 * resulting value can never increase.
+			 */
+			if (frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING)
+				monotonic |= MONOTONICFUNC_DECREASING;
+		}
+
+		req->monotonic = monotonic;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 
 Datum
 int8larger(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/adt/windowfuncs.c b/src/backend/utils/adt/windowfuncs.c
index 3e0cc9be1a..596564fa15 100644
--- a/src/backend/utils/adt/windowfuncs.c
+++ b/src/backend/utils/adt/windowfuncs.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "nodes/supportnodes.h"
 #include "utils/builtins.h"
 #include "windowapi.h"
 
@@ -88,6 +89,26 @@ window_row_number(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(curpos + 1);
 }
 
+/*
+ * window_row_number_support
+ *		prosupport function for window_row_number()
+ */
+Datum
+window_row_number_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* row_number() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
 
 /*
  * rank
@@ -110,6 +131,27 @@ window_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_rank_support
+ *		prosupport function for window_rank()
+ */
+Datum
+window_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* rank() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * dense_rank
  * Rank increases by 1 when key columns change.
@@ -130,6 +172,27 @@ window_dense_rank(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(context->rank);
 }
 
+/*
+ * window_dense_rank_support
+ *		prosupport function for window_dense_rank()
+ */
+Datum
+window_dense_rank_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestWFuncMonotonic))
+	{
+		SupportRequestWFuncMonotonic *req = (SupportRequestWFuncMonotonic *) rawreq;
+
+		/* dense_rank() is monotonically increasing */
+		req->monotonic = MONOTONICFUNC_INCREASING;
+		PG_RETURN_POINTER(req);
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
+
 /*
  * percent_rank
  * return fraction between 0 and 1 inclusive,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 61876c4e80..be31e5ca25 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6662,11 +6662,16 @@
 # count has two forms: count(any) and count(*)
 { oid => '2147',
   descr => 'number of input rows for which the input expression is not null',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => 'any', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => 'any',
+  prosrc => 'aggregate_dummy' },
 { oid => '2803', descr => 'number of input rows',
-  proname => 'count', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
-  proargtypes => '', prosrc => 'aggregate_dummy' },
+  proname => 'count', prosupport => 'int8inc_support', prokind => 'a',
+  proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'aggregate_dummy' },
+{ oid => '8802', descr => 'planner support for count run condition',
+  proname => 'int8inc_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'int8inc_support' },
 
 { oid => '2718',
   descr => 'population variance of bigint input values (square of the population standard deviation)',
@@ -10186,14 +10191,26 @@
 
 # SQL-spec window functions
 { oid => '3100', descr => 'row number within partition',
-  proname => 'row_number', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_row_number' },
+  proname => 'row_number', prosupport => 'window_row_number_support',
+  prokind => 'w', proisstrict => 'f',  prorettype => 'int8',
+  proargtypes => '', prosrc => 'window_row_number' },
+{ oid => '8799', descr => 'planner support for row_number run condition',
+  proname => 'window_row_number_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_row_number_support' },
 { oid => '3101', descr => 'integer rank with gaps',
-  proname => 'rank', prokind => 'w', proisstrict => 'f', prorettype => 'int8',
+  proname => 'rank', prosupport => 'window_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8',
   proargtypes => '', prosrc => 'window_rank' },
+{ oid => '8800', descr => 'planner support for rank run condition',
+  proname => 'window_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_rank_support' },
 { oid => '3102', descr => 'integer rank without gaps',
-  proname => 'dense_rank', prokind => 'w', proisstrict => 'f',
-  prorettype => 'int8', proargtypes => '', prosrc => 'window_dense_rank' },
+  proname => 'dense_rank', prosupport => 'window_dense_rank_support',
+  prokind => 'w', proisstrict => 'f', prorettype => 'int8', proargtypes => '',
+  prosrc => 'window_dense_rank' },
+{ oid => '8801', descr => 'planner support for dense rank run condition',
+  proname => 'window_dense_rank_support', prorettype => 'internal',
+  proargtypes => 'internal', prosrc => 'window_dense_rank_support' },
 { oid => '3103', descr => 'fractional rank within partition',
   proname => 'percent_rank', prokind => 'w', proisstrict => 'f',
   prorettype => 'float8', proargtypes => '', prosrc => 'window_percent_rank' },
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cbbcff81d2..94b191f8ae 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2406,6 +2406,18 @@ typedef struct AggState
 typedef struct WindowStatePerFuncData *WindowStatePerFunc;
 typedef struct WindowStatePerAggData *WindowStatePerAgg;
 
+/*
+ * WindowAggStatus -- Used to track the status of WindowAggState
+ */
+typedef enum WindowAggStatus
+{
+	WINDOWAGG_DONE,				/* No more processing to do */
+	WINDOWAGG_RUN,				/* Normal processing of window funcs */
+	WINDOWAGG_PASSTHROUGH,		/* Don't eval window funcs */
+	WINDOWAGG_PASSTHROUGH_STRICT	/* Pass-through plus don't store new
+									 * tuples during spool */
+} WindowAggStatus;
+
 typedef struct WindowAggState
 {
 	ScanState	ss;				/* its first field is NodeTag */
@@ -2432,6 +2444,7 @@ typedef struct WindowAggState
 	struct WindowObjectData *agg_winobj;	/* winobj for aggregate fetches */
 	int64		aggregatedbase; /* start row for current aggregates */
 	int64		aggregatedupto; /* rows before this one are aggregated */
+	WindowAggStatus status;		/* run status of WindowAggState */
 
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	ExprState  *startOffset;	/* expression for starting bound offset */
@@ -2458,8 +2471,17 @@ typedef struct WindowAggState
 	MemoryContext curaggcontext;	/* current aggregate's working data */
 	ExprContext *tmpcontext;	/* short-term evaluation context */
 
+	ExprState  *runcondition;	/* Condition which must remain true otherwise
+								 * execution of the WindowAgg will finish or
+								 * go into pass-through mode.  NULL when there
+								 * is no such condition. */
+
+	bool		use_pass_through;	/* When false, stop execution when
+									 * runcondition is no longer true.  Else
+									 * just stop evaluating window funcs. */
+	bool		top_window;		/* true if this is the top-most WindowAgg or
+								 * the only WindowAgg in this query level */
 	bool		all_first;		/* true if the scan is starting */
-	bool		all_done;		/* true if the scan is finished */
 	bool		partition_spooled;	/* true if all tuples in current partition
 									 * have been spooled into tuplestore */
 	bool		more_partitions;	/* true if there's more partitions after
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 300824258e..340d28f4e1 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -560,7 +560,8 @@ typedef enum NodeTag
 	T_SupportRequestSelectivity,	/* in nodes/supportnodes.h */
 	T_SupportRequestCost,		/* in nodes/supportnodes.h */
 	T_SupportRequestRows,		/* in nodes/supportnodes.h */
-	T_SupportRequestIndexCondition	/* in nodes/supportnodes.h */
+	T_SupportRequestIndexCondition, /* in nodes/supportnodes.h */
+	T_SupportRequestWFuncMonotonic	/* in nodes/supportnodes.h */
 } NodeTag;
 
 /*
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 8998d34560..b2cdf8249f 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1402,6 +1402,7 @@ typedef struct WindowClause
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runCondition;	/* qual to help short-circuit execution */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
 	Oid			inRangeColl;	/* collation for in_range tests */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 6cbcb67bdf..c5ab53e05c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1843,6 +1843,9 @@ typedef struct WindowAggPath
 	Path		path;
 	Path	   *subpath;		/* path representing input source */
 	WindowClause *winclause;	/* WindowClause we'll be using */
+	List	   *qual;			/* lower-level WindowAgg runconditions */
+	bool		topwindow;		/* false for all apart from the WindowAgg
+								 * that's closest to the root of the plan */
 } WindowAggPath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 10dd35f011..e43e360d9b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -926,12 +926,16 @@ typedef struct WindowAgg
 	int			frameOptions;	/* frame_clause options, see WindowDef */
 	Node	   *startOffset;	/* expression for starting bound, if any */
 	Node	   *endOffset;		/* expression for ending bound, if any */
+	List	   *runCondition;	/* qual to help short-circuit execution */
+	List	   *runConditionOrig;	/* runCondition for display in EXPLAIN */
 	/* these fields are used with RANGE offset PRECEDING/FOLLOWING: */
 	Oid			startInRangeFunc;	/* in_range function for startOffset */
 	Oid			endInRangeFunc; /* in_range function for endOffset */
 	Oid			inRangeColl;	/* collation for in_range tests */
 	bool		inRangeAsc;		/* use ASC sort order for in_range tests? */
 	bool		inRangeNullsFirst;	/* nulls sort first for in_range tests? */
+	bool		topWindow;		/* false for all apart from the WindowAgg
+								 * that's closest to the root of the plan */
 } WindowAgg;
 
 /* ----------------
@@ -1324,4 +1328,21 @@ typedef struct PlanInvalItem
 	uint32		hashValue;		/* hash value of object's cache lookup key */
 } PlanInvalItem;
 
+/*
+ * MonotonicFunction
+ *
+ * Allows the planner to track monotonic properties of functions.  A function
+ * is monotonically increasing if a subsequent call cannot yield a lower value
+ * than the previous call.  A monotonically decreasing function cannot yield a
+ * higher value on subsequent calls, and a function which is both must return
+ * the same value on each call.
+ */
+typedef enum MonotonicFunction
+{
+	MONOTONICFUNC_NONE = 0,
+	MONOTONICFUNC_INCREASING = (1 << 0),
+	MONOTONICFUNC_DECREASING = (1 << 1),
+	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING
+} MonotonicFunction;
+
 #endif							/* PLANNODES_H */
diff --git a/src/include/nodes/supportnodes.h b/src/include/nodes/supportnodes.h
index 88b61b3ab3..9fcbc39949 100644
--- a/src/include/nodes/supportnodes.h
+++ b/src/include/nodes/supportnodes.h
@@ -33,12 +33,12 @@
 #ifndef SUPPORTNODES_H
 #define SUPPORTNODES_H
 
-#include "nodes/primnodes.h"
+#include "nodes/plannodes.h"
 
 struct PlannerInfo;				/* avoid including pathnodes.h here */
 struct IndexOptInfo;
 struct SpecialJoinInfo;
-
+struct WindowClause;
 
 /*
  * The Simplify request allows the support function to perform plan-time
@@ -239,4 +239,64 @@ typedef struct SupportRequestIndexCondition
 								 * equivalent of the function call */
 } SupportRequestIndexCondition;
 
+/* ----------
+ * To support more efficient query execution of any monotonically increasing
+ * and/or monotonically decreasing window functions, we support calling the
+ * window function's prosupport function passing along this struct whenever
+ * the planner sees an OpExpr qual directly reference a window function in a
+ * subquery.  When the planner encounters this, we populate this struct and
+ * pass it along to the window function's prosupport function so that it can
+ * evaluate if the given WindowFunc is;
+ *
+ * a) monotonically increasing, or
+ * b) monotonically decreasing, or
+ * c) both monotonically increasing and decreasing, or
+ * d) none of the above.
+ *
+ * A function that is monotonically increasing can never return a value that
+ * is lower than a value returned in a "previous call".  A monotonically
+ * decreasing function can never return a value higher than a value returned
+ * in a previous call.  A function that is both must return the same value
+ * each time.
+ *
+ * We define "previous call" to mean a previous call to the same WindowFunc
+ * struct in the same window partition.
+ *
+ * row_number() is an example of a monotonically increasing function.  The
+ * return value will be reset back to 1 in each new partition.  An example of
+ * a monotonically increasing and decreasing function is COUNT(*) OVER ().
+ * Since there is no ORDER BY clause in this example, all rows in the
+ * partition are peers and all rows within the partition will be within the
+ * frame bound.  Likewise for COUNT(*) OVER(ORDER BY a ROWS BETWEEN UNBOUNDED
+ * PRECEDING AND UNBOUNDED FOLLOWING).
+ *
+ * COUNT(*) OVER (ORDER BY a ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
+ * is an example of a monotonically decreasing function.
+ *
+ * Implementations must only concern themselves with the given WindowFunc
+ * being monotonic in a single partition.
+ *
+ * Inputs:
+ *	'window_func' is the pointer to the window function being called.
+ *
+ *	'window_clause' pointer to the WindowClause data.  Support functions can
+ *	use this to check frame bounds, etc.
+ *
+ * Outputs:
+ *	'monotonic' the resulting MonotonicFunction value for the given input
+ *	window function and window clause.
+ * ----------
+ */
+typedef struct SupportRequestWFuncMonotonic
+{
+	NodeTag		type;
+
+	/* Input fields: */
+	WindowFunc *window_func;	/* Pointer to the window function data */
+	struct WindowClause *window_clause; /* Pointer to the window clause data */
+
+	/* Output fields: */
+	MonotonicFunction monotonic;
+} SupportRequestWFuncMonotonic;
+
 #endif							/* SUPPORTNODES_H */
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 6eca547af8..d2d46b15df 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -245,7 +245,9 @@ extern WindowAggPath *create_windowagg_path(PlannerInfo *root,
 											Path *subpath,
 											PathTarget *target,
 											List *windowFuncs,
-											WindowClause *winclause);
+											WindowClause *winclause,
+											List *qual,
+											bool topwindow);
 extern SetOpPath *create_setop_path(PlannerInfo *root,
 									RelOptInfo *rel,
 									Path *subpath,
diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out
index bb9ff7f07b..d78b4c463c 100644
--- a/src/test/regress/expected/window.out
+++ b/src/test/regress/expected/window.out
@@ -3336,6 +3336,404 @@ WHERE depname = 'sales';
                            ->  Seq Scan on empsalary
 (9 rows)
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                  QUERY PLAN                  
+----------------------------------------------
+ WindowAgg
+   Run Condition: (row_number() OVER (?) < 3)
+   ->  Sort
+         Sort Key: empsalary.empno
+         ->  Seq Scan on empsalary
+(5 rows)
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+ empno | rn 
+-------+----
+     1 |  1
+     2 |  2
+(2 rows)
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+               QUERY PLAN                
+-----------------------------------------
+ WindowAgg
+   Run Condition: (rank() OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+ empno | salary | r 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 2
+    11 |   5200 | 2
+(3 rows)
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Run Condition: (dense_rank() OVER (?) <= 1)
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+ empno | salary | dr 
+-------+--------+----
+     8 |   6000 |  1
+(1 row)
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                       QUERY PLAN                        
+---------------------------------------------------------
+ WindowAgg
+   Run Condition: (count(empsalary.empno) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno | salary | c 
+-------+--------+---
+     8 |   6000 | 1
+    10 |   5200 | 3
+    11 |   5200 | 3
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+                QUERY PLAN                 
+-------------------------------------------
+ WindowAgg
+   Run Condition: (count(*) OVER (?) >= 3)
+   ->  Sort
+         Sort Key: empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER () c
+   FROM empsalary) emp
+WHERE 11 <= c;
+                 QUERY PLAN                 
+--------------------------------------------
+ WindowAgg
+   Run Condition: (11 <= count(*) OVER (?))
+   ->  Seq Scan on empsalary
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Run Condition: (dense_rank() OVER (?) <= 1)
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(7 rows)
+
+-- Ensure we get a run condition when there's a PARTITION BY clause
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+                      QUERY PLAN                      
+------------------------------------------------------
+ WindowAgg
+   Run Condition: (row_number() OVER (?) < 3)
+   ->  Sort
+         Sort Key: empsalary.depname, empsalary.empno
+         ->  Seq Scan on empsalary
+(5 rows)
+
+-- and ensure we get the correct results from the above plan
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+ empno |  depname  | rn 
+-------+-----------+----
+     7 | develop   |  1
+     8 | develop   |  2
+     2 | personnel |  1
+     5 | personnel |  2
+     1 | sales     |  1
+     3 | sales     |  2
+(6 rows)
+
+-- likewise with count(empno) instead of row_number()
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ WindowAgg
+   Run Condition: (count(empsalary.empno) OVER (?) <= 3)
+   ->  Sort
+         Sort Key: empsalary.depname, empsalary.salary DESC
+         ->  Seq Scan on empsalary
+(5 rows)
+
+-- and again, check the results are what we expect.
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+ empno |  depname  | salary | c 
+-------+-----------+--------+---
+     8 | develop   |   6000 | 1
+    10 | develop   |   5200 | 3
+    11 | develop   |   5200 | 3
+     2 | personnel |   3900 | 1
+     5 | personnel |   3500 | 2
+     1 | sales     |   5000 | 1
+     4 | sales     |   4800 | 3
+     3 | sales     |   4800 | 3
+(8 rows)
+
+-- Some more complex cases with multiple window clauses
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT *,
+          count(salary) OVER (PARTITION BY depname || '') c1, -- w1
+          row_number() OVER (PARTITION BY depname) rn, -- w2
+          count(*) OVER (PARTITION BY depname) c2, -- w2
+          count(*) OVER (PARTITION BY '' || depname) c3 -- w3
+   FROM empsalary
+) e WHERE rn <= 1 AND c1 <= 3;
+                                        QUERY PLAN                                         
+-------------------------------------------------------------------------------------------
+ Subquery Scan on e
+   ->  WindowAgg
+         Filter: ((row_number() OVER (?)) <= 1)
+         Run Condition: (count(empsalary.salary) OVER (?) <= 3)
+         ->  Sort
+               Sort Key: (((empsalary.depname)::text || ''::text))
+               ->  WindowAgg
+                     Run Condition: (row_number() OVER (?) <= 1)
+                     ->  Sort
+                           Sort Key: empsalary.depname
+                           ->  WindowAgg
+                                 ->  Sort
+                                       Sort Key: ((''::text || (empsalary.depname)::text))
+                                       ->  Seq Scan on empsalary
+(14 rows)
+
+-- Ensure we correctly filter out all of the run conditions from each window
+SELECT * FROM
+  (SELECT *,
+          count(salary) OVER (PARTITION BY depname || '') c1, -- w1
+          row_number() OVER (PARTITION BY depname) rn, -- w2
+          count(*) OVER (PARTITION BY depname) c2, -- w2
+          count(*) OVER (PARTITION BY '' || depname) c3 -- w3
+   FROM empsalary
+) e WHERE rn <= 1 AND c1 <= 3;
+  depname  | empno | salary | enroll_date | c1 | rn | c2 | c3 
+-----------+-------+--------+-------------+----+----+----+----
+ personnel |     5 |   3500 | 12-10-2007  |  2 |  1 |  2 |  2
+ sales     |     3 |   4800 | 08-01-2007  |  3 |  1 |  3 |  3
+(2 rows)
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+-- Ensure we don't push down when the frame options show that the window
+-- function is not monotonically increasing
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+                  QUERY PLAN                   
+-----------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.c <= 3)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary DESC
+               ->  Seq Scan on empsalary
+(6 rows)
+
+-- Ensure we don't push down when the window function's monotonic properties
+-- don't match that of the clauses.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+                QUERY PLAN                
+------------------------------------------
+ Subquery Scan on emp
+   Filter: (3 <= emp.c)
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: empsalary.salary
+               ->  Seq Scan on empsalary
+(6 rows)
+
+-- Ensure we don't pushdown when there are multiple window clauses to evaluate
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY empno DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Subquery Scan on emp
+   Filter: (emp.dr = 1)
+   ->  WindowAgg
+         Filter: ((dense_rank() OVER (?)) <= 1)
+         ->  Sort
+               Sort Key: empsalary.empno DESC
+               ->  WindowAgg
+                     Run Condition: (dense_rank() OVER (?) <= 1)
+                     ->  Sort
+                           Sort Key: empsalary.salary DESC
+                           ->  Seq Scan on empsalary
+(11 rows)
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
diff --git a/src/test/regress/sql/window.sql b/src/test/regress/sql/window.sql
index 41a8e0d152..967b9413de 100644
--- a/src/test/regress/sql/window.sql
+++ b/src/test/regress/sql/window.sql
@@ -988,6 +988,212 @@ SELECT * FROM
    FROM empsalary) emp
 WHERE depname = 'sales';
 
+-- Test window function run conditions are properly pushed down into the
+-- WindowAgg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- The following 3 statements should result the same result.
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 3 > rn;
+
+SELECT * FROM
+  (SELECT empno,
+          row_number() OVER (ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE 2 >= rn;
+
+-- Ensure r <= 3 is pushed down into the run condition of the window agg
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          rank() OVER (ORDER BY salary DESC) r
+   FROM empsalary) emp
+WHERE r <= 3;
+
+-- Ensure dr = 1 is converted to dr <= 1 to get all rows leading up to dr = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+-- Check COUNT() and COUNT(*)
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(empno) OVER (ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c >= 3;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER () c
+   FROM empsalary) emp
+WHERE 11 <= c;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
+-- Ensure we get a run condition when there's a PARTITION BY clause
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- and ensure we get the correct results from the above plan
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          row_number() OVER (PARTITION BY depname ORDER BY empno) rn
+   FROM empsalary) emp
+WHERE rn < 3;
+
+-- likewise with count(empno) instead of row_number()
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+-- and again, check the results are what we expect.
+SELECT * FROM
+  (SELECT empno,
+          depname,
+          salary,
+          count(empno) OVER (PARTITION BY depname ORDER BY salary DESC) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+-- Some more complex cases with multiple window clauses
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT *,
+          count(salary) OVER (PARTITION BY depname || '') c1, -- w1
+          row_number() OVER (PARTITION BY depname) rn, -- w2
+          count(*) OVER (PARTITION BY depname) c2, -- w2
+          count(*) OVER (PARTITION BY '' || depname) c3 -- w3
+   FROM empsalary
+) e WHERE rn <= 1 AND c1 <= 3;
+
+-- Ensure we correctly filter out all of the run conditions from each window
+SELECT * FROM
+  (SELECT *,
+          count(salary) OVER (PARTITION BY depname || '') c1, -- w1
+          row_number() OVER (PARTITION BY depname) rn, -- w2
+          count(*) OVER (PARTITION BY depname) c2, -- w2
+          count(*) OVER (PARTITION BY '' || depname) c3 -- w3
+   FROM empsalary
+) e WHERE rn <= 1 AND c1 <= 3;
+
+-- Tests to ensure we don't push down the run condition when it's not valid to
+-- do so.
+
+-- Ensure we don't push down when the frame options show that the window
+-- function is not monotonically increasing
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) c
+   FROM empsalary) emp
+WHERE c <= 3;
+
+-- Ensure we don't push down when the window function's monotonic properties
+-- don't match that of the clauses.
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY salary) c
+   FROM empsalary) emp
+WHERE 3 <= c;
+
+-- Ensure we don't pushdown when there are multiple window clauses to evaluate
+EXPLAIN (COSTS OFF)
+SELECT * FROM
+  (SELECT empno,
+          salary,
+          count(*) OVER (ORDER BY empno DESC) c,
+          dense_rank() OVER (ORDER BY salary DESC) dr
+   FROM empsalary) emp
+WHERE dr = 1;
+
 -- Test Sort node collapsing
 EXPLAIN (COSTS OFF)
 SELECT * FROM
-- 
2.35.1.windows.2
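
For readers skimming the regression tests above, the rule they exercise can be sketched in a few lines of C: a qual can serve as a run condition only when its comparison direction agrees with the window function's monotonic direction. This is an illustrative sketch, not the planner code from the patch. The MonotonicFunction enum is copied from the plannodes.h hunk; CmpOp and run_condition_op() are hypothetical names, the qual is assumed to be already commuted so that the window function is on the left, and the >= rewrite for decreasing functions under equality is assumed by symmetry rather than shown by any test.

```c
#include <stdbool.h>

/* Copied from the patch's plannodes.h hunk. */
typedef enum MonotonicFunction
{
	MONOTONICFUNC_NONE = 0,
	MONOTONICFUNC_INCREASING = (1 << 0),
	MONOTONICFUNC_DECREASING = (1 << 1),
	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING
} MonotonicFunction;

/* Hypothetical: operator in "wfunc <op> const", window function on the left. */
typedef enum CmpOp
{
	CMP_NONE, CMP_LT, CMP_LE, CMP_EQ, CMP_GE, CMP_GT
} CmpOp;

/*
 * Hypothetical helper.  Returns the operator to use in the run condition,
 * or CMP_NONE when the qual cannot be used to stop early.  *keep_original
 * is set to true when the original qual must stay behind as a filter.
 */
CmpOp
run_condition_op(MonotonicFunction mono, CmpOp op, bool *keep_original)
{
	*keep_original = true;		/* conservative default */

	switch (op)
	{
		case CMP_LT:
		case CMP_LE:
			/* Once an increasing function exceeds the bound, it stays above. */
			if (mono & MONOTONICFUNC_INCREASING)
			{
				*keep_original = false;
				return op;
			}
			return CMP_NONE;

		case CMP_GT:
		case CMP_GE:
			/* Once a decreasing function drops below the bound, it stays below. */
			if (mono & MONOTONICFUNC_DECREASING)
			{
				*keep_original = false;
				return op;
			}
			return CMP_NONE;

		case CMP_EQ:
			/* "rn = 10" becomes "rn <= 10"; the equality remains a filter. */
			if (mono & MONOTONICFUNC_INCREASING)
				return CMP_LE;
			/* Assumed by symmetry; not covered by the tests above. */
			if (mono & MONOTONICFUNC_DECREASING)
				return CMP_GE;
			return CMP_NONE;

		default:
			return CMP_NONE;
	}
}
```

With this model, rank() with r <= 3 yields CMP_LE with nothing left to filter, dense_rank() with dr = 1 yields CMP_LE with the original filter kept, and an increasing count(*) with 3 <= c yields CMP_NONE, which matches the EXPLAIN output above.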

#32

Zhihong Yu

zyu@yugabyte.com

almost 4 years ago

In reply to: David Rowley (#31)

Re: Window Function "Run Conditions"

On Thu, Apr 7, 2022 at 7:11 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 7 Apr 2022 at 19:01, David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 7 Apr 2022 at 15:41, Zhihong Yu <zyu@yugabyte.com> wrote:

+                * We must keep the original qual in place if there is a
+                * PARTITION BY clause as the top-level WindowAgg remains in
+                * pass-through mode and does nothing to filter out unwanted
+                * tuples.
+                */
+               *keep_original = false;
The comment talks about keeping original qual but the assignment uses
the value false.

Maybe the comment can be rephrased so that it matches the assignment.

Thanks. I've just removed that comment locally now. You're right, it
was out of date.

I've attached the updated patch with the fixed comment and a few other
comments reworded slightly.

I've also pgindented the patch.

Barring any objection, I'm planning to push this one in around 10 hours time.

David

Hi,

+   WINDOWAGG_PASSTHROUGH_STRICT    /* Pass-through plus don't store new
+                                    * tuples during spool */

I think the comment in code is illustrative:

+                    * STRICT pass-through mode is required for the top window
+                    * when there is a PARTITION BY clause.  Otherwise we must
+                    * ensure we store tuples that don't match the
+                    * runcondition so they're available to WindowAggs above.

If you think the above is too long to repeat where WINDOWAGG_PASSTHROUGH_STRICT
is defined, maybe the enum comment could point to the longer version so that
people can find it more easily.
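
To make the suggestion concrete, here is a rough model of how the status could be chosen once the run condition stops being true, pieced together from the field comments in execnodes.h and the comment quoted above. The enum is copied from the patch, with a cross-reference added to its last comment as an example of the pointer being suggested. status_after_runcondition_fails() and needs_pass_through() are hypothetical helpers, and the rule in needs_pass_through() is inferred from those comments, not taken from nodeWindowAgg.c.

```c
#include <stdbool.h>

/* Copied from the patch's execnodes.h hunk, last comment extended. */
typedef enum WindowAggStatus
{
	WINDOWAGG_DONE,				/* No more processing to do */
	WINDOWAGG_RUN,				/* Normal processing of window funcs */
	WINDOWAGG_PASSTHROUGH,		/* Don't eval window funcs */
	WINDOWAGG_PASSTHROUGH_STRICT	/* Pass-through plus don't store new
									 * tuples during spool.  See the longer
									 * comment where this status is set. */
} WindowAggStatus;

/*
 * Hypothetical: a WindowAgg cannot simply stop when it has a PARTITION BY
 * clause (later partitions may still match) or when another WindowAgg
 * above it still needs every tuple.
 */
bool
needs_pass_through(bool top_window, bool has_partition_by)
{
	return !top_window || has_partition_by;
}

/*
 * Hypothetical: the status to move to when the run condition first
 * evaluates to false.
 */
WindowAggStatus
status_after_runcondition_fails(bool use_pass_through, bool top_window)
{
	/* Nothing later can match and nobody above needs the rows: stop. */
	if (!use_pass_through)
		return WINDOWAGG_DONE;

	/*
	 * The top window only has to skip to the next partition, so it need
	 * not store the non-matching tuples.  A lower window must keep them
	 * available to the WindowAggs above it.
	 */
	return top_window ? WINDOWAGG_PASSTHROUGH_STRICT : WINDOWAGG_PASSTHROUGH;
}
```

Under these assumptions, a lone WindowAgg without PARTITION BY goes straight to WINDOWAGG_DONE, the same node with PARTITION BY goes to WINDOWAGG_PASSTHROUGH_STRICT, and any WindowAgg below another one goes to WINDOWAGG_PASSTHROUGH.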

Cheers

#33

David Rowley

dgrowleyml@gmail.com

almost 4 years ago

In reply to: David Rowley (#31)

Re: Window Function "Run Conditions"

On Fri, 8 Apr 2022 at 02:11, David Rowley <dgrowleyml@gmail.com> wrote:

Barring any objection, I'm planning to push this one in around 10 hours time.

Pushed. 9d9c02ccd

Thank you all for the reviews.

David