Fix incorrect start up costs for WindowAgg paths (bug #17862)
Over on [1] (message-id 17862-1ab8f74b0f7b0611@postgresql.org), Tim reported that the planner is making some bad choices
when the plan contains a WindowFunc which requires reading all of, or
a large portion of the WindowAgg subnode in order to produce the first
WindowAgg row.
For example:
EXPLAIN (ANALYZE, TIMING OFF)
SELECT COUNT(*) OVER ()
FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
LIMIT 1;
With master, we get the following plan:
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..0.67 rows=1 width=8) (actual time=47.491..47.492 rows=1 loops=1)
   ->  WindowAgg  (cost=0.29..3815.00 rows=10000 width=8) (actual time=47.489..47.490 rows=1 loops=1)
         ->  Nested Loop  (cost=0.29..3690.00 rows=10000 width=0) (actual time=0.026..42.972 rows=10000 loops=1)
               ->  Seq Scan on tenk1 t2  (cost=0.00..445.00 rows=10000 width=4) (actual time=0.009..1.734 rows=10000 loops=1)
               ->  Index Only Scan using tenk1_unique1 on tenk1 t1  (cost=0.29..0.31 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=10000)
                     Index Cond: (unique1 = t2.tenthous)
                     Heap Fetches: 0
 Planning Time: 0.420 ms
 Execution Time: 48.107 ms
You can see that the time to get the first WindowAgg row (47.489 ms)
is not well aligned to the startup cost (0.29). This effectively
causes the planner to choose a Nested Loop plan as it thinks it'll
read just 1 row from the join. Due to the OVER (), we'll read all
rows! Not good.
It's not hard to imagine that a slightly different schema could yield
a *far* worse plan if the planner opted to use a non-parameterised
nested loop and proceeded to read all rows from it.
With the attached patch, that turns into:
                                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=928.02..928.02 rows=1 width=8) (actual time=29.308..29.310 rows=1 loops=1)
   ->  WindowAgg  (cost=928.02..928.07 rows=10000 width=8) (actual time=29.306..29.308 rows=1 loops=1)
         ->  Hash Join  (cost=395.57..803.07 rows=10000 width=0) (actual time=10.674..22.032 rows=10000 loops=1)
               Hash Cond: (t1.unique1 = t2.tenthous)
               ->  Index Only Scan using tenk1_unique1 on tenk1 t1  (cost=0.29..270.29 rows=10000 width=4) (actual time=0.036..4.961 rows=10000 loops=1)
                     Heap Fetches: 0
               ->  Hash  (cost=270.29..270.29 rows=10000 width=4) (actual time=10.581..10.582 rows=10000 loops=1)
                     Buckets: 16384  Batches: 1  Memory Usage: 480kB
                     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 t2  (cost=0.29..270.29 rows=10000 width=4) (actual time=0.055..5.437 rows=10000 loops=1)
                           Heap Fetches: 0
 Planning Time: 2.415 ms
 Execution Time: 30.554 ms
I'm not sure if we should consider backpatching a fix for this bug.
We tend not to commit stuff that would destabilise plans in the back
branches. On the other hand, it's fairly hard to imagine how we
could make this much worse even given bad estimates.
I do think we should fix this in v16, however.
I'll add this to the "Older bugs affecting stable branches" section of
the PG 16 open items list.
David
Attachments:
fix_bug_17862.patch (application/octet-stream)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 0a2562c149..00f80e4f34 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2809,6 +2809,96 @@ cost_agg(Path *path, PlannerInfo *root,
path->total_cost = total_cost;
}
+/*
+ * get_windowclause_startup_tuples
+ * Determine how many tuples we'll need to fetch from a WindowAgg's
+ * subnode before we can output the first WindowAgg tuple.
+ */
+static double
+get_windowclause_startup_tuples(PlannerInfo *root, WindowClause *wc,
+ double input_tuples)
+{
+ double partition_tuples;
+
+ /* Figure out how many partitions there are likely to be */
+ if (wc->partitionClause != NIL)
+ {
+ double num_partitions;
+ List *partexprs = get_sortgrouplist_exprs(wc->partitionClause,
+ root->parse->targetList);
+
+ num_partitions = estimate_num_groups(root, partexprs, input_tuples,
+ NULL, NULL);
+ list_free(partexprs);
+
+ partition_tuples = input_tuples / num_partitions;
+ }
+ else
+ {
+ /* all tuples belong to the same partition */
+ partition_tuples = input_tuples;
+ }
+
+ /*
+ * An empty ORDER BY means all rows are peers and WindowAgg will process
+ * the entire partition for the first tuple.
+ */
+ if (wc->orderClause == NIL)
+ return clamp_row_est(partition_tuples);
+
+ /*
+ * BETWEEN ... AND CURRENT ROW means the executor only needs to read at
+ * most 1 row from the subnode before outputting the first tuple.
+ */
+ if (wc->frameOptions & FRAMEOPTION_END_CURRENT_ROW)
+ return 1.0;
+
+ /*
+ * BETWEEN ... AND UNBOUNDED FOLLOWING means we need to read the entire
+ * partition before outputting the first tuple.
+ */
+ if (wc->frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING)
+ return clamp_row_est(partition_tuples);
+
+ /*
+ * BETWEEN ... AND N PRECEDING does not need to read from the subnode for
+ * the first N tuples. To be safe, we return 1.0. The endOffset could be
+ * 0.
+ */
+ if (wc->frameOptions & FRAMEOPTION_END_OFFSET_PRECEDING)
+ return 1.0;
+
+ /*
+ * BETWEEN ... AND N FOLLOWING is more complex. We need to try to
+ * determine the value of endOffset.
+ */
+ if (wc->frameOptions & FRAMEOPTION_END_OFFSET_FOLLOWING &&
+ IsA(wc->endOffset, Const))
+ {
+ Const *endOffset = (Const *) wc->endOffset;
+
+ if (endOffset->constisnull)
+ {
+ /*
+ * This should raise an ERROR during execution, but we'd better do
+ * something sane here.
+ */
+ return 1.0;
+ }
+
+ /* XXX do we expect any other type of Const? */
+ else if (endOffset->consttype == INT4OID)
+ return clamp_row_est(Min(partition_tuples,
+ (double) DatumGetInt32(endOffset->constvalue)));
+ else if (endOffset->consttype == INT8OID)
+ return clamp_row_est(Min(partition_tuples,
+ (double) DatumGetInt64(endOffset->constvalue)));
+ }
+
+ /* Just assume we'll need DEFAULT_INEQ_SEL of the partition's tuples */
+ return clamp_row_est(partition_tuples * DEFAULT_INEQ_SEL);
+}
+
/*
* cost_windowagg
* Determines and returns the cost of performing a WindowAgg plan node,
@@ -2818,17 +2908,34 @@ cost_agg(Path *path, PlannerInfo *root,
*/
void
cost_windowagg(Path *path, PlannerInfo *root,
- List *windowFuncs, int numPartCols, int numOrderCols,
+ List *windowFuncs, WindowClause *winclause,
Cost input_startup_cost, Cost input_total_cost,
double input_tuples)
{
Cost startup_cost;
Cost total_cost;
+ double startup_tuples;
+ int numPartCols;
+ int numOrderCols;
ListCell *lc;
+ numPartCols = list_length(winclause->partitionClause);
+ numOrderCols = list_length(winclause->orderClause);
+
startup_cost = input_startup_cost;
total_cost = input_total_cost;
+ /*
+ * Estimate how many tuples we'll need to read from the subnode before we
+ * can output the first WindowAgg row.
+ *
+ * XXX is it worth caching this value in the WindowClause to prevent
+ * having to calculate it for each path we generate? The number of
+ * input_tuples could vary from path to path, however.
+ */
+ startup_tuples = get_windowclause_startup_tuples(root, winclause,
+ input_tuples);
+
/*
* Window functions are assumed to cost their stated execution cost, plus
* the cost of evaluating their input expressions, per tuple. Since they
@@ -2880,6 +2987,18 @@ cost_windowagg(Path *path, PlannerInfo *root,
path->rows = input_tuples;
path->startup_cost = startup_cost;
path->total_cost = total_cost;
+
+ /*
+ * Also take into account how many tuples we need to read from the subnode
+ * in order to produce the first tuple from the WindowAgg. To do this we
+ * proportion the run cost (total cost not including startup costs) over
+ * the estimated startup tuples. We already included the startup cost of
+ * the subnode, so we only need to do this when the estimated startup
+ * tuples is above 1.0.
+ */
+ if (startup_tuples > 1.0)
+ path->startup_cost += (total_cost - startup_cost) / input_tuples *
+ (startup_tuples - 1.0);
}
/*
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 65a191ebfd..8c4dbff00a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3445,8 +3445,7 @@ create_windowagg_path(PlannerInfo *root,
*/
cost_windowagg(&pathnode->path, root,
windowFuncs,
- list_length(winclause->partitionClause),
- list_length(winclause->orderClause),
+ winclause,
subpath->startup_cost,
subpath->total_cost,
subpath->rows);
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 6cf49705d3..bee090ffc2 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -131,7 +131,7 @@ extern void cost_agg(Path *path, PlannerInfo *root,
Cost input_startup_cost, Cost input_total_cost,
double input_tuples, double input_width);
extern void cost_windowagg(Path *path, PlannerInfo *root,
- List *windowFuncs, int numPartCols, int numOrderCols,
+ List *windowFuncs, WindowClause *winclause,
Cost input_startup_cost, Cost input_total_cost,
double input_tuples);
extern void cost_group(Path *path, PlannerInfo *root,
diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out
index 1d4b78b9b2..2628033327 100644
--- a/src/test/regress/expected/window.out
+++ b/src/test/regress/expected/window.out
@@ -4808,6 +4808,80 @@ SELECT i, b, bool_and(b) OVER w, bool_or(b) OVER w
5 | t | t | t
(5 rows)
+--
+-- Test WindowAgg costing takes into account the number of rows that need to
+-- be fetched before the first row can be output.
+--
+-- Ensure we get a cheap start up plan as the WindowAgg can output the first
+-- row after reading 1 row from the join.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Limit
+ -> WindowAgg
+ -> Nested Loop
+ -> Index Only Scan using tenk1_unique1 on tenk1 t1
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 t2
+ Index Cond: (tenthous = t1.unique1)
+(6 rows)
+
+-- Ensure we get a cheap total plan. Lack of ORDER BY in the WindowClause
+-- means that all rows must be read from the join, so a cheap startup plan
+-- isn't a good choice.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER ()
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Limit
+ -> WindowAgg
+ -> Hash Join
+ Hash Cond: (t1.unique1 = t2.tenthous)
+ -> Index Only Scan using tenk1_unique1 on tenk1 t1
+ -> Hash
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 t2
+(7 rows)
+
+-- Ensure we get a cheap total plan. This time use UNBOUNDED FOLLOWING, which
+-- needs to read all join rows to output the first WindowAgg row.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Limit
+ -> WindowAgg
+ -> Merge Join
+ Merge Cond: (t1.unique1 = t2.tenthous)
+ -> Index Only Scan using tenk1_unique1 on tenk1 t1
+ -> Sort
+ Sort Key: t2.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 t2
+(8 rows)
+
+-- Ensure we get a cheap total plan. This time use 10000 FOLLOWING so we need
+-- to read all join rows.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1 ROWS BETWEEN UNBOUNDED PRECEDING AND 10000 FOLLOWING)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Limit
+ -> WindowAgg
+ -> Merge Join
+ Merge Cond: (t1.unique1 = t2.tenthous)
+ -> Index Only Scan using tenk1_unique1 on tenk1 t1
+ -> Sort
+ Sort Key: t2.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 t2
+(8 rows)
+
-- Tests for problems with failure to walk or mutate expressions
-- within window frame clauses.
-- test walker (fails with collation error if expressions are not walked)
diff --git a/src/test/regress/sql/window.sql b/src/test/regress/sql/window.sql
index 3ab6ac715d..4789de0937 100644
--- a/src/test/regress/sql/window.sql
+++ b/src/test/regress/sql/window.sql
@@ -1716,6 +1716,40 @@ SELECT i, b, bool_and(b) OVER w, bool_or(b) OVER w
FROM (VALUES (1,true), (2,true), (3,false), (4,false), (5,true)) v(i,b)
WINDOW w AS (ORDER BY i ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING);
+--
+-- Test WindowAgg costing takes into account the number of rows that need to
+-- be fetched before the first row can be output.
+--
+
+-- Ensure we get a cheap start up plan as the WindowAgg can output the first
+-- row after reading 1 row from the join.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+
+-- Ensure we get a cheap total plan. Lack of ORDER BY in the WindowClause
+-- means that all rows must be read from the join, so a cheap startup plan
+-- isn't a good choice.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER ()
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+
+-- Ensure we get a cheap total plan. This time use UNBOUNDED FOLLOWING, which
+-- needs to read all join rows to output the first WindowAgg row.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+
+-- Ensure we get a cheap total plan. This time use 10000 FOLLOWING so we need
+-- to read all join rows.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1 ROWS BETWEEN UNBOUNDED PRECEDING AND 10000 FOLLOWING)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+
-- Tests for problems with failure to walk or mutate expressions
-- within window frame clauses.
On Wed, Apr 12, 2023 at 5:04 PM David Rowley <dgrowleyml@gmail.com> wrote:
With the attached patch, that turns into:
The concept of startup_tuples for a WindowAgg looks good to me, but I
can't follow up with the below line:
+ return clamp_row_est(partition_tuples * DEFAULT_INEQ_SEL);
# select count(*) over() from tenk1 limit 1;
count
-------
10000 --> We need to scan all the tuples.
Should we just return clamp_row_est(partition_tuples)?
--
Best Regards
Andy Fan
On Thu, 13 Apr 2023 at 02:28, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> The concept of startup_tuples for a WindowAgg looks good to me, but I
> can't follow up with the below line:
>
> + return clamp_row_est(partition_tuples * DEFAULT_INEQ_SEL);
>
> # select count(*) over() from tenk1 limit 1;
>  count
> -------
>  10000   --> We need to scan all the tuples.
>
> Should we just return clamp_row_est(partition_tuples)?

For the case you've shown, it will.  It's handled by this code:

    if (wc->orderClause == NIL)
        return clamp_row_est(partition_tuples);
It would take something like the following to hit the code you're
concerned about:
explain select count(*) over(order by unique1 rows between unbounded
preceding and 10*random() following) from tenk1;
                                  QUERY PLAN
--------------------------------------------------------------------------------------------
 WindowAgg  (cost=140.23..420.29 rows=10000 width=12)
   ->  Index Only Scan using tenk1_unique1 on tenk1  (cost=0.29..270.29 rows=10000 width=4)
(2 rows)
You can see the startup cost is about 33% of the total cost for that,
which is from the DEFAULT_INEQ_SEL. I'm not exactly set on that
having to be DEFAULT_INEQ_SEL, but I'm not really sure what we could
put that's better. I don't really follow why assuming all rows are
required is better. That'll just mean we favour cheap startup plans
less, but there might be a case where a cheap startup plan is
favourable. I was opting for a happy medium when I thought to use
DEFAULT_INEQ_SEL.
I also see I might need to do a bit more work on this as the following
is not handled correctly:
select count(*) over(rows between unbounded preceding and 10
following) from tenk1;
it's assuming all rows due to lack of ORDER BY, but it seems like it
should be 10 rows due to the 10 FOLLOWING end bound.
David
On Thu, Apr 13, 2023 at 6:09 AM David Rowley <dgrowleyml@gmail.com> wrote:
> On Thu, 13 Apr 2023 at 02:28, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> > The concept of startup_tuples for a WindowAgg looks good to me, but I
> > can't follow up with the below line:
> >
> > + return clamp_row_est(partition_tuples * DEFAULT_INEQ_SEL);
> >
> > # select count(*) over() from tenk1 limit 1;
> >  count
> > -------
> >  10000   --> We need to scan all the tuples.
> >
> > Should we just return clamp_row_est(partition_tuples)?
>
> For the case you've shown, it will.  It's handled by this code:
>
>     if (wc->orderClause == NIL)
>         return clamp_row_est(partition_tuples);

My fault. I should have done some real debugging to double-check my
understanding; I surely will next time.

> It would take something like the following to hit the code you're
> concerned about:
>
> explain select count(*) over(order by unique1 rows between unbounded
> preceding and 10*random() following) from tenk1;
>                                   QUERY PLAN
> --------------------------------------------------------------------------------------------
>  WindowAgg  (cost=140.23..420.29 rows=10000 width=12)
>    ->  Index Only Scan using tenk1_unique1 on tenk1
>        (cost=0.29..270.29 rows=10000 width=4)
> (2 rows)
>
> You can see the startup cost is about 33% of the total cost for that,
> which is from the DEFAULT_INEQ_SEL.  I'm not exactly set on that
> having to be DEFAULT_INEQ_SEL, but I'm not really sure what we could
> put that's better.  I don't really follow why assuming all rows are
> required is better.  That'll just mean we favour cheap startup plans
> less, but there might be a case where a cheap startup plan is
> favourable.  I was opting for a happy medium when I thought to use
> DEFAULT_INEQ_SEL.

That looks reasonable to me. My suggestion came from my misreading;
it was a bit late in my time zone when writing. Thanks for the
detailed explanation!

> I also see I might need to do a bit more work on this as the following
> is not handled correctly:
>
> select count(*) over(rows between unbounded preceding and 10
> following) from tenk1;
>
> it's assuming all rows due to lack of ORDER BY, but it seems like it
> should be 10 rows due to the 10 FOLLOWING end bound.

True to me.
--
Best Regards
Andy Fan
On Thu, 13 Apr 2023 at 10:09, David Rowley <dgrowleyml@gmail.com> wrote:
> I also see I might need to do a bit more work on this as the following
> is not handled correctly:
>
> select count(*) over(rows between unbounded preceding and 10
> following) from tenk1;
>
> it's assuming all rows due to lack of ORDER BY, but it seems like it
> should be 10 rows due to the 10 FOLLOWING end bound.
Well, as it turned out, it was quite a bit more work. The frame
options have had quite a few additions since I last looked in detail.
I've attached v2 of the patch. I've included a DEBUG1 message which
is useful to check what the estimate comes out as without having to
have a debugger attached all the time.
Here are a few samples of the estimator getting things right:
# select count(*) over (order by four range between unbounded
preceding and 2 following exclude current row) from tenk1 limit 1;
DEBUG: startup_tuples = 7499
count
-------
7499
# select count(*) over (order by four rows between unbounded preceding
and 4000 following) from tenk1 limit 1;
DEBUG: startup_tuples = 4001
count
-------
4001
# select count(*) over (order by four rows between unbounded preceding
and 4000 following exclude group) from tenk1 limit 1;
DEBUG: startup_tuples = 1501
count
-------
1501
You can see in each case, startup_tuples was estimated correctly as
confirmed by count(*) during execution.
I've attached some more of these in sample_tests.txt, which all are
correct with the caveat of get_windowclause_startup_tuples() never
returning 0 due to it using clamp_row_est(). In practice, that's a
non-issue due to the way the startup_tuples value is used to calculate
the startup costs.
David
Attachments:
fix_bug_17862_v2.patch (application/octet-stream)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 0a2562c149..e9b503063a 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2809,6 +2809,190 @@ cost_agg(Path *path, PlannerInfo *root,
path->total_cost = total_cost;
}
+/*
+ * get_windowclause_startup_tuples
+ * Determine how many tuples we'll need to fetch from a WindowAgg's
+ * subnode before we can output the first WindowAgg tuple.
+ */
+static double
+get_windowclause_startup_tuples(PlannerInfo *root, WindowClause *wc,
+ double input_tuples)
+{
+ int frameOptions = wc->frameOptions;
+ double partition_tuples;
+ double return_tuples;
+ double peer_tuples;
+
+ /* Figure out how many partitions there are likely to be */
+ if (wc->partitionClause != NIL)
+ {
+ double num_partitions;
+ List *partexprs = get_sortgrouplist_exprs(wc->partitionClause,
+ root->parse->targetList);
+
+ num_partitions = estimate_num_groups(root, partexprs, input_tuples,
+ NULL, NULL);
+ list_free(partexprs);
+
+ partition_tuples = input_tuples / num_partitions;
+ }
+ else
+ {
+ /* all tuples belong to the same partition */
+ partition_tuples = input_tuples;
+ }
+
+ /* estimate the number of tuples in each peer group */
+ if (wc->orderClause != NIL)
+ {
+ double num_groups;
+ List *orderexprs;
+
+ orderexprs = get_sortgrouplist_exprs(wc->orderClause,
+ root->parse->targetList);
+
+ /*
+ * First figure out how many peer groups there are likely to be in the
+ * partition.
+ */
+ num_groups = estimate_num_groups(root, orderexprs,
+ partition_tuples, NULL,
+ NULL);
+ list_free(orderexprs);
+ peer_tuples = partition_tuples / num_groups;
+ }
+ else
+ {
+ /* no ORDER BY so only 1 tuple belongs in each peer group */
+ peer_tuples = 1.0;
+ }
+
+ if (frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING)
+ {
+ /* include all partition rows */
+ return_tuples = partition_tuples;
+ }
+ else if (frameOptions & FRAMEOPTION_END_CURRENT_ROW)
+ {
+ if (frameOptions & FRAMEOPTION_ROWS)
+ {
+ /* just count the current row */
+ return_tuples = 1.0;
+ }
+ else if (frameOptions & (FRAMEOPTION_RANGE | FRAMEOPTION_GROUPS))
+ {
+ /*
+ * When in RANGE/GROUPS mode, it's more complex. If there's no
+ * ORDER BY, then all rows in the partition are peers, otherwise
+ * we'll need to read the first group of peers.
+ */
+ if (wc->orderClause == NIL)
+ return_tuples = partition_tuples;
+ else
+ return_tuples = peer_tuples;
+ }
+ else
+ {
+ Assert(false);
+ return_tuples = 1.0;
+ }
+ }
+ else if (frameOptions & FRAMEOPTION_END_OFFSET_PRECEDING)
+ {
+ /*
+ * BETWEEN ... AND N PRECEDING will only need to read the WindowAgg's
+ * subnode after N ROWS/RANGES/GROUPS. N can be 0, but not negative,
+ * so we'll just assume only the current row needs to be read to fetch
+ * the first WindowAgg row.
+ */
+ return_tuples = 1.0;
+ }
+ else if (frameOptions & FRAMEOPTION_END_OFFSET_FOLLOWING)
+ {
+ Const *endOffset = (Const *) wc->endOffset;
+ double end_offset_value;
+
+ /*
+ * Try and figure out the value specified in the endOffset. XXX is it
+ * better to use ExecEvalExprSwitchContext()?
+ */
+ if (!IsA(endOffset, Const))
+ {
+ /*
+ * When the end bound is not a Const, we'll just need to guess. We
+ * just make use of DEFAULT_INEQ_SEL.
+ */
+ end_offset_value = partition_tuples / peer_tuples * DEFAULT_INEQ_SEL;
+ }
+ else if (endOffset->constisnull)
+ {
+ /*
+ * NULLs are not allowed, but currently, there's no code to error
+ * out if there's a NULL Const. We'll only discover this during
+ * execution. For now, just pretend everything is ok and assume
+ * that just the current row/range/group will be needed.
+ */
+ end_offset_value = 1.0;
+ }
+ else
+ {
+ switch (endOffset->consttype)
+ {
+ case INT2OID:
+ end_offset_value = (double) DatumGetInt16(endOffset->constvalue);
+ break;
+ case INT4OID:
+ end_offset_value = (double) DatumGetInt32(endOffset->constvalue);
+ break;
+ case INT8OID:
+ end_offset_value = (double) DatumGetInt64(endOffset->constvalue);
+ break;
+ default:
+ end_offset_value = partition_tuples / peer_tuples * DEFAULT_INEQ_SEL;
+ break;
+ }
+ }
+
+ if (frameOptions & FRAMEOPTION_ROWS)
+ {
+ /* include the N FOLLOWING and the current row */
+ return_tuples = end_offset_value + 1.0;
+ }
+ else if (frameOptions & (FRAMEOPTION_RANGE | FRAMEOPTION_GROUPS))
+ {
+ /* include N FOLLOWING ranges/group and the initial range/group */
+ return_tuples = peer_tuples * (end_offset_value + 1.0);
+ }
+ else
+ {
+ Assert(false);
+ return_tuples = 1.0;
+ }
+ }
+ else
+ {
+ return_tuples = 1.0;
+ Assert(false);
+ }
+
+ /*
+ * Cap the return value so it's never higher than the expected tuples in
+ * the partition. This must be done before processing the exclusions as
+ * the exclusions are eliminated from the start of the frame, not the end.
+ */
+ return_tuples = Min(return_tuples, partition_tuples);
+
+ /* process any exclusions */
+ if (frameOptions & FRAMEOPTION_EXCLUDE_CURRENT_ROW)
+ return_tuples -= 1.0;
+ else if (frameOptions & FRAMEOPTION_EXCLUDE_GROUP)
+ return_tuples -= peer_tuples;
+ else if (frameOptions & FRAMEOPTION_EXCLUDE_TIES)
+ return_tuples -= (peer_tuples - 1.0);
+
+ return clamp_row_est(return_tuples);
+}
+
/*
* cost_windowagg
* Determines and returns the cost of performing a WindowAgg plan node,
@@ -2818,17 +3002,32 @@ cost_agg(Path *path, PlannerInfo *root,
*/
void
cost_windowagg(Path *path, PlannerInfo *root,
- List *windowFuncs, int numPartCols, int numOrderCols,
+ List *windowFuncs, WindowClause *winclause,
Cost input_startup_cost, Cost input_total_cost,
double input_tuples)
{
Cost startup_cost;
Cost total_cost;
+ double startup_tuples;
+ int numPartCols;
+ int numOrderCols;
ListCell *lc;
+ numPartCols = list_length(winclause->partitionClause);
+ numOrderCols = list_length(winclause->orderClause);
+
startup_cost = input_startup_cost;
total_cost = input_total_cost;
+ /*
+ * Estimate how many tuples we'll need to read from the subnode before we
+ * can output the first WindowAgg row.
+ */
+ startup_tuples = get_windowclause_startup_tuples(root, winclause,
+ input_tuples);
+
+ elog(DEBUG1, "startup_tuples = %g", startup_tuples); /* XXX not for commit */
+
/*
* Window functions are assumed to cost their stated execution cost, plus
* the cost of evaluating their input expressions, per tuple. Since they
@@ -2880,6 +3079,18 @@ cost_windowagg(Path *path, PlannerInfo *root,
path->rows = input_tuples;
path->startup_cost = startup_cost;
path->total_cost = total_cost;
+
+ /*
+ * Also, take into account how many tuples we need to read from the
+ * subnode in order to produce the first tuple from the WindowAgg. To do
+ * this we proportion the run cost (total cost not including startup cost)
+ * over the estimated startup tuples. We already included the startup
+ * cost of the subnode, so we only need to do this when the estimated
+ * startup tuples is above 1.0.
+ */
+ if (startup_tuples > 1.0)
+ path->startup_cost += (total_cost - startup_cost) / input_tuples *
+ (startup_tuples - 1.0);
}
/*
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 65a191ebfd..8c4dbff00a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3445,8 +3445,7 @@ create_windowagg_path(PlannerInfo *root,
*/
cost_windowagg(&pathnode->path, root,
windowFuncs,
- list_length(winclause->partitionClause),
- list_length(winclause->orderClause),
+ winclause,
subpath->startup_cost,
subpath->total_cost,
subpath->rows);
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 6cf49705d3..bee090ffc2 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -131,7 +131,7 @@ extern void cost_agg(Path *path, PlannerInfo *root,
Cost input_startup_cost, Cost input_total_cost,
double input_tuples, double input_width);
extern void cost_windowagg(Path *path, PlannerInfo *root,
- List *windowFuncs, int numPartCols, int numOrderCols,
+ List *windowFuncs, WindowClause *winclause,
Cost input_startup_cost, Cost input_total_cost,
double input_tuples);
extern void cost_group(Path *path, PlannerInfo *root,
diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out
index 1d4b78b9b2..2628033327 100644
--- a/src/test/regress/expected/window.out
+++ b/src/test/regress/expected/window.out
@@ -4808,6 +4808,80 @@ SELECT i, b, bool_and(b) OVER w, bool_or(b) OVER w
5 | t | t | t
(5 rows)
+--
+-- Test WindowAgg costing takes into account the number of rows that need to
+-- be fetched before the first row can be output.
+--
+-- Ensure we get a cheap start up plan as the WindowAgg can output the first
+-- row after reading 1 row from the join.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Limit
+ -> WindowAgg
+ -> Nested Loop
+ -> Index Only Scan using tenk1_unique1 on tenk1 t1
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 t2
+ Index Cond: (tenthous = t1.unique1)
+(6 rows)
+
+-- Ensure we get a cheap total plan. Lack of ORDER BY in the WindowClause
+-- means that all rows must be read from the join, so a cheap startup plan
+-- isn't a good choice.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER ()
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Limit
+ -> WindowAgg
+ -> Hash Join
+ Hash Cond: (t1.unique1 = t2.tenthous)
+ -> Index Only Scan using tenk1_unique1 on tenk1 t1
+ -> Hash
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 t2
+(7 rows)
+
+-- Ensure we get a cheap total plan. This time use UNBOUNDED FOLLOWING, which
+-- needs to read all join rows to output the first WindowAgg row.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Limit
+ -> WindowAgg
+ -> Merge Join
+ Merge Cond: (t1.unique1 = t2.tenthous)
+ -> Index Only Scan using tenk1_unique1 on tenk1 t1
+ -> Sort
+ Sort Key: t2.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 t2
+(8 rows)
+
+-- Ensure we get a cheap total plan. This time use 10000 FOLLOWING so we need
+-- to read all join rows.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1 ROWS BETWEEN UNBOUNDED PRECEDING AND 10000 FOLLOWING)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Limit
+ -> WindowAgg
+ -> Merge Join
+ Merge Cond: (t1.unique1 = t2.tenthous)
+ -> Index Only Scan using tenk1_unique1 on tenk1 t1
+ -> Sort
+ Sort Key: t2.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 t2
+(8 rows)
+
-- Tests for problems with failure to walk or mutate expressions
-- within window frame clauses.
-- test walker (fails with collation error if expressions are not walked)
diff --git a/src/test/regress/sql/window.sql b/src/test/regress/sql/window.sql
index 3ab6ac715d..4789de0937 100644
--- a/src/test/regress/sql/window.sql
+++ b/src/test/regress/sql/window.sql
@@ -1716,6 +1716,40 @@ SELECT i, b, bool_and(b) OVER w, bool_or(b) OVER w
FROM (VALUES (1,true), (2,true), (3,false), (4,false), (5,true)) v(i,b)
WINDOW w AS (ORDER BY i ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING);
+--
+-- Test WindowAgg costing takes into account the number of rows that need to
+-- be fetched before the first row can be output.
+--
+
+-- Ensure we get a cheap start up plan as the WindowAgg can output the first
+-- row after reading 1 row from the join.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+
+-- Ensure we get a cheap total plan. Lack of ORDER BY in the WindowClause
+-- means that all rows must be read from the join, so a cheap startup plan
+-- isn't a good choice.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER ()
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+
+-- Ensure we get a cheap total plan. This time use UNBOUNDED FOLLOWING, which
+-- needs to read all join rows to output the first WindowAgg row.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+
+-- Ensure we get a cheap total plan. This time use 10000 FOLLOWING so we need
+-- to read all join rows.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1 ROWS BETWEEN UNBOUNDED PRECEDING AND 10000 FOLLOWING)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+
-- Tests for problems with failure to walk or mutate expressions
-- within window frame clauses.
On Wed, 12 Apr 2023 at 21:03, David Rowley <dgrowleyml@gmail.com> wrote:
I'm not sure if we should consider backpatching a fix for this bug.
We tend not to commit stuff that would destabilise plans in the back
branches. On the other hand, it's fairly hard to imagine how we
could make this much worse even given bad estimates.I do think we should fix this in v16, however.
I'll add this to the "Older bugs affecting stable branches" section of
the PG 16 open items list
When I wrote the above, it was very soon after the feature freeze for
PG16. I wondered whether, since we tend not to make cost changes as
part of bug fixes (to avoid destabilising plans between minor
versions), we could instead just fix it in PG16, given the freeze had
*just* started. That's no longer the case, so I'm moving this out of
the PG16 Open Items "Live issues" section and adding a July CF entry
for it instead.
David
On Wed, 31 May 2023 at 12:59, David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 12 Apr 2023 at 21:03, David Rowley <dgrowleyml@gmail.com> wrote:
I'll add this to the "Older bugs affecting stable branches" section of
the PG 16 open items list.
When I wrote the above, it was very soon after the feature freeze for
PG16. I wondered whether, since we tend not to make cost changes as
part of bug fixes (to avoid destabilising plans between minor
versions), we could instead just fix it in PG16, given the freeze had
*just* started. That's no longer the case, so I'm moving this out of
the PG16 Open Items "Live issues" section and adding a July CF entry
for it instead.
I'm keen to move this patch along. It's not a particularly
interesting patch and I don't expect much interest in it, but I feel
it's pretty important that the planner doesn't accidentally choose a
cheap startup plan when a WindowAgg is going to fetch the entire
subplan's tuples.
I've made another pass over the patch and made a bunch of cosmetic
changes. As far as mechanical changes, I only changed the EXCLUDE
TIES and EXCLUDE GROUP behaviour when there is no ORDER BY clause in
the WindowClause. If there's no ORDER BY then subtracting 1.0 rows
seems like the right thing to do rather than what the previous patch
did.
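To make that concrete, here's a hypothetical Python paraphrase of the exclusion handling as it stands in this version of the patch (not the C code itself):

```python
# Adjust the estimated number of startup tuples for the frame's
# EXCLUDE option; 'exclusion' is None, 'current row', 'group' or 'ties'.
def apply_exclusion(return_tuples, peer_tuples, has_order_by, exclusion):
    if exclusion == 'current row':
        return return_tuples - 1.0
    if exclusion == 'group':
        # With no ORDER BY the whole partition is one peer group, so
        # excluding the group leaves (an estimated) single row.
        return return_tuples - peer_tuples if has_order_by else 1.0
    if exclusion == 'ties':
        # EXCLUDE TIES keeps the current row but drops its peers.
        return (return_tuples - (peer_tuples - 1.0)
                if has_order_by else 1.0)
    return return_tuples
```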
I (temporarily) left the DEBUG1 elog in there in case anyone wants to
test for themselves (it saves debugger use). In the absence of any
feedback, I'm
planning on just pushing it to master only tomorrow. It seems fairly
low risk and unlikely to attract too much interest since it only
affects startup costs of WindowAgg nodes. I'm currently thinking it's
a bad idea to backpatch this but I'd consider it more if someone else
thought it was a good idea or if more people came along complaining
about poor plan choice in plans containing WindowAggs. Currently, it
seems better not to destabilise plans in the back branches. (CC'd Tim,
who reported #17862, as he may have an opinion on this)
The only thought I had while looking at this again, aside from what I
changed, was whether get_windowclause_startup_tuples() should go in
selfuncs.c. I wondered if it would be neater to use
convert_numeric_to_scalar() instead of the code I had to add to
convert the (SMALL|BIG)INT Consts in <Const> FOLLOWING to double.
Aside from that reason, it seems we don't have many usages of
DEFAULT_INEQ_SEL outside of selfuncs.c. I didn't feel strongly enough
about this to actually move the function.
The updated patch is attached.
Here are the results of my testing (note that the DEBUG message
matches the COUNT(*) result in all cases apart from one, where
COUNT(*) returns 0 and the estimated tuples is 1.0).
create table ab (a int, b int);
insert into ab select a,b from generate_series(1,100) a,
generate_series(1,100) b;
analyze ab;
set client_min_messages=debug1;
# select count(*) over () from ab limit 1;
DEBUG: startup_tuples = 10000
count
-------
10000
(1 row)
# select count(*) over (partition by a) from ab limit 1;
DEBUG: startup_tuples = 100
count
-------
100
(1 row)
# select count(*) over (partition by a order by b) from ab limit 1;
DEBUG: startup_tuples = 1
count
-------
1
(1 row)
# select count(*) over (partition by a order by b rows between current
row and unbounded following) from ab limit 1;
DEBUG: startup_tuples = 100
count
-------
100
(1 row)
# select count(*) over (partition by a order by b rows between current
row and 10 following) from ab limit 1;
DEBUG: startup_tuples = 11
count
-------
11
(1 row)
# select count(*) over (partition by a order by b rows between current
row and 10 following exclude current row) from ab limit 1;
DEBUG: startup_tuples = 10
count
-------
10
(1 row)
# select count(*) over (partition by a order by b rows between current
row and 10 following exclude ties) from ab limit 1;
DEBUG: startup_tuples = 11
count
-------
11
(1 row)
# select count(*) over (partition by a order by b range between
current row and 10 following exclude ties) from ab limit 1;
DEBUG: startup_tuples = 11
count
-------
11
(1 row)
# select count(*) over (partition by a order by b range between
current row and unbounded following exclude ties) from ab limit 1;
DEBUG: startup_tuples = 100
count
-------
100
(1 row)
# select count(*) over (partition by a order by b range between
current row and unbounded following exclude group) from ab limit 1;
DEBUG: startup_tuples = 99
count
-------
99
(1 row)
# select count(*) over (partition by a order by b groups between
current row and unbounded following exclude group) from ab limit 1;
DEBUG: startup_tuples = 99
count
-------
99
(1 row)
# select count(*) over (partition by a rows between current row and
unbounded following exclude group) from ab limit 1;
DEBUG: startup_tuples = 1
count
-------
0
(1 row)
# select count(*) over (partition by a rows between current row and
unbounded following exclude ties) from ab limit 1;
DEBUG: startup_tuples = 1
count
-------
1
(1 row)
# select count(*) over (partition by a order by b rows between current
row and unbounded following exclude ties) from ab limit 1;
DEBUG: startup_tuples = 100
count
-------
100
(1 row)
# select count(*) over (partition by a order by b rows between current
row and unbounded following exclude current row) from ab limit 1;
DEBUG: startup_tuples = 99
count
-------
99
(1 row)
# select count(*) over (partition by a order by b range between
current row and 9223372036854775807 following exclude ties) from ab
limit 1;
DEBUG: startup_tuples = 100
count
-------
100
(1 row)
David
Attachments:
fix_bug_17862_v3.patch (application/octet-stream)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ef475d95a1..c39eff3f70 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2809,6 +2809,227 @@ cost_agg(Path *path, PlannerInfo *root,
path->total_cost = total_cost;
}
+/*
+ * get_windowclause_startup_tuples
+ * Estimate how many tuples we'll need to fetch from a WindowAgg's
+ * subnode before we can output the first WindowAgg tuple.
+ *
+ * How many tuples need to be read depends on the WindowClause. For example,
+ * a WindowClause with no PARTITION BY and no ORDER BY requires that all
+ * subnode tuples are read and aggregated before the WindowAgg can output
+ * anything. If there's a PARTITION BY, then we only need to look at tuples
+ * in the first partition. Here we attempt to estimate just how many
+ * 'input_tuples' the WindowAgg will need to read for the given WindowClause
+ * before the first tuple can be output.
+ */
+static double
+get_windowclause_startup_tuples(PlannerInfo *root, WindowClause *wc,
+ double input_tuples)
+{
+ int frameOptions = wc->frameOptions;
+ double partition_tuples;
+ double return_tuples;
+ double peer_tuples;
+
+ /*
+ * First, figure out how many partitions there are likely to be and set
+ * partition_tuples according to that estimate.
+ */
+ if (wc->partitionClause != NIL)
+ {
+ double num_partitions;
+ List *partexprs = get_sortgrouplist_exprs(wc->partitionClause,
+ root->parse->targetList);
+
+ num_partitions = estimate_num_groups(root, partexprs, input_tuples,
+ NULL, NULL);
+ list_free(partexprs);
+
+ partition_tuples = input_tuples / num_partitions;
+ }
+ else
+ {
+ /* all tuples belong to the same partition */
+ partition_tuples = input_tuples;
+ }
+
+ /* estimate the number of tuples in each peer group */
+ if (wc->orderClause != NIL)
+ {
+ double num_groups;
+ List *orderexprs;
+
+ orderexprs = get_sortgrouplist_exprs(wc->orderClause,
+ root->parse->targetList);
+
+ /* estimate out how many peer groups there are in the partition */
+ num_groups = estimate_num_groups(root, orderexprs,
+ partition_tuples, NULL,
+ NULL);
+ list_free(orderexprs);
+ peer_tuples = partition_tuples / num_groups;
+ }
+ else
+ {
+ /* no ORDER BY so only 1 tuple belongs in each peer group */
+ peer_tuples = 1.0;
+ }
+
+ if (frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING)
+ {
+ /* include all partition rows */
+ return_tuples = partition_tuples;
+ }
+ else if (frameOptions & FRAMEOPTION_END_CURRENT_ROW)
+ {
+ if (frameOptions & FRAMEOPTION_ROWS)
+ {
+ /* just count the current row */
+ return_tuples = 1.0;
+ }
+ else if (frameOptions & (FRAMEOPTION_RANGE | FRAMEOPTION_GROUPS))
+ {
+ /*
+ * When in RANGE/GROUPS mode, it's more complex. If there's no
+ * ORDER BY, then all rows in the partition are peers, otherwise
+ * we'll need to read the first group of peers.
+ */
+ if (wc->orderClause == NIL)
+ return_tuples = partition_tuples;
+ else
+ return_tuples = peer_tuples;
+ }
+ else
+ {
+ /*
+ * Something new we don't support yet? This needs attention. We'll just
+ * return 1.0 in the meantime.
+ */
+ Assert(false);
+ return_tuples = 1.0;
+ }
+ }
+ else if (frameOptions & FRAMEOPTION_END_OFFSET_PRECEDING)
+ {
+ /*
+ * BETWEEN ... AND N PRECEDING will only need to read the WindowAgg's
+ * subnode after N ROWS/RANGES/GROUPS. N can be 0, but not negative,
+ * so we'll just assume only the current row needs to be read to fetch
+ * the first WindowAgg row.
+ */
+ return_tuples = 1.0;
+ }
+ else if (frameOptions & FRAMEOPTION_END_OFFSET_FOLLOWING)
+ {
+ Const *endOffset = (Const *) wc->endOffset;
+ double end_offset_value;
+
+ /* try and figure out the value specified in the endOffset. */
+ if (IsA(endOffset, Const))
+ {
+ if (endOffset->constisnull)
+ {
+ /*
+ * NULLs are not allowed, but currently, there's no code to
+ * error out if there's a NULL Const. We'll only discover
+ * this during execution. For now, just pretend everything is
+ * fine and assume that just the current row/range/group will
+ * be needed.
+ */
+ end_offset_value = 1.0;
+ }
+ else
+ {
+ switch (endOffset->consttype)
+ {
+ case INT2OID:
+ end_offset_value =
+ (double) DatumGetInt16(endOffset->constvalue);
+ break;
+ case INT4OID:
+ end_offset_value =
+ (double) DatumGetInt32(endOffset->constvalue);
+ break;
+ case INT8OID:
+ end_offset_value =
+ (double) DatumGetInt64(endOffset->constvalue);
+ break;
+ default:
+ end_offset_value =
+ partition_tuples / peer_tuples *
+ DEFAULT_INEQ_SEL;
+ break;
+ }
+ }
+ }
+ else
+ {
+ /*
+ * When the end bound is not a Const, we'll just need to guess. We
+ * just make use of DEFAULT_INEQ_SEL.
+ */
+ end_offset_value =
+ partition_tuples / peer_tuples * DEFAULT_INEQ_SEL;
+ }
+
+ if (frameOptions & FRAMEOPTION_ROWS)
+ {
+ /* include the N FOLLOWING and the current row */
+ return_tuples = end_offset_value + 1.0;
+ }
+ else if (frameOptions & (FRAMEOPTION_RANGE | FRAMEOPTION_GROUPS))
+ {
+ /* include N FOLLOWING ranges/group and the initial range/group */
+ return_tuples = peer_tuples * (end_offset_value + 1.0);
+ }
+ else
+ {
+ /*
+ * Something new we don't support yet? This needs attention.
+ * We'll just return 1.0 in the meantime.
+ */
+ Assert(false);
+ return_tuples = 1.0;
+ }
+ }
+ else
+ {
+ /*
+ * Something new we don't support yet? This needs attention. We'll
+ * just return 1.0 in the meantime.
+ */
+ Assert(false);
+ return_tuples = 1.0;
+ }
+
+ /*
+ * Cap the return value so it's never higher than the expected tuples in
+ * the partition. This must be done before processing the exclusions as
+ * the exclusions are eliminated from the start of the frame, not the end.
+ */
+ return_tuples = Min(return_tuples, partition_tuples);
+
+ /* process any exclusions */
+ if (frameOptions & FRAMEOPTION_EXCLUDE_CURRENT_ROW)
+ return_tuples -= 1.0;
+ else if (frameOptions & FRAMEOPTION_EXCLUDE_GROUP)
+ {
+ if (wc->orderClause != NIL)
+ return_tuples -= peer_tuples;
+ else
+ return_tuples = 1.0;
+ }
+ else if (frameOptions & FRAMEOPTION_EXCLUDE_TIES)
+ {
+ if (wc->orderClause != NIL)
+ return_tuples -= (peer_tuples - 1.0);
+ else
+ return_tuples = 1.0;
+ }
+
+ return clamp_row_est(return_tuples);
+}
+
/*
* cost_windowagg
* Determines and returns the cost of performing a WindowAgg plan node,
@@ -2818,17 +3039,32 @@ cost_agg(Path *path, PlannerInfo *root,
*/
void
cost_windowagg(Path *path, PlannerInfo *root,
- List *windowFuncs, int numPartCols, int numOrderCols,
+ List *windowFuncs, WindowClause *winclause,
Cost input_startup_cost, Cost input_total_cost,
double input_tuples)
{
Cost startup_cost;
Cost total_cost;
+ double startup_tuples;
+ int numPartCols;
+ int numOrderCols;
ListCell *lc;
+ numPartCols = list_length(winclause->partitionClause);
+ numOrderCols = list_length(winclause->orderClause);
+
startup_cost = input_startup_cost;
total_cost = input_total_cost;
+ /*
+ * Estimate how many tuples we'll need to read from the subnode before we
+ * can output the first WindowAgg row.
+ */
+ startup_tuples = get_windowclause_startup_tuples(root, winclause,
+ input_tuples);
+
+ elog(DEBUG1, "startup_tuples = %g", startup_tuples); /* XXX not for commit */
+
/*
* Window functions are assumed to cost their stated execution cost, plus
* the cost of evaluating their input expressions, per tuple. Since they
@@ -2880,6 +3116,18 @@ cost_windowagg(Path *path, PlannerInfo *root,
path->rows = input_tuples;
path->startup_cost = startup_cost;
path->total_cost = total_cost;
+
+ /*
+ * Also, take into account how many tuples we need to read from the
+ * subnode in order to produce the first tuple from the WindowAgg. To do
+ * this we proportion the run cost (total cost not including startup cost)
+ * over the estimated startup tuples. We already included the startup
+ * cost of the subnode, so we only need to do this when the estimated
+ * startup tuples is above 1.0.
+ */
+ if (startup_tuples > 1.0)
+ path->startup_cost += (total_cost - startup_cost) / input_tuples *
+ (startup_tuples - 1.0);
}
/*
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f123fcb41e..754f0b9f34 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3457,8 +3457,7 @@ create_windowagg_path(PlannerInfo *root,
*/
cost_windowagg(&pathnode->path, root,
windowFuncs,
- list_length(winclause->partitionClause),
- list_length(winclause->orderClause),
+ winclause,
subpath->startup_cost,
subpath->total_cost,
subpath->rows);
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 6cf49705d3..bee090ffc2 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -131,7 +131,7 @@ extern void cost_agg(Path *path, PlannerInfo *root,
Cost input_startup_cost, Cost input_total_cost,
double input_tuples, double input_width);
extern void cost_windowagg(Path *path, PlannerInfo *root,
- List *windowFuncs, int numPartCols, int numOrderCols,
+ List *windowFuncs, WindowClause *winclause,
Cost input_startup_cost, Cost input_total_cost,
double input_tuples);
extern void cost_group(Path *path, PlannerInfo *root,
diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out
index 1d4b78b9b2..2628033327 100644
--- a/src/test/regress/expected/window.out
+++ b/src/test/regress/expected/window.out
@@ -4808,6 +4808,80 @@ SELECT i, b, bool_and(b) OVER w, bool_or(b) OVER w
5 | t | t | t
(5 rows)
+--
+-- Test WindowAgg costing takes into account the number of rows that need to
+-- be fetched before the first row can be output.
+--
+-- Ensure we get a cheap start up plan as the WindowAgg can output the first
+-- row after reading 1 row from the join.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Limit
+ -> WindowAgg
+ -> Nested Loop
+ -> Index Only Scan using tenk1_unique1 on tenk1 t1
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 t2
+ Index Cond: (tenthous = t1.unique1)
+(6 rows)
+
+-- Ensure we get a cheap total plan. Lack of ORDER BY in the WindowClause
+-- means that all rows must be read from the join, so a cheap startup plan
+-- isn't a good choice.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER ()
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Limit
+ -> WindowAgg
+ -> Hash Join
+ Hash Cond: (t1.unique1 = t2.tenthous)
+ -> Index Only Scan using tenk1_unique1 on tenk1 t1
+ -> Hash
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 t2
+(7 rows)
+
+-- Ensure we get a cheap total plan. This time use UNBOUNDED FOLLOWING, which
+-- needs to read all join rows to output the first WindowAgg row.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Limit
+ -> WindowAgg
+ -> Merge Join
+ Merge Cond: (t1.unique1 = t2.tenthous)
+ -> Index Only Scan using tenk1_unique1 on tenk1 t1
+ -> Sort
+ Sort Key: t2.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 t2
+(8 rows)
+
+-- Ensure we get a cheap total plan. This time use 10000 FOLLOWING so we need
+-- to read all join rows.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1 ROWS BETWEEN UNBOUNDED PRECEDING AND 10000 FOLLOWING)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Limit
+ -> WindowAgg
+ -> Merge Join
+ Merge Cond: (t1.unique1 = t2.tenthous)
+ -> Index Only Scan using tenk1_unique1 on tenk1 t1
+ -> Sort
+ Sort Key: t2.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 t2
+(8 rows)
+
-- Tests for problems with failure to walk or mutate expressions
-- within window frame clauses.
-- test walker (fails with collation error if expressions are not walked)
diff --git a/src/test/regress/sql/window.sql b/src/test/regress/sql/window.sql
index 3ab6ac715d..4789de0937 100644
--- a/src/test/regress/sql/window.sql
+++ b/src/test/regress/sql/window.sql
@@ -1716,6 +1716,40 @@ SELECT i, b, bool_and(b) OVER w, bool_or(b) OVER w
FROM (VALUES (1,true), (2,true), (3,false), (4,false), (5,true)) v(i,b)
WINDOW w AS (ORDER BY i ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING);
+--
+-- Test WindowAgg costing takes into account the number of rows that need to
+-- be fetched before the first row can be output.
+--
+
+-- Ensure we get a cheap start up plan as the WindowAgg can output the first
+-- row after reading 1 row from the join.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+
+-- Ensure we get a cheap total plan. Lack of ORDER BY in the WindowClause
+-- means that all rows must be read from the join, so a cheap startup plan
+-- isn't a good choice.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER ()
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+
+-- Ensure we get a cheap total plan. This time use UNBOUNDED FOLLOWING, which
+-- needs to read all join rows to output the first WindowAgg row.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+
+-- Ensure we get a cheap total plan. This time use 10000 FOLLOWING so we need
+-- to read all join rows.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1 ROWS BETWEEN UNBOUNDED PRECEDING AND 10000 FOLLOWING)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+
-- Tests for problems with failure to walk or mutate expressions
-- within window frame clauses.
Hi David:
Sorry for the last-minute feedback! I studied the patch and found the
following cases.
1. ORDER BY or PARTITION BY
select *, count(two) over (order by unique1) from tenk1 limit 1;
DEBUG: startup_tuples = 1
DEBUG: startup_tuples = 1
select *, count(two) over (partition by unique1) from tenk1 limit 1;
DEBUG: startup_tuples = 1
DEBUG: startup_tuples = 1
Because of how the nodeWindowAgg executor works, we have to fetch
tuples until one mismatches the current peer group before we can
compute the WindowAgg function. The current patch doesn't count that
mismatched tuple. I verified this by setting a breakpoint at the
IndexNext function and saw it called twice, so in the above case
shouldn't startup_tuples be 2?
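To illustrate the point, here is a toy Python model (not PostgreSQL code) of how the executor discovers the end of the first peer group:

```python
# Toy model: count tuples fetched from the subnode before the first
# window row (with a default frame) can be produced.  The executor
# only knows the first peer group has ended once it reads a tuple
# whose key does not match.
def tuples_read_for_first_row(rows, key):
    fetched = 0
    first_key = None
    for row in rows:
        fetched += 1
        if first_key is None:
            first_key = key(row)
        elif key(row) != first_key:
            break  # the mismatching tuple ends the first peer group
    return fetched

# Two rows share key 1; the third fetch is the mismatching tuple.
tuples_read_for_first_row([(1,), (1,), (2,), (2,)],
                          key=lambda r: r[0])  # returns 3
```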
2. ORDER BY and PARTITION BY
select two, hundred,
count(two) over (partition by ten order by hundred)
from tenk1 limit 1;
DEBUG: startup_tuples = 10
two | hundred | count
-----+---------+-------
0 | 0 | 100
If we count the mismatched tuple, shouldn't it be 101?
3. As we can see, startup_tuples is sometimes logged twice. The
reason is that it is computed in cost_windowagg, which is called for
every create_one_window_path. Since startup_tuples should be
independent of the physical path, maybe we can cache it somewhere to
save some planning cycles?
Thanks for the patch!
--
Best Regards
Andy Fan
Thanks for having a look at this.
On Thu, 3 Aug 2023 at 18:49, Andy Fan <zhihui.fan1213@gmail.com> wrote:
1. ORDER BY or PARTITION BY
select *, count(two) over (order by unique1) from tenk1 limit 1;
DEBUG: startup_tuples = 1
DEBUG: startup_tuples = 1
select *, count(two) over (partition by unique1) from tenk1 limit 1;
DEBUG: startup_tuples = 1
DEBUG: startup_tuples = 1
Due to the Executor of nodeWindowAgg, we have to fetch the next tuple
until it mismatches with the current one, then we can calculate the
WindowAgg function. In the current patch, we didn't count the
mismatched tuple. I verified my thought with 'break at IndexNext'
function and see IndexNext is called twice, so in the above case the
startup_tuples should be 2?
You're probably right here. I'd considered that it wasn't that
critical and aimed to keep the estimate close to the number of rows
that'll be aggregated. That's probably not the best thing to do:
consider the EXCLUDE options, which just exclude tuples from
aggregation; they don't mean we read fewer tuples from the subnode.
I've updated the patch accordingly.
2. ORDER BY and PARTITION BY
select two, hundred,
count(two) over (partition by ten order by hundred)
from tenk1 limit 1;
DEBUG: startup_tuples = 10
two | hundred | count
-----+---------+-------
0 | 0 | 100
If we consider the mismatched tuples, it should be 101?
I don't really see how we could do better with the current level of
statistics. The stats don't know that there are only 10 distinct
"hundred" values for rows which have ten=1. All we have is n_distinct
on tenk1.hundred, which is 100.
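Concretely, the arithmetic the planner can do with those statistics looks roughly like this. This is a back-of-envelope Python sketch using tenk1's table-wide n_distinct values; the real estimate_num_groups applies more clamping than shown:

```python
# Planner-visible statistics for tenk1.
input_tuples = 10000.0
n_distinct_ten = 10.0       # tenk1.ten
n_distinct_hundred = 100.0  # tenk1.hundred (table-wide)

# PARTITION BY ten: average rows per partition.
partition_tuples = input_tuples / n_distinct_ten               # 1000.0

# ORDER BY hundred: the planner must assume all 100 values can appear
# within a partition, so it estimates small peer groups...
estimated_peer_tuples = partition_tuples / n_distinct_hundred  # 10.0

# ...but in reality each "ten" partition contains only 10 of the 100
# "hundred" values, so the first peer group actually has 100 rows.
actual_peer_tuples = partition_tuples / 10.0                   # 100.0
```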
3. As we can see the log for startup_tuples is logged twice sometimes,
the reason is because it is used in cost_windowagg, so it is calculated
for every create_one_window_path. I think the startup_tuples should be
independent with the physical path, maybe we can cache it somewhere to
save some planning cycles?
I wondered about that too but I ended up writing off the idea of
caching because the input_tuple count comes from the Path and the
extra calls are coming from other Paths, which could well have some
completely different value for input_tuples.
David
Attachments:
fix_bug_17862_v4.patch (application/octet-stream)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ef475d95a1..cdbc0b5a50 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2809,6 +2809,226 @@ cost_agg(Path *path, PlannerInfo *root,
path->total_cost = total_cost;
}
+/*
+ * get_windowclause_startup_tuples
+ * Estimate how many tuples we'll need to fetch from a WindowAgg's
+ * subnode before we can output the first WindowAgg tuple.
+ *
+ * How many tuples need to be read depends on the WindowClause. For example,
+ * a WindowClause with no PARTITION BY and no ORDER BY requires that all
+ * subnode tuples are read and aggregated before the WindowAgg can output
+ * anything. If there's a PARTITION BY, then we only need to look at tuples
+ * in the first partition. Here we attempt to estimate just how many
+ * 'input_tuples' the WindowAgg will need to read for the given WindowClause
+ * before the first tuple can be output.
+ */
+static double
+get_windowclause_startup_tuples(PlannerInfo *root, WindowClause *wc,
+ double input_tuples)
+{
+ int frameOptions = wc->frameOptions;
+ double partition_tuples;
+ double return_tuples;
+ double peer_tuples;
+
+ /*
+ * First, figure out how many partitions there are likely to be and set
+ * partition_tuples according to that estimate.
+ */
+ if (wc->partitionClause != NIL)
+ {
+ double num_partitions;
+ List *partexprs = get_sortgrouplist_exprs(wc->partitionClause,
+ root->parse->targetList);
+
+ num_partitions = estimate_num_groups(root, partexprs, input_tuples,
+ NULL, NULL);
+ list_free(partexprs);
+
+ partition_tuples = input_tuples / num_partitions;
+ }
+ else
+ {
+ /* all tuples belong to the same partition */
+ partition_tuples = input_tuples;
+ }
+
+ /* estimate the number of tuples in each peer group */
+ if (wc->orderClause != NIL)
+ {
+ double num_groups;
+ List *orderexprs;
+
+ orderexprs = get_sortgrouplist_exprs(wc->orderClause,
+ root->parse->targetList);
+
+ /* estimate out how many peer groups there are in the partition */
+ num_groups = estimate_num_groups(root, orderexprs,
+ partition_tuples, NULL,
+ NULL);
+ list_free(orderexprs);
+ peer_tuples = partition_tuples / num_groups;
+ }
+ else
+ {
+ /* no ORDER BY so only 1 tuple belongs in each peer group */
+ peer_tuples = 1.0;
+ }
+
+ if (frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING)
+ {
+ /* include all partition rows */
+ return_tuples = partition_tuples;
+ }
+ else if (frameOptions & FRAMEOPTION_END_CURRENT_ROW)
+ {
+ if (frameOptions & FRAMEOPTION_ROWS)
+ {
+ /* just count the current row */
+ return_tuples = 1.0;
+ }
+ else if (frameOptions & (FRAMEOPTION_RANGE | FRAMEOPTION_GROUPS))
+ {
+ /*
+ * When in RANGE/GROUPS mode, it's more complex. If there's no
+ * ORDER BY, then all rows in the partition are peers, otherwise
+ * we'll need to read the first group of peers.
+ */
+ if (wc->orderClause == NIL)
+ return_tuples = partition_tuples;
+ else
+ return_tuples = peer_tuples;
+ }
+ else
+ {
+ /*
+ * Something new we don't support yet? This needs attention. We'll just
+ * return 1.0 in the meantime.
+ */
+ Assert(false);
+ return_tuples = 1.0;
+ }
+ }
+ else if (frameOptions & FRAMEOPTION_END_OFFSET_PRECEDING)
+ {
+ /*
+ * BETWEEN ... AND N PRECEDING will only need to read the WindowAgg's
+ * subnode after N ROWS/RANGES/GROUPS. N can be 0, but not negative,
+ * so we'll just assume only the current row needs to be read to fetch
+ * the first WindowAgg row.
+ */
+ return_tuples = 1.0;
+ }
+ else if (frameOptions & FRAMEOPTION_END_OFFSET_FOLLOWING)
+ {
+ Const *endOffset = (Const *) wc->endOffset;
+ double end_offset_value;
+
+ /* try and figure out the value specified in the endOffset. */
+ if (IsA(endOffset, Const))
+ {
+ if (endOffset->constisnull)
+ {
+ /*
+ * NULLs are not allowed, but currently, there's no code to
+ * error out if there's a NULL Const. We'll only discover
+ * this during execution. For now, just pretend everything is
+ * fine and assume that just the current row/range/group will
+ * be needed.
+ */
+ end_offset_value = 1.0;
+ }
+ else
+ {
+ switch (endOffset->consttype)
+ {
+ case INT2OID:
+ end_offset_value =
+ (double) DatumGetInt16(endOffset->constvalue);
+ break;
+ case INT4OID:
+ end_offset_value =
+ (double) DatumGetInt32(endOffset->constvalue);
+ break;
+ case INT8OID:
+ end_offset_value =
+ (double) DatumGetInt64(endOffset->constvalue);
+ break;
+ default:
+ end_offset_value =
+ partition_tuples / peer_tuples *
+ DEFAULT_INEQ_SEL;
+ break;
+ }
+ }
+ }
+ else
+ {
+ /*
+ * When the end bound is not a Const, we'll just need to guess. We
+ * just make use of DEFAULT_INEQ_SEL.
+ */
+ end_offset_value =
+ partition_tuples / peer_tuples * DEFAULT_INEQ_SEL;
+ }
+
+ if (frameOptions & FRAMEOPTION_ROWS)
+ {
+ /* include the N FOLLOWING and the current row */
+ return_tuples = end_offset_value + 1.0;
+ }
+ else if (frameOptions & (FRAMEOPTION_RANGE | FRAMEOPTION_GROUPS))
+ {
+ /* include N FOLLOWING ranges/group and the initial range/group */
+ return_tuples = peer_tuples * (end_offset_value + 1.0);
+ }
+ else
+ {
+ /*
+ * Something new we don't support yet? This needs attention.
+ * We'll just return 1.0 in the meantime.
+ */
+ Assert(false);
+ return_tuples = 1.0;
+ }
+ }
+ else
+ {
+ /*
+ * Something new we don't support yet? This needs attention. We'll
+ * just return 1.0 in the meantime.
+ */
+ Assert(false);
+ return_tuples = 1.0;
+ }
+
+ if (wc->partitionClause != NIL || wc->orderClause != NIL)
+ {
+ /*
+ * Cap the return value to the estimated partition tuples and account
+ * for the extra tuple WindowAgg will need to read to confirm the next
+ * tuple does not belong to the same partition or peer group.
+ */
+ return_tuples = Min(return_tuples + 1.0, partition_tuples);
+ }
+ else
+ {
+ /*
+ * Cap the return value so it's never higher than the expected tuples
+ * in the partition.
+ */
+ return_tuples = Min(return_tuples, partition_tuples);
+ }
+
+ /*
+ * We needn't worry about any EXCLUDE options as those only exclude rows
+ * from being aggregated, not from being read from the WindowAgg's
+ * subnode.
+ */
+
+ return clamp_row_est(return_tuples);
+}
+
/*
* cost_windowagg
* Determines and returns the cost of performing a WindowAgg plan node,
@@ -2818,17 +3038,32 @@ cost_agg(Path *path, PlannerInfo *root,
*/
void
cost_windowagg(Path *path, PlannerInfo *root,
- List *windowFuncs, int numPartCols, int numOrderCols,
+ List *windowFuncs, WindowClause *winclause,
Cost input_startup_cost, Cost input_total_cost,
double input_tuples)
{
Cost startup_cost;
Cost total_cost;
+ double startup_tuples;
+ int numPartCols;
+ int numOrderCols;
ListCell *lc;
+ numPartCols = list_length(winclause->partitionClause);
+ numOrderCols = list_length(winclause->orderClause);
+
startup_cost = input_startup_cost;
total_cost = input_total_cost;
+ /*
+ * Estimate how many tuples we'll need to read from the subnode before we
+ * can output the first WindowAgg row.
+ */
+ startup_tuples = get_windowclause_startup_tuples(root, winclause,
+ input_tuples);
+
+ elog(DEBUG1, "startup_tuples = %g", startup_tuples); /* XXX not for commit */
+
/*
* Window functions are assumed to cost their stated execution cost, plus
* the cost of evaluating their input expressions, per tuple. Since they
@@ -2880,6 +3115,18 @@ cost_windowagg(Path *path, PlannerInfo *root,
path->rows = input_tuples;
path->startup_cost = startup_cost;
path->total_cost = total_cost;
+
+ /*
+ * Also, take into account how many tuples we need to read from the
+ * subnode in order to produce the first tuple from the WindowAgg. To do
+ * this we proportion the run cost (total cost not including startup cost)
+ * over the estimated startup tuples. We already included the startup
+ * cost of the subnode, so we only need to do this when the estimated
+ * startup tuples is above 1.0.
+ */
+ if (startup_tuples > 1.0)
+ path->startup_cost += (total_cost - startup_cost) / input_tuples *
+ (startup_tuples - 1.0);
}
/*
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f123fcb41e..754f0b9f34 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3457,8 +3457,7 @@ create_windowagg_path(PlannerInfo *root,
*/
cost_windowagg(&pathnode->path, root,
windowFuncs,
- list_length(winclause->partitionClause),
- list_length(winclause->orderClause),
+ winclause,
subpath->startup_cost,
subpath->total_cost,
subpath->rows);
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 6cf49705d3..bee090ffc2 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -131,7 +131,7 @@ extern void cost_agg(Path *path, PlannerInfo *root,
Cost input_startup_cost, Cost input_total_cost,
double input_tuples, double input_width);
extern void cost_windowagg(Path *path, PlannerInfo *root,
- List *windowFuncs, int numPartCols, int numOrderCols,
+ List *windowFuncs, WindowClause *winclause,
Cost input_startup_cost, Cost input_total_cost,
double input_tuples);
extern void cost_group(Path *path, PlannerInfo *root,
diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out
index 1d4b78b9b2..2628033327 100644
--- a/src/test/regress/expected/window.out
+++ b/src/test/regress/expected/window.out
@@ -4808,6 +4808,80 @@ SELECT i, b, bool_and(b) OVER w, bool_or(b) OVER w
5 | t | t | t
(5 rows)
+--
+-- Test WindowAgg costing takes into account the number of rows that need to
+-- be fetched before the first row can be output.
+--
+-- Ensure we get a cheap start up plan as the WindowAgg can output the first
+-- row after reading 1 row from the join.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Limit
+ -> WindowAgg
+ -> Nested Loop
+ -> Index Only Scan using tenk1_unique1 on tenk1 t1
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 t2
+ Index Cond: (tenthous = t1.unique1)
+(6 rows)
+
+-- Ensure we get a cheap total plan. Lack of ORDER BY in the WindowClause
+-- means that all rows must be read from the join, so a cheap startup plan
+-- isn't a good choice.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER ()
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Limit
+ -> WindowAgg
+ -> Hash Join
+ Hash Cond: (t1.unique1 = t2.tenthous)
+ -> Index Only Scan using tenk1_unique1 on tenk1 t1
+ -> Hash
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 t2
+(7 rows)
+
+-- Ensure we get a cheap total plan. This time use UNBOUNDED FOLLOWING, which
+-- needs to read all join rows to output the first WindowAgg row.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Limit
+ -> WindowAgg
+ -> Merge Join
+ Merge Cond: (t1.unique1 = t2.tenthous)
+ -> Index Only Scan using tenk1_unique1 on tenk1 t1
+ -> Sort
+ Sort Key: t2.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 t2
+(8 rows)
+
+-- Ensure we get a cheap total plan. This time use 10000 FOLLOWING so we need
+-- to read all join rows.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1 ROWS BETWEEN UNBOUNDED PRECEDING AND 10000 FOLLOWING)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------
+ Limit
+ -> WindowAgg
+ -> Merge Join
+ Merge Cond: (t1.unique1 = t2.tenthous)
+ -> Index Only Scan using tenk1_unique1 on tenk1 t1
+ -> Sort
+ Sort Key: t2.tenthous
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 t2
+(8 rows)
+
-- Tests for problems with failure to walk or mutate expressions
-- within window frame clauses.
-- test walker (fails with collation error if expressions are not walked)
diff --git a/src/test/regress/sql/window.sql b/src/test/regress/sql/window.sql
index 3ab6ac715d..4789de0937 100644
--- a/src/test/regress/sql/window.sql
+++ b/src/test/regress/sql/window.sql
@@ -1716,6 +1716,40 @@ SELECT i, b, bool_and(b) OVER w, bool_or(b) OVER w
FROM (VALUES (1,true), (2,true), (3,false), (4,false), (5,true)) v(i,b)
WINDOW w AS (ORDER BY i ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING);
+--
+-- Test WindowAgg costing takes into account the number of rows that need to
+-- be fetched before the first row can be output.
+--
+
+-- Ensure we get a cheap start up plan as the WindowAgg can output the first
+-- row after reading 1 row from the join.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+
+-- Ensure we get a cheap total plan. Lack of ORDER BY in the WindowClause
+-- means that all rows must be read from the join, so a cheap startup plan
+-- isn't a good choice.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER ()
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+
+-- Ensure we get a cheap total plan. This time use UNBOUNDED FOLLOWING, which
+-- needs to read all join rows to output the first WindowAgg row.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+
+-- Ensure we get a cheap total plan. This time use 10000 FOLLOWING so we need
+-- to read all join rows.
+EXPLAIN (COSTS OFF)
+SELECT COUNT(*) OVER (ORDER BY t1.unique1 ROWS BETWEEN UNBOUNDED PRECEDING AND 10000 FOLLOWING)
+FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
+LIMIT 1;
+
-- Tests for problems with failure to walk or mutate expressions
-- within window frame clauses.
On Thu, Aug 3, 2023 at 7:29 PM David Rowley <dgrowleyml@gmail.com> wrote:
Thanks for having a look at this.
On Thu, 3 Aug 2023 at 18:49, Andy Fan <zhihui.fan1213@gmail.com> wrote:
1. ORDER BY or PARTITION BY
select *, count(two) over (order by unique1) from tenk1 limit 1;
DEBUG: startup_tuples = 1
DEBUG: startup_tuples = 1
select *, count(two) over (partition by unique1) from tenk1 limit 1;
DEBUG: startup_tuples = 1
DEBUG: startup_tuples = 1
Due to how the executor in nodeWindowAgg works, we have to fetch the
next tuple until it mismatches the current one before we can calculate
the WindowAgg function. In the current patch, we don't count that
mismatched tuple. I verified my thought by breaking at the IndexNext
function and saw IndexNext called twice, so in the above case shouldn't
startup_tuples be 2?
You're probably right here. I'd considered that it wasn't that
critical and aimed to keep the estimate close to the number of rows
that'll be aggregated. I now think that's probably not the best thing
to do: if you consider the EXCLUDE options, those just exclude tuples
from aggregation; they don't mean we read fewer tuples from the
subnode. I've updated the patch accordingly.
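To illustrate the updated capping step, here's a minimal standalone C sketch of the arithmetic (function and parameter names are invented for illustration, not taken from the patch):

```c
#include <assert.h>

/*
 * Sketch of the capping step discussed above: when the WindowClause has
 * a PARTITION BY or ORDER BY, WindowAgg must read one extra tuple from
 * its subnode to confirm that the next tuple does not belong to the
 * same partition or peer group, so 1.0 is added before clamping to the
 * estimated partition tuple count.
 */
static double
cap_startup_tuples(double return_tuples, double partition_tuples,
				   int has_partition_or_order_clause)
{
	double		capped;

	if (has_partition_or_order_clause)
		capped = return_tuples + 1.0;	/* the lookahead tuple */
	else
		capped = return_tuples;

	return (capped < partition_tuples) ? capped : partition_tuples;
}
```

So for the `count(two) over (order by unique1) ... limit 1` example, the estimate becomes 2 rather than 1, matching the two IndexNext calls observed in the debugger.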
Thanks.
2. ORDER BY and PARTITION BY
select two, hundred,
count(two) over (partition by ten order by hundred)
from tenk1 limit 1;
DEBUG: startup_tuples = 10
two | hundred | count
-----+---------+-------
0 | 0 | 100
If we consider the mismatched tuples, it should be 101?
I don't really see how we could do better with the current level of
statistics. The stats don't know that there are only 10 distinct
"hundred" values for rows which have ten=1. All we have is n_distinct
on tenk1.hundred, which is 100.
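In other words, with only per-column n_distinct available, the estimate works out roughly like this (a hedged sketch with invented names; the real code derives the group counts via the statistics machinery rather than taking raw n_distinct values):

```c
#include <assert.h>

/*
 * Rough sketch of why the example reports startup_tuples = 10: the
 * planner assumes the partitioning column's distinct values divide the
 * input evenly, and likewise for peer groups within a partition.  For
 * tenk1: 10000 rows / n_distinct(ten) = 10 gives 1000 rows per
 * partition, and 1000 / n_distinct(hundred) = 100 gives 10 rows per
 * peer group -- even though rows with ten = 0 actually have only 10
 * distinct "hundred" values.
 */
static double
estimated_peer_tuples(double input_tuples, double partition_ndistinct,
					  double order_ndistinct)
{
	double		partition_tuples = input_tuples / partition_ndistinct;

	return partition_tuples / order_ndistinct;
}
```

The per-partition statistics needed to do better here simply don't exist, so 10 (or 11 with the lookahead tuple) is the best estimate available.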
Yes, actually I hadn't figured that out either before or after my posting.
3. As we can see, the log line for startup_tuples is sometimes emitted
twice. The reason is that it is computed in cost_windowagg, so it is
calculated for every create_one_window_path call. I think
startup_tuples should be independent of the physical path; maybe we
can cache it somewhere to save some planning cycles?
I wondered about that too, but I ended up writing off the idea of
caching because the input_tuple count comes from the Path and the
extra calls are coming from other Paths, which could well have some
completely different value for input_tuples.
Looks reasonable.
I have checked the updated patch and LGTM.
--
Best Regards
Andy Fan
On Thu, 3 Aug 2023 at 05:50, David Rowley <dgrowleyml@gmail.com> wrote:
I'm currently thinking it's
a bad idea to backpatch this but I'd consider it more if someone else
thought it was a good idea or if more people came along complaining
about poor plan choice in plans containing WindowAggs. Currently, it
seems better not to destabilise plans in the back branches. (CC'd Tim,
who reported #17862, as he may have an opinion on this)
I agree it's better not to destabilise plans in the back branches.
Tim
On Fri, 4 Aug 2023 at 02:02, Andy Fan <zhihui.fan1213@gmail.com> wrote:
I have checked the updated patch and LGTM.
Thank you for reviewing. I've pushed the patch to master only.
David
On Fri, Aug 04, 2023 at 09:28:51AM +1200, David Rowley wrote:
Thank you for reviewing. I've pushed the patch to master only.
I'm seeing some reliable test failures for 32-bit builds on cfbot [0].
At a glance, it looks like the relations are swapped in the plan.

[0] https://api.cirrus-ci.com/v1/artifact/task/5728127981191168/testrun/build-32/testrun/regress/regress/regression.diffs
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes:
On Fri, Aug 04, 2023 at 09:28:51AM +1200, David Rowley wrote:
Thank you for reviewing. I've pushed the patch to master only.
I'm seeing some reliable test failures for 32-bit builds on cfbot [0]. At
a glance, it looks like the relations are swapped in the plan.
Yeah, I got the same result in a 32-bit FreeBSD VM. Probably, the two
plans are of effectively-identical estimated cost, and there's some
roundoff effect in those estimates that differs between machines with
4-byte and 8-byte MAXALIGN.
You could likely stabilize the plan choice by joining two tables that
aren't of identical size -- maybe add an additional WHERE constraint
on one of the tables?
regards, tom lane
On Fri, 4 Aug 2023 at 11:54, Nathan Bossart <nathandbossart@gmail.com> wrote:
I'm seeing some reliable test failures for 32-bit builds on cfbot [0]. At
a glance, it looks like the relations are swapped in the plan.
Thank you for the report. I've just pushed a patch which I'm hoping will fix it.
David