fix cost_subqueryscan wrong parallel cost
The cost_subqueryscan function does not consider whether the path is parallel.
regress
-- Incremental sort vs. set operations with varno 0
set enable_hashagg to off;
explain (costs off) select * from t union select * from t order by 1,3;
QUERY PLAN
----------------------------------------------------------
 Incremental Sort
   Sort Key: t.a, t.c
   Presorted Key: t.a
   ->  Unique
         ->  Sort
               Sort Key: t.a, t.b, t.c
               ->  Append
                     ->  Gather
                           Workers Planned: 2
                           ->  Parallel Seq Scan on t
                     ->  Gather
                           Workers Planned: 2
                           ->  Parallel Seq Scan on t t_1
to
 Incremental Sort
   Sort Key: t.a, t.c
   Presorted Key: t.a
   ->  Unique
         ->  Sort
               Sort Key: t.a, t.b, t.c
               ->  Gather
                     Workers Planned: 2
                     ->  Parallel Append
                           ->  Parallel Seq Scan on t
                           ->  Parallel Seq Scan on t t_1
Obviously the latter is less expensive.
bucoo@sohu.com
Attachments:
fix-cost_subqueryscan-get-wrong-parallel-cost.patch
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b787c6f81a..777077c125 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1427,6 +1427,28 @@ cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root,
     startup_cost += path->path.pathtarget->cost.startup;
     run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+    /* Adjust costing for parallelism, if used. */
+    if (path->path.parallel_workers > 0)
+    {
+        double parallel_divisor = get_parallel_divisor(&path->path);
+
+        /* The CPU cost is divided among all the workers. */
+        run_cost /= parallel_divisor;
+
+        /*
+         * It may be possible to amortize some of the I/O cost, but probably
+         * not very much, because most operating systems already do aggressive
+         * prefetching.  For now, we assume that the disk run cost can't be
+         * amortized at all.
+         */
+
+        /*
+         * In the case of a parallel plan, the row count needs to represent
+         * the number of tuples processed per worker.
+         */
+        path->path.rows = clamp_row_est(path->path.rows / parallel_divisor);
+    }
+
     path->path.startup_cost += startup_cost;
     path->path.total_cost += startup_cost + run_cost;
 }
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 21c429226f..0d8d77140a 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -1487,14 +1487,12 @@ explain (costs off) select * from t union select * from t order by 1,3;
    ->  Unique
          ->  Sort
                Sort Key: t.a, t.b, t.c
-               ->  Append
-                     ->  Gather
-                           Workers Planned: 2
+               ->  Gather
+                     Workers Planned: 2
+                     ->  Parallel Append
                            ->  Parallel Seq Scan on t
-                     ->  Gather
-                           Workers Planned: 2
                            ->  Parallel Seq Scan on t t_1
-(13 rows)
+(11 rows)
-- Full sort, not just incremental sort can be pushed below a gather merge path
-- by generate_useful_gather_paths.
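For reference, the get_parallel_divisor() call used in the hunk above can be sketched as follows. This is a simplified stand-alone reimplementation (assuming parallel_leader_participation is enabled), not the exact costsize.c code; the function name is kept only for clarity.

```c
/*
 * Simplified sketch of get_parallel_divisor() from costsize.c, assuming
 * parallel_leader_participation is on.  The leader is expected to do a
 * shrinking share of the work as more workers are planned, and to
 * contribute nothing once 4 or more workers are in use.
 */
double
parallel_divisor_sketch(int parallel_workers)
{
    double divisor = (double) parallel_workers;
    double leader_contribution = 1.0 - (0.3 * parallel_workers);

    if (leader_contribution > 0.0)
        divisor += leader_contribution;

    return divisor;
}
```

With 2 planned workers this yields 2 + (1 - 0.6) = 2.4, the divisor mentioned later in the thread.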
On Tue, Apr 12, 2022 at 2:57 AM bucoo@sohu.com <bucoo@sohu.com> wrote:
The cost_subqueryscan function does not consider whether the path is parallel.
I don't see any reason why it would need to do that. A subquery scan
isn't parallel aware.
regress
-- Incremental sort vs. set operations with varno 0
set enable_hashagg to off;
explain (costs off) select * from t union select * from t order by 1,3;
QUERY PLAN
----------------------------------------------------------
 Incremental Sort
   Sort Key: t.a, t.c
   Presorted Key: t.a
   ->  Unique
         ->  Sort
               Sort Key: t.a, t.b, t.c
               ->  Append
                     ->  Gather
                           Workers Planned: 2
                           ->  Parallel Seq Scan on t
                     ->  Gather
                           Workers Planned: 2
                           ->  Parallel Seq Scan on t t_1
to
 Incremental Sort
   Sort Key: t.a, t.c
   Presorted Key: t.a
   ->  Unique
         ->  Sort
               Sort Key: t.a, t.b, t.c
               ->  Gather
                     Workers Planned: 2
                     ->  Parallel Append
                           ->  Parallel Seq Scan on t
                           ->  Parallel Seq Scan on t t_1
Obviously the latter is less expensive
Generally it should be. But there's no subquery scan visible here.
There may well be something wrong here, but I don't think that you've
diagnosed the problem correctly, or explained it clearly.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Apr 15, 2022 at 12:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Apr 12, 2022 at 2:57 AM bucoo@sohu.com <bucoo@sohu.com> wrote:
The cost_subqueryscan function does not consider whether the path is parallel.
I don't see any reason why it would need to do that. A subquery scan
isn't parallel aware.
regress
-- Incremental sort vs. set operations with varno 0
set enable_hashagg to off;
explain (costs off) select * from t union select * from t order by 1,3;
QUERY PLAN
----------------------------------------------------------
 Incremental Sort
   Sort Key: t.a, t.c
   Presorted Key: t.a
   ->  Unique
         ->  Sort
               Sort Key: t.a, t.b, t.c
               ->  Append
                     ->  Gather
                           Workers Planned: 2
                           ->  Parallel Seq Scan on t
                     ->  Gather
                           Workers Planned: 2
                           ->  Parallel Seq Scan on t t_1
to
 Incremental Sort
   Sort Key: t.a, t.c
   Presorted Key: t.a
   ->  Unique
         ->  Sort
               Sort Key: t.a, t.b, t.c
               ->  Gather
                     Workers Planned: 2
                     ->  Parallel Append
                           ->  Parallel Seq Scan on t
                           ->  Parallel Seq Scan on t t_1
Obviously the latter is less expensive.
Generally it should be. But there's no subquery scan visible here.
The paths of subtrees in set operations would be of type SubqueryScan.
The SubqueryScan nodes are removed later in set_plan_references() in
this case as they are considered as being trivial.
There may well be something wrong here, but I don't think that you've
diagnosed the problem correctly, or explained it clearly.
Some debugging work shows that the second path is generated but then
fails when competing with the first path. So if there is something
wrong, I think cost calculation is the suspicious point.
Not related to this topic but I noticed another problem from the plan.
Note the first Sort node which is to unique-ify the result of the UNION.
Why cannot we re-arrange the sort keys from (a, b, c) to (a, c, b) so
that we can avoid the second Sort node?
Thanks
Richard
Generally it should be. But there's no subquery scan visible here.
I wrote a patch for distinct/union and aggregate support last year (I want to restart it again).
/messages/by-id/2021091517250848215321@sohu.com
Without that patch, some parallel paths will never be selected.
Some debugging work shows that the second path is generated but then
fails when competing with the first path. So if there is something
wrong, I think cost calculation is the suspicious point.
Maybe, I will check it again.
Not related to this topic but I noticed another problem from the plan.
Note the first Sort node which is to unique-ify the result of the UNION.
Why cannot we re-arrange the sort keys from (a, b, c) to (a, c, b) so
that we can avoid the second Sort node?
This is a regression test, just for testing the Incremental Sort plan.
On Fri, Apr 15, 2022 at 05:16:44PM +0800, Richard Guo wrote:
Not related to this topic but I noticed another problem from the plan.
Note the first Sort node which is to unique-ify the result of the UNION.
Why cannot we re-arrange the sort keys from (a, b, c) to (a, c, b) so
that we can avoid the second Sort node?
I don't know, but it's possible there's a solution related to commit db0d67db2
"Optimize order of GROUP BY keys" - DISTINCT is the same as GROUP BY k1, ...,
kN. I guess UNION [DISTINCT] should learn to use GROUP BY rather than
DISTINCT?
--
Justin
On Fri, Apr 15, 2022 at 6:06 AM bucoo@sohu.com <bucoo@sohu.com> wrote:
Generally it should be. But there's no subquery scan visible here.
I wrote a patch for distinct/union and aggregate support last year (I want to restart it again).
/messages/by-id/2021091517250848215321@sohu.com
Without that patch, some parallel paths will never be selected.
Sure, but that doesn't make the patch correct. The patch proposes
that, when parallelism is in use, a subquery scan will produce fewer rows
than when parallelism is not in use, and that's 100% false. Compare
this with the case of a parallel sequential scan. If a table contains
1000 rows, and we scan it with a regular Seq Scan, the Seq Scan will
return 1000 rows. But if we scan it with a Parallel Seq Scan using
say 4 workers, the number of rows returned in each worker will be
substantially less than 1000, because 1000 is now the *total* number
of rows to be returned across *all* processes, and what we need is the
number of rows returned in *each* process.
The same thing isn't true for a subquery scan. Consider:
Gather
-> Subquery Scan
-> Parallel Seq Scan
One thing is for sure: the number of rows that will be produced by the
subquery scan in each backend is exactly equal to the number of rows
that the subquery scan receives from its subpath. Parallel Seq Scan
can't just return a row count estimate based on the number of rows in
the table, because those rows are going to be divided among the
workers. But the Subquery Scan doesn't do anything like that. If it
receives let's say 250 rows as input in each worker, it's going to
produce 250 output rows in each worker. Your patch says it's going to
produce fewer than that, and that's wrong, regardless of whether it
gives you the plan you want in this particular case.
--
Robert Haas
EDB: http://www.enterprisedb.com
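Robert's distinction above can be illustrated with a small numeric sketch (the helper names here are hypothetical, not PostgreSQL functions): a Parallel Seq Scan's per-worker row estimate is the table total divided by the parallel divisor, while a Subquery Scan simply passes its per-worker input through its own quals.

```c
/*
 * Hypothetical illustration of the row accounting described above.
 * A parallel-aware scan divides the table's total rows among processes;
 * a Subquery Scan stacked on top emits, per worker, exactly the rows it
 * receives (times the selectivity of its own quals, if any) -- it must
 * not divide the rows again.
 */
double
seqscan_rows_per_worker(double total_rows, double parallel_divisor)
{
    return total_rows / parallel_divisor;
}

double
subqueryscan_rows_per_worker(double input_rows_per_worker, double selectivity)
{
    return input_rows_per_worker * selectivity;
}
```

With 1000 table rows and a divisor of 4.0, each worker's Parallel Seq Scan returns 250 rows, and a filter-free Subquery Scan above it must also return 250 rows per worker, not 250 scaled down a second time.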
Sure, but that doesn't make the patch correct. The patch proposes
that, when parallelism is in use, a subquery scan will produce fewer rows
than when parallelism is not in use, and that's 100% false. Compare
this with the case of a parallel sequential scan. If a table contains
1000 rows, and we scan it with a regular Seq Scan, the Seq Scan will
return 1000 rows. But if we scan it with a Parallel Seq Scan using
say 4 workers, the number of rows returned in each worker will be
substantially less than 1000, because 1000 is now the *total* number
of rows to be returned across *all* processes, and what we need is the
number of rows returned in *each* process.
for now the function cost_subqueryscan always uses the *total* rows even for a parallel
path, like this:
Gather (rows=30000)
  Workers Planned: 2
  ->  Subquery Scan (rows=30000)  -- *total* rows, should equal the subpath's
        ->  Parallel Seq Scan (rows=10000)
Maybe the code:
    /* Mark the path with the correct row estimate */
    if (param_info)
        path->path.rows = param_info->ppi_rows;
    else
        path->path.rows = baserel->rows;
should change to:
    /* Mark the path with the correct row estimate */
    if (path->path.parallel_workers > 0)
        path->path.rows = path->subpath->rows;
    else if (param_info)
        path->path.rows = param_info->ppi_rows;
    else
        path->path.rows = baserel->rows;
bucoo@sohu.com
On Wed, Apr 20, 2022 at 10:01 AM bucoo@sohu.com <bucoo@sohu.com> wrote:
for now the function cost_subqueryscan always uses the *total* rows even for a parallel
path, like this:
Gather (rows=30000)
  Workers Planned: 2
  ->  Subquery Scan (rows=30000)  -- *total* rows, should equal the subpath's
        ->  Parallel Seq Scan (rows=10000)
OK, that's bad.
Maybe the code:
    /* Mark the path with the correct row estimate */
    if (param_info)
        path->path.rows = param_info->ppi_rows;
    else
        path->path.rows = baserel->rows;
should change to:
    /* Mark the path with the correct row estimate */
    if (path->path.parallel_workers > 0)
        path->path.rows = path->subpath->rows;
    else if (param_info)
        path->path.rows = param_info->ppi_rows;
    else
        path->path.rows = baserel->rows;
Suppose parallelism is not in use and that param_info is NULL. Then,
is path->subpath->rows guaranteed to be equal to baserel->rows? If
yes, then we don't need a three-part if statement as you propose
here and can just change the "else" clause to say path->path.rows =
path->subpath->rows. If no, then your change gives the wrong answer.
--
Robert Haas
EDB: http://www.enterprisedb.com
for now the function cost_subqueryscan always uses the *total* rows even for a parallel
path, like this:
Gather (rows=30000)
  Workers Planned: 2
  ->  Subquery Scan (rows=30000)  -- *total* rows, should equal the subpath's
        ->  Parallel Seq Scan (rows=10000)
OK, that's bad.
Maybe the code:
    /* Mark the path with the correct row estimate */
    if (param_info)
        path->path.rows = param_info->ppi_rows;
    else
        path->path.rows = baserel->rows;
should change to:
    /* Mark the path with the correct row estimate */
    if (path->path.parallel_workers > 0)
        path->path.rows = path->subpath->rows;
    else if (param_info)
        path->path.rows = param_info->ppi_rows;
    else
        path->path.rows = baserel->rows;
Suppose parallelism is not in use and that param_info is NULL. Then,
is path->subpath->rows guaranteed to be equal to baserel->rows? If
yes, then we don't need a three-part if statement as you propose
here and can just change the "else" clause to say path->path.rows =
path->subpath->rows. If no, then your change gives the wrong answer.
I checked some regression tests. Sometimes a subquery scan has a filter,
so path->subpath->rows is guaranteed *not* to be equal to baserel->rows.
If the first patch is wrong, I don't know how to fix this;
it looks like I need someone's help.
On Thu, Apr 21, 2022 at 2:38 AM bucoo@sohu.com <bucoo@sohu.com> wrote:
Suppose parallelism is not in use and that param_info is NULL. Then,
is path->subpath->rows guaranteed to be equal to baserel->rows? If
yes, then we don't need a three-part if statement as you propose
here and can just change the "else" clause to say path->path.rows =
path->subpath->rows. If no, then your change gives the wrong answer.
I checked some regression tests. Sometimes a subquery scan has a filter,
so path->subpath->rows is guaranteed *not* to be equal to baserel->rows.
If the first patch is wrong, I don't know how to fix this;
it looks like I need someone's help.
Please fix your mailer so that it doesn't send me a bounce message
every time I reply to one of your messages on list.
I don't know how to fix this right now either, then; maybe I or
someone else will have a good idea later.
--
Robert Haas
EDB: http://www.enterprisedb.com
Suppose parallelism is not in use and that param_info is NULL. Then,
is path->subpath->rows guaranteed to be equal to baserel->rows? If
yes, then we don't need a three-part if statement as you propose
here and can just change the "else" clause to say path->path.rows =
path->subpath->rows. If no, then your change gives the wrong answer.
I checked some regression tests. Sometimes a subquery scan has a filter,
so path->subpath->rows is guaranteed *not* to be equal to baserel->rows.
If the first patch is wrong, I don't know how to fix this;
it looks like I need someone's help.
Please fix your mailer so that it doesn't send me a bounce message
every time I reply to one of your messages on list.
This message send using Outlook.
I don't know how to fix this right now either, then; maybe I or
someone else will have a good idea later.
I don't know either.
On Wed, Apr 20, 2022 at 11:38 PM bucoo@sohu.com <bucoo@sohu.com> wrote:
for now the function cost_subqueryscan always uses the *total* rows even for a parallel
path, like this:
Gather (rows=30000)
  Workers Planned: 2
  ->  Subquery Scan (rows=30000)  -- *total* rows, should equal the subpath's
        ->  Parallel Seq Scan (rows=10000)
OK, that's bad.
I don't understand how that plan shape is possible. Gather requires a
parallel-aware subpath, so that subpath can be executed multiple times in
parallel, and a subquery isn't. If there is parallelism happening within a
subquery, the results are consolidated using Append or Gather first, and
the output rows of that path entry (all subpaths of Subquery have the same
->rows value per set_subquery_size_estimates) become the input tuples for
Subquery, to which it then applies its selectivity multiplier and stores
the final result in baserel->rows, which the costing code then examines
when costing the RTE_SUBQUERY path entry.
David J.
On Fri, Apr 22, 2022 at 11:55 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:
On Wed, Apr 20, 2022 at 11:38 PM bucoo@sohu.com <bucoo@sohu.com> wrote:
for now the function cost_subqueryscan always uses the *total* rows even for a parallel
path, like this:
Gather (rows=30000)
  Workers Planned: 2
  ->  Subquery Scan (rows=30000)  -- *total* rows, should equal the subpath's
        ->  Parallel Seq Scan (rows=10000)
OK, that's bad.
I don't understand how that plan shape is possible. Gather requires a parallel aware subpath, so said subpath can be executed multiple times in parallel, and subquery isn't. If there is parallelism happening within a subquery the results are consolidated using Append or Gather first - and the output rows of that path entry (all subpaths of Subquery have the same ->row value per set_subquery_size_estimates), become the input tuples for Subquery, to which it then applies its selectivity multiplier and stores the final result in baserel->rows; which the costing code then examines when costing the RTE_SUBQUERY path entry.
Gather doesn't require a parallel aware subpath, just a parallel-safe
subpath. In a case like this, the parallel seq scan will divide the
rows from the underlying relation across the three processes executing
it. Each process will pass the rows it receives through its own copy
of the subquery scan. Then, the Gather node will collect all the rows
from all the workers to produce the final result.
It's an extremely important feature of parallel query that the
parallel-aware node doesn't have to be immediately beneath the Gather.
You need to have a parallel-aware node in there someplace, but it
could be separated from the gather by any number of levels e.g.
Gather
  ->  Nested Loop
        ->  Nested Loop
              ->  Nested Loop
                    ->  Parallel Seq Scan
                    ->  Index Scan
              ->  Index Scan
        ->  Index Scan
You can stick as many parameterized index scans in there as you like
and you still only need one parallel-aware node at the bottom. Once
the parallel seq scan divides up the rows across workers, each worker
can perform the index lookups for the rows that it receives without
any coordination with other workers. It neither knows nor cares that
this is happening in the midst of an operation that is parallel
overall; all the nested loops and index scans work just as they would
in a non-parallel plan. The only things that needed to know about
parallelism are the operation that is dividing up the rows among
workers (here, the parallel seq scan) and the operation that is
gathering up all the results produced by individual workers (here, the
gather node).
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Apr 28, 2022 at 9:53 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Apr 22, 2022 at 11:55 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:On Wed, Apr 20, 2022 at 11:38 PM bucoo@sohu.com <bucoo@sohu.com> wrote:
for now the function cost_subqueryscan always uses the *total* rows even for a parallel
path, like this:
Gather (rows=30000)
  Workers Planned: 2
  ->  Subquery Scan (rows=30000)  -- *total* rows, should equal the subpath's
        ->  Parallel Seq Scan (rows=10000)
OK, that's bad.
Gather doesn't require a parallel aware subpath, just a parallel-safe
subpath. In a case like this, the parallel seq scan will divide the
rows from the underlying relation across the three processes executing
it. Each process will pass the rows it receives through its own copy
of the subquery scan. Then, the Gather node will collect all the rows
from all the workers to produce the final result.
Thank you. I think I got off on a tangent there and do understand the
general design here better now.
I feel like the fact that the 2.4 divisor (for 2 planned workers) isn't
shown in the explain plan anywhere is an oversight.
To move the original complaint forward a bit I am posting the three plan
changes that using path->subpath->rows provokes in the regression tests.
======================================================================
--
-- Check for incorrect optimization when IN subquery contains a SRF
--
select * from int4_tbl o where (f1, f1) in
(select f1, generate_series(1,50) / 10 g from int4_tbl i group by f1);
The material difference between the existing plan and this one is the
estimation of 250 rows
here compared to 1 row.
So (rel.rows != path->subpath->rows) at the top of cost_subqueryscan
+               ->  Subquery Scan on "ANY_subquery"  (cost=1.06..9.28 rows=250 width=8)
+                     Output: "ANY_subquery".f1, "ANY_subquery".g
+                     Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
+                     ->  Result  (cost=1.06..6.15 rows=250 width=8)
======================================================================
The second plan change is basically this same thing; going from rows=4 to
rows=1 causes the plan to include a materialize node. The shape, for
purposes of the security barrier, remains correct.
======================================================================
select * from t union select * from t order by 1,3;
Gather here costs 2,600 vs the Append being 2,950 in the existing plan shape.
+         ->  Gather  (cost=0.00..2600.00 rows=120000 width=12)
+               Workers Planned: 2
+               ->  Parallel Append  (cost=0.00..2600.00 rows=50000 width=12)
+                     ->  Parallel Seq Scan on t  (cost=0.00..575.00 rows=25000 width=12)
+                     ->  Parallel Seq Scan on t t_1  (cost=0.00..575.00 rows=25000 width=12)
=======================================================================
I've attached the two raw regression output diffs.
Using path->subpath->rows ignores the impact of the node's own filters, but
the base pre-filter number is, or should be, the correct one; though it is
difficult to say that with certainty when most of these nodes are discarded
and one cannot debug in the middle but only observe the end results.
Disabling that optimization is presently beyond my skill, though I may take
it up anyway, as it's likely still orders of magnitude easier to do and then
hope some of these plans produce data to check with, than actually diving
into a C debugger for the first time.
Reverse engineering the 350 difference may be another approach, namely: is
it strictly due to the different plan shape, or is it due to the number of
rows? The fact that the row difference is 35,000 and the cost is 1%
(cpu_tuple_cost = 0.01) of that number seems like a red herring after
thinking it through... too many scans plus the differing shapes.
David J.
Attachments:
r-before.txt
diff -U3 /home/vagrant/postgresql/src/test/regress/expected/subselect.out /home/vagrant/postgresql/src/test/regress/results/subselect.out
--- /home/vagrant/postgresql/src/test/regress/expected/subselect.out 2022-04-11 02:01:56.846066150 +0000
+++ /home/vagrant/postgresql/src/test/regress/results/subselect.out 2022-04-29 00:12:41.656445011 +0000
@@ -1298,29 +1298,29 @@
--
-- Check for incorrect optimization when IN subquery contains a SRF
--
-explain (verbose, costs off)
+explain (verbose)
select * from int4_tbl o where (f1, f1) in
(select f1, generate_series(1,50) / 10 g from int4_tbl i group by f1);
- QUERY PLAN
--------------------------------------------------------------------
- Nested Loop Semi Join
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------
+ Nested Loop Semi Join (cost=1.06..10.40 rows=1 width=4)
Output: o.f1
Join Filter: (o.f1 = "ANY_subquery".f1)
- -> Seq Scan on public.int4_tbl o
+ -> Seq Scan on public.int4_tbl o (cost=0.00..1.05 rows=5 width=4)
Output: o.f1
- -> Materialize
+ -> Materialize (cost=1.06..9.28 rows=1 width=8)
Output: "ANY_subquery".f1, "ANY_subquery".g
- -> Subquery Scan on "ANY_subquery"
+ -> Subquery Scan on "ANY_subquery" (cost=1.06..9.28 rows=1 width=8)
Output: "ANY_subquery".f1, "ANY_subquery".g
Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
- -> Result
+ -> Result (cost=1.06..6.15 rows=250 width=8)
Output: i.f1, ((generate_series(1, 50)) / 10)
- -> ProjectSet
+ -> ProjectSet (cost=1.06..2.40 rows=250 width=8)
Output: generate_series(1, 50), i.f1
- -> HashAggregate
+ -> HashAggregate (cost=1.06..1.11 rows=5 width=4)
Output: i.f1
Group Key: i.f1
- -> Seq Scan on public.int4_tbl i
+ -> Seq Scan on public.int4_tbl i (cost=0.00..1.05 rows=5 width=4)
Output: i.f1
(19 rows)
diff -U3 /home/vagrant/postgresql/src/test/regress/expected/privileges.out /home/vagrant/postgresql/src/test/regress/results/privileges.out
--- /home/vagrant/postgresql/src/test/regress/expected/privileges.out 2022-04-15 22:58:14.784668286 +0000
+++ /home/vagrant/postgresql/src/test/regress/results/privileges.out 2022-04-29 00:12:49.520444688 +0000
@@ -363,17 +363,17 @@
(6 rows)
-- But a security barrier view isolates the leaky operator.
-EXPLAIN (COSTS OFF) SELECT * FROM atest12sbv x, atest12sbv y
+EXPLAIN SELECT * FROM atest12sbv x, atest12sbv y
WHERE x.a = y.b and abs(y.a) <<< 5;
- QUERY PLAN
--------------------------------------
- Nested Loop
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Nested Loop (cost=0.00..340.15 rows=1 width=16)
Join Filter: (atest12_1.a = y.b)
- -> Subquery Scan on y
+ -> Subquery Scan on y (cost=0.00..170.06 rows=1 width=8)
Filter: (abs(y.a) <<< 5)
- -> Seq Scan on atest12
+ -> Seq Scan on atest12 (cost=0.00..170.00 rows=4 width=8)
Filter: (b <<< 5)
- -> Seq Scan on atest12 atest12_1
+ -> Seq Scan on atest12 atest12_1 (cost=0.00..170.00 rows=4 width=8)
Filter: (b <<< 5)
(8 rows)
diff -U3 /home/vagrant/postgresql/src/test/regress/expected/incremental_sort.out /home/vagrant/postgresql/src/test/regress/results/incremental_sort.out
--- /home/vagrant/postgresql/src/test/regress/expected/incremental_sort.out 2022-04-15 22:58:14.776668287 +0000
+++ /home/vagrant/postgresql/src/test/regress/results/incremental_sort.out 2022-04-29 00:12:55.196444456 +0000
@@ -1478,22 +1478,22 @@
-- Incremental sort vs. set operations with varno 0
set enable_hashagg to off;
-explain (costs off) select * from t union select * from t order by 1,3;
- QUERY PLAN
-----------------------------------------------------------
- Incremental Sort
+explain select * from t union select * from t order by 1,3;
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------
+ Incremental Sort (cost=29425.50..47127.57 rows=120000 width=12)
Sort Key: t.a, t.c
Presorted Key: t.a
- -> Unique
- -> Sort
+ -> Unique (cost=29344.85..30544.85 rows=120000 width=12)
+ -> Sort (cost=29344.85..29644.85 rows=120000 width=12)
Sort Key: t.a, t.b, t.c
- -> Append
- -> Gather
+ -> Append (cost=0.00..2950.00 rows=120000 width=12)
+ -> Gather (cost=0.00..575.00 rows=60000 width=12)
Workers Planned: 2
- -> Parallel Seq Scan on t
- -> Gather
+ -> Parallel Seq Scan on t (cost=0.00..575.00 rows=25000 width=12)
+ -> Gather (cost=0.00..575.00 rows=60000 width=12)
Workers Planned: 2
- -> Parallel Seq Scan on t t_1
+ -> Parallel Seq Scan on t t_1 (cost=0.00..575.00 rows=25000 width=12)
(13 rows)
-- Full sort, not just incremental sort can be pushed below a gather merge path
r-after.txt
diff -U3 /home/vagrant/postgresql/src/test/regress/expected/subselect.out /home/vagrant/postgresql/src/test/regress/results/subselect.out
--- /home/vagrant/postgresql/src/test/regress/expected/subselect.out 2022-04-11 02:01:56.846066150 +0000
+++ /home/vagrant/postgresql/src/test/regress/results/subselect.out 2022-04-29 00:16:18.988436098 +0000
@@ -1298,31 +1298,35 @@
--
-- Check for incorrect optimization when IN subquery contains a SRF
--
-explain (verbose, costs off)
+explain (verbose)
select * from int4_tbl o where (f1, f1) in
(select f1, generate_series(1,50) / 10 g from int4_tbl i group by f1);
- QUERY PLAN
--------------------------------------------------------------------
- Nested Loop Semi Join
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------
+ Hash Join (cost=9.30..10.37 rows=1 width=4)
Output: o.f1
- Join Filter: (o.f1 = "ANY_subquery".f1)
- -> Seq Scan on public.int4_tbl o
+ Inner Unique: true
+ Hash Cond: (o.f1 = "ANY_subquery".f1)
+ -> Seq Scan on public.int4_tbl o (cost=0.00..1.05 rows=5 width=4)
Output: o.f1
- -> Materialize
+ -> Hash (cost=9.29..9.29 rows=1 width=8)
Output: "ANY_subquery".f1, "ANY_subquery".g
- -> Subquery Scan on "ANY_subquery"
+ -> HashAggregate (cost=9.28..9.29 rows=1 width=8)
Output: "ANY_subquery".f1, "ANY_subquery".g
- Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
- -> Result
- Output: i.f1, ((generate_series(1, 50)) / 10)
- -> ProjectSet
- Output: generate_series(1, 50), i.f1
- -> HashAggregate
- Output: i.f1
- Group Key: i.f1
- -> Seq Scan on public.int4_tbl i
+ Group Key: "ANY_subquery".f1, "ANY_subquery".g
+ -> Subquery Scan on "ANY_subquery" (cost=1.06..9.28 rows=250 width=8)
+ Output: "ANY_subquery".f1, "ANY_subquery".g
+ Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
+ -> Result (cost=1.06..6.15 rows=250 width=8)
+ Output: i.f1, ((generate_series(1, 50)) / 10)
+ -> ProjectSet (cost=1.06..2.40 rows=250 width=8)
+ Output: generate_series(1, 50), i.f1
+ -> HashAggregate (cost=1.06..1.11 rows=5 width=4)
Output: i.f1
-(19 rows)
+ Group Key: i.f1
+ -> Seq Scan on public.int4_tbl i (cost=0.00..1.05 rows=5 width=4)
+ Output: i.f1
+(23 rows)
select * from int4_tbl o where (f1, f1) in
(select f1, generate_series(1,50) / 10 g from int4_tbl i group by f1);
diff -U3 /home/vagrant/postgresql/src/test/regress/expected/privileges.out /home/vagrant/postgresql/src/test/regress/results/privileges.out
--- /home/vagrant/postgresql/src/test/regress/expected/privileges.out 2022-04-15 22:58:14.784668286 +0000
+++ /home/vagrant/postgresql/src/test/regress/results/privileges.out 2022-04-29 00:16:26.396435795 +0000
@@ -363,19 +363,20 @@
(6 rows)
-- But a security barrier view isolates the leaky operator.
-EXPLAIN (COSTS OFF) SELECT * FROM atest12sbv x, atest12sbv y
+EXPLAIN SELECT * FROM atest12sbv x, atest12sbv y
WHERE x.a = y.b and abs(y.a) <<< 5;
- QUERY PLAN
--------------------------------------
- Nested Loop
- Join Filter: (atest12_1.a = y.b)
- -> Subquery Scan on y
- Filter: (abs(y.a) <<< 5)
- -> Seq Scan on atest12
- Filter: (b <<< 5)
- -> Seq Scan on atest12 atest12_1
+ QUERY PLAN
+-------------------------------------------------------------------------------------
+ Nested Loop (cost=0.00..340.35 rows=1 width=16)
+ Join Filter: (atest12.a = y.b)
+ -> Seq Scan on atest12 (cost=0.00..170.00 rows=4 width=8)
Filter: (b <<< 5)
-(8 rows)
+ -> Materialize (cost=0.00..170.08 rows=4 width=8)
+ -> Subquery Scan on y (cost=0.00..170.06 rows=4 width=8)
+ Filter: (abs(y.a) <<< 5)
+ -> Seq Scan on atest12 atest12_1 (cost=0.00..170.00 rows=4 width=8)
+ Filter: (b <<< 5)
+(9 rows)
-- Now regress_priv_user1 grants sufficient access to regress_priv_user2.
SET SESSION AUTHORIZATION regress_priv_user1;
diff -U3 /home/vagrant/postgresql/src/test/regress/expected/incremental_sort.out /home/vagrant/postgresql/src/test/regress/results/incremental_sort.out
--- /home/vagrant/postgresql/src/test/regress/expected/incremental_sort.out 2022-04-15 22:58:14.776668287 +0000
+++ /home/vagrant/postgresql/src/test/regress/results/incremental_sort.out 2022-04-29 00:16:32.376435549 +0000
@@ -1478,23 +1478,21 @@
-- Incremental sort vs. set operations with varno 0
set enable_hashagg to off;
-explain (costs off) select * from t union select * from t order by 1,3;
- QUERY PLAN
-----------------------------------------------------------
- Incremental Sort
+explain select * from t union select * from t order by 1,3;
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------
+ Incremental Sort (cost=29075.50..46777.57 rows=120000 width=12)
Sort Key: t.a, t.c
Presorted Key: t.a
- -> Unique
- -> Sort
+ -> Unique (cost=28994.85..30194.85 rows=120000 width=12)
+ -> Sort (cost=28994.85..29294.85 rows=120000 width=12)
Sort Key: t.a, t.b, t.c
- -> Append
- -> Gather
- Workers Planned: 2
- -> Parallel Seq Scan on t
- -> Gather
- Workers Planned: 2
- -> Parallel Seq Scan on t t_1
-(13 rows)
+ -> Gather (cost=0.00..2600.00 rows=120000 width=12)
+ Workers Planned: 2
+ -> Parallel Append (cost=0.00..2600.00 rows=50000 width=12)
+ -> Parallel Seq Scan on t (cost=0.00..575.00 rows=25000 width=12)
+ -> Parallel Seq Scan on t t_1 (cost=0.00..575.00 rows=25000 width=12)
+(11 rows)
-- Full sort, not just incremental sort can be pushed below a gather merge path
-- by generate_useful_gather_paths.
On Fri, Apr 29, 2022 at 12:53 AM Robert Haas <robertmhaas@gmail.com> wrote:
Gather doesn't require a parallel aware subpath, just a parallel-safe
subpath. In a case like this, the parallel seq scan will divide the
rows from the underlying relation across the three processes executing
it. Each process will pass the rows it receives through its own copy
of the subquery scan. Then, the Gather node will collect all the rows
from all the workers to produce the final result.
It's an extremely important feature of parallel query that the
parallel-aware node doesn't have to be immediately beneath the Gather.
You need to have a parallel-aware node in there someplace, but it
could be separated from the gather by any number of levels, e.g.
Gather
-> Nested Loop
  -> Nested Loop
    -> Nested Loop
      -> Parallel Seq Scan
      -> Index Scan
    -> Index Scan
  -> Index Scan
Thanks for the explanation. That's really helpful to understand the
parallel query mechanism.
So for the nodes between Gather and parallel-aware node, how should we
calculate their estimated rows?
Currently subquery scan is using rel->rows (if no parameterization),
which I believe is not correct. That's not the size the subquery scan
node in each worker needs to handle, as the rows have been divided
across workers by the parallel-aware node.
Using subpath->rows is not correct either, as subquery scan node may
have quals.
It seems to me the right way is to divide the rel->rows among all the
workers.
Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes:
Currently subquery scan is using rel->rows (if no parameterization),
which I believe is not correct. That's not the size the subquery scan
node in each worker needs to handle, as the rows have been divided
across workers by the parallel-aware node.
Really? Maybe I misunderstand the case under consideration, but
what I think will be happening is that each worker will re-execute
the pushed-down subquery in full. Otherwise it can't compute the
correct answer. What gets divided across the set of workers is
the total *number of executions* of the subquery, which should be
independent of the number of workers, so that the cost is (more
or less) the same as the non-parallel case.
At least that's true for a standard correlated subplan, which is
normally run again for each row processed by the parent node.
For hashed subplans and initplans, what would have been "execute
once" semantics becomes "execute once per worker", creating a
strict cost disadvantage for parallelization. I don't know
whether the current costing model accounts for that. But if it
does that wrong, arbitrarily altering the number of rows won't
make it better.
regards, tom lane
On Fri, Apr 29, 2022 at 7:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <guofenglinux@gmail.com> writes:
Currently subquery scan is using rel->rows (if no parameterization),
which I believe is not correct. That's not the size the subquery scan
node in each worker needs to handle, as the rows have been divided
across workers by the parallel-aware node.
Really? Maybe I misunderstand the case under consideration, but
what I think will be happening is that each worker will re-execute
the pushed-down subquery in full. Otherwise it can't compute the
correct answer. What gets divided across the set of workers is
the total *number of executions* of the subquery, which should be
independent of the number of workers, so that the cost is (more
or less) the same as the non-parallel case.
I've been operating under the belief that a subquery node (or, rather, all
nodes) in a path get their row count estimates from self-selectivity * rows
provided by the next node down in the path. In a partial path, eventually
the parallel aware node at the bottom of the path divides its row count
estimate by the parallel divisor (2.4 for two workers). That value then
returns back up through the path until it hits a gather. Every node in
between, which are almost exclusively parallel-safe (not parallel aware),
can just use the row count being percolated up the path (i.e., they get the
divided row count in their partial pathing and full row counts in the
normal pathing). The gather node, realizing that the row count it is
dealing with has been divided by the parallel divisor, undoes the division
in order to provide the correct row count to the non-parallel executing
nodes above it.
So a subquery is only ever executed once in a path - but the number of
returned rows depends on the number of planned workers done under the
assumption that a query executed among 2 workers costs 1/2.4 the amount of
the same query done with just the leader. It is an independent factor
compared to everything else going on and so the multiplication can simply
be done first to the original row count.
<starts looking for confirmation in the code>
allpaths.c@2974-2975 (generate_gather_paths)
(L: 2993 as well)
(L: 3167 - generate_useful_gather_paths)
rows =
cheapest_partial_path->rows * cheapest_partial_path->parallel_workers;
Shouldn't that be:
rows =
cheapest_partial_path->rows * (get_parallel_divisor(cheapest_partial_path))
?
David J.
I wrote:
Really? Maybe I misunderstand the case under consideration, but
what I think will be happening is that each worker will re-execute
the pushed-down subquery in full.
Oh ... nevermind that: what I was thinking about was the SubLink/SubPlan
case, which is unrelated to SubqueryScan. -ENOCAFFEINE, sorry about that.
regards, tom lane
I wrote:
-ENOCAFFEINE, sorry about that.
As penance for that blunder, I spent a little time looking into this
issue (responding to Robert's upthread complaint that it wasn't
explained clearly). See the test case in parallel_subquery.sql,
attached, which is adapted from the test in incremental_sort.sql.
It produces these plans:
explain select * from t union select * from t;
Unique (cost=29344.85..30544.85 rows=120000 width=12)
-> Sort (cost=29344.85..29644.85 rows=120000 width=12)
Sort Key: t.a, t.b, t.c
-> Append (cost=0.00..2950.00 rows=120000 width=12)
-> Gather (cost=0.00..575.00 rows=60000 width=12)
Workers Planned: 2
-> Parallel Seq Scan on t (cost=0.00..575.00 rows=25000 width=12)
-> Gather (cost=0.00..575.00 rows=60000 width=12)
Workers Planned: 2
-> Parallel Seq Scan on t t_1 (cost=0.00..575.00 rows=25000 width=12)
explain select * from t union all select * from t;
Gather (cost=0.00..1400.00 rows=120000 width=12)
Workers Planned: 2
-> Parallel Append (cost=0.00..1400.00 rows=50000 width=12)
-> Parallel Seq Scan on t (cost=0.00..575.00 rows=25000 width=12)
-> Parallel Seq Scan on t t_1 (cost=0.00..575.00 rows=25000 width=12)
I take no position on whether the second plan's costs are correct;
but if they are, then the Gather-atop-Append structure is clearly
cheaper than the Append-atop-Gathers structure, so the planner made
the wrong choice in the first case.
I then modified setrefs.c to not remove SubqueryScan nodes
(just make trivial_subqueryscan() return constant false)
and got this output from the UNION case:
Unique (cost=29344.85..30544.85 rows=120000 width=12)
-> Sort (cost=29344.85..29644.85 rows=120000 width=12)
Sort Key: "*SELECT* 1".a, "*SELECT* 1".b, "*SELECT* 1".c
-> Append (cost=0.00..2950.00 rows=120000 width=12)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..1175.00 rows=60000 width=12)
-> Gather (cost=0.00..575.00 rows=60000 width=12)
Workers Planned: 2
-> Parallel Seq Scan on t (cost=0.00..575.00 rows=25000 width=12)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..1175.00 rows=60000 width=12)
-> Gather (cost=0.00..575.00 rows=60000 width=12)
Workers Planned: 2
-> Parallel Seq Scan on t t_1 (cost=0.00..575.00 rows=25000 width=12)
The UNION ALL plan doesn't change, because no SubqueryScans are ever
created in that case (we flattened the UNION ALL to an appendrel
early on). Removing the test case's settings to allow a parallel
plan, the non-parallelized plan for the UNION case looks like
Unique (cost=30044.85..31244.85 rows=120000 width=12)
-> Sort (cost=30044.85..30344.85 rows=120000 width=12)
Sort Key: "*SELECT* 1".a, "*SELECT* 1".b, "*SELECT* 1".c
-> Append (cost=0.00..3650.00 rows=120000 width=12)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..1525.00 rows=60000 width=12)
-> Seq Scan on t (cost=0.00..925.00 rows=60000 width=12)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..1525.00 rows=60000 width=12)
-> Seq Scan on t t_1 (cost=0.00..925.00 rows=60000 width=12)
So: the row counts for SubqueryScan are correct, thus the upthread
proposal to change them is not. The cost estimates, however, seem
pretty bogus. What we're seeing here is that we're charging 600
cost units to pass the data through SubqueryScan, and that's
consistent across the parallel and non-parallel cases, so it's
correct on its own terms. But actually it's totally wrong because
we're going to elide that node later.
So I think the actual problem here is that we leave the decision
to elide no-op SubqueryScan nodes till setrefs.c. We should detect
that earlier, and when it applies, label the SubqueryScanPath with
the exact cost of its child.
(I think the current implementation might've been all right when
it was devised, on the grounds that we had no real planning
flexibility for UNION operations anyway. But now we do, and the
bogus cost charge is causing visible problems.)
regards, tom lane
I wrote:
So I think the actual problem here is that we leave the decision
to elide no-op SubqueryScan nodes till setrefs.c. We should detect
that earlier, and when it applies, label the SubqueryScanPath with
the exact cost of its child.
Hmm ... actually, while doing that seems like it'd be a good idea,
it doesn't have much bearing on the case at hand. I approximated
the results of such a change on this test case by just removing
the "cpu_tuple_cost" component from cost_subqueryscan:
@@ -1420,7 +1420,7 @@ cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root,
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
startup_cost = qpqual_cost.startup;
- cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+ cpu_per_tuple = qpqual_cost.per_tuple;
run_cost = cpu_per_tuple * baserel->tuples;
/* tlist eval costs are paid per output row, not per tuple scanned */
and what I got was
Unique (cost=28144.85..29344.85 rows=120000 width=12)
-> Sort (cost=28144.85..28444.85 rows=120000 width=12)
Sort Key: "*SELECT* 1".a, "*SELECT* 1".b, "*SELECT* 1".c
-> Append (cost=0.00..1750.00 rows=120000 width=12)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..575.00 rows=60000 width=12)
-> Gather (cost=0.00..575.00 rows=60000 width=12)
Workers Planned: 2
-> Parallel Seq Scan on t (cost=0.00..575.00 rows=25000 width=12)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..575.00 rows=60000 width=12)
-> Gather (cost=0.00..575.00 rows=60000 width=12)
Workers Planned: 2
-> Parallel Seq Scan on t t_1 (cost=0.00..575.00 rows=25000 width=12)
The subqueryscans are being costed the way we want now, but it's still
going for the append-atop-gathers plan. So I dug a bit deeper, and
found that generate_union_paths does also create the gather-atop-append
plan, but *it's costed at 1750 units*, exactly the same as the other
path. So add_path is just making an arbitrary choice between two paths
of identical cost, and it happens to pick this one.
(If I don't make the above tweak to cost_subqueryscan, the same thing
happens, although the two competing paths now each have cost 2950.)
So: why do we come out with a cost of 1750 for the very same plan
that in the UNION ALL case is costed at 1400? AFAICS it's because
the UNION ALL case thinks that the two inputs of the parallel append
produce 25000 rows apiece so the parallel append produces 50000 rows,
and it costs the append's overhead on that basis. But in the UNION
case, the parallel append sees two inputs that are claimed to return
60000 rows, so the append produces 120000 rows, meaning more append
overhead. We can't readily get EXPLAIN to print this tree since
it's rejected by add_path, but what I'm seeing (with the above
costing hack) is
{GATHERPATH
:rows 120000
:startup_cost 0.00
:total_cost 1750.00
:subpath
{APPENDPATH
:parallel_aware true
:parallel_safe true
:parallel_workers 2
:rows 120000
:startup_cost 0.00
:total_cost 1750.00
:subpaths (
{SUBQUERYSCANPATH
:rows 60000
:startup_cost 0.00
:total_cost 575.00
:subpath
{PATH
:parallel_aware true
:parallel_safe true
:parallel_workers 2
:rows 25000
:startup_cost 0.00
:total_cost 575.00
}
}
{SUBQUERYSCANPATH
:rows 60000
:startup_cost 0.00
:total_cost 575.00
:subpath
{PATH
:parallel_aware true
:parallel_safe true
:parallel_workers 2
:rows 25000
:startup_cost 0.00
:total_cost 575.00
}
}
)
}
}
In short, these SubqueryScans are being labeled as producing 60000 rows
when their input only produces 25000 rows, which is surely insane.
So: even though the SubqueryScan itself isn't parallel-aware, the number
of rows it processes has to be de-rated according to the number of workers
involved. Perhaps something like the original patch in this thread is
indeed correct. But I can't help feeling that this definition of
path_rows is mighty unintuitive and error-prone, and that if
cost_subqueryscan is wrong then it's likely got lots of company.
Maybe we need to take two steps back and rethink the whole approach.
regards, tom lane
On Fri, Apr 29, 2022 at 11:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
In short, these SubqueryScans are being labeled as producing 60000 rows
when their input only produces 25000 rows, which is surely insane.
So: even though the SubqueryScan itself isn't parallel-aware, the number
of rows it processes has to be de-rated according to the number of workers
involved.
Right, so why does baserel.rows show 60,000 here when path->subpath->rows
only shows 25,000? Because if you substitute path->subpath->rows for
baserel.rows in cost_subquery you get (with your cost change above):
Incremental Sort (cost=27875.50..45577.57 rows=120000 width=12) (actual time=165.285..235.749 rows=60000 loops=1)
Sort Key: "*SELECT* 1".a, "*SELECT* 1".c
Presorted Key: "*SELECT* 1".a
Full-sort Groups: 10 Sort Method: quicksort Average Memory: 28kB Peak Memory: 28kB
Pre-sorted Groups: 10 Sort Method: quicksort Average Memory: 521kB Peak Memory: 521kB
-> Unique (cost=27794.85..28994.85 rows=120000 width=12) (actual time=157.882..220.501 rows=60000 loops=1)
-> Sort (cost=27794.85..28094.85 rows=120000 width=12) (actual time=157.881..187.232 rows=120000 loops=1)
Sort Key: "*SELECT* 1".a, "*SELECT* 1".b, "*SELECT* 1".c
Sort Method: external merge Disk: 2600kB
-> Gather (cost=0.00..1400.00 rows=120000 width=12) (actual time=0.197..22.705 rows=120000 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Append (cost=0.00..1400.00 rows=50000 width=12) (actual time=0.015..13.101 rows=40000 loops=3)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..575.00 rows=25000 width=12) (actual time=0.014..6.864 rows=30000 loops=2)
-> Parallel Seq Scan on t (cost=0.00..575.00 rows=25000 width=12) (actual time=0.014..3.708 rows=30000 loops=2)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..575.00 rows=25000 width=12) (actual time=0.010..6.918 rows=30000 loops=2)
-> Parallel Seq Scan on t t_1 (cost=0.00..575.00 rows=25000 width=12) (actual time=0.010..3.769 rows=30000 loops=2)
Planning Time: 0.137 ms
Execution Time: 239.958 ms
(19 rows)
Which shows your 1400 cost goal from union all, and the expected row
counts, for gather-atop-append.
The fact that (baserel.rows > path->subpath->rows) here seems like a
straight bug: there are no filters involved in this case but in the
presence of filters baserel->rows should be strictly (<=
path->subpath->rows), right?
David J.
"David G. Johnston" <david.g.johnston@gmail.com> writes:
The fact that (baserel.rows > path->subpath->rows) here seems like a
straight bug: there are no filters involved in this case but in the
presence of filters baserel->rows should be strictly (<=
path->subpath->rows), right?
No, because the subpath's rowcount has been derated to represent the
number of rows any one worker is expected to process. Since the
SubqueryScan is below the Gather, it has to represent that number
of rows too. Like I said, this design is pretty confusing; though
I do not have a better alternative to suggest.
Anyway, after chewing on this for awhile, it strikes me that the
sanest way to proceed is for cost_subqueryscan to compute its rows
estimate as the subpath's rows estimate times the selectivity of
the subqueryscan's quals (if any). That'd be the natural way to
proceed for most sorts of non-bottom-level paths to begin with.
I think there are a few reasons why cost_subqueryscan currently
does it the way it does:
* By analogy to other sorts of relation-scan nodes. But those don't
have any subpath that they could consult instead. This analogy is
really a bit faulty, since SubqueryScan isn't a relation scan node
in any ordinary meaning of that term.
* To ensure that all paths for the same relation have the same rowcount
estimate (modulo different parameterization or parallelism). But we'll
have that anyway because the only possible path type for an unflattened
RTE_SUBQUERY rel is a SubqueryScan, so there's no risk of different
cost_xxx functions arriving at slightly different estimates.
* To avoid redundant computation of the quals' selectivity.
This is slightly annoying; but in practice the quals will always
be RestrictInfos which will cache the per-qual selectivities,
so it's not going to cost us very much to perform that calculation
over again.
So perhaps we should do it more like the attached, which produces
this plan for the UNION case:
Unique (cost=28994.85..30194.85 rows=120000 width=12)
-> Sort (cost=28994.85..29294.85 rows=120000 width=12)
Sort Key: t.a, t.b, t.c
-> Gather (cost=0.00..2600.00 rows=120000 width=12)
Workers Planned: 2
-> Parallel Append (cost=0.00..2600.00 rows=50000 width=12)
-> Parallel Seq Scan on t (cost=0.00..575.00 rows=25000 width=12)
-> Parallel Seq Scan on t t_1 (cost=0.00..575.00 rows=25000 width=12)
It'd still be a good idea to fix the cost estimation to not charge
extra for SubqueryScans that will be elided later, because this
Parallel Append's cost should be 1400 not 2600. But as I showed
upthread, that won't affect the plan choice for this particular test
case.
regards, tom lane
#text/x-diff; name="change-cost_subqueryscan-rowcount-estimate-2.patch" [change-cost_subqueryscan-rowcount-estimate-2.patch] /home/postgres/zzz
I wrote:
So perhaps we should do it more like the attached, which produces
this plan for the UNION case:
sigh ... actually attached this time.
regards, tom lane
Attachments:
change-cost_subqueryscan-rowcount-estimate-2.patch (text/x-diff)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b787c6f81a..18749e842d 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1395,6 +1395,7 @@ cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root,
{
Cost startup_cost;
Cost run_cost;
+ List *qpquals;
QualCost qpqual_cost;
Cost cpu_per_tuple;
@@ -1402,11 +1403,24 @@ cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root,
Assert(baserel->relid > 0);
Assert(baserel->rtekind == RTE_SUBQUERY);
- /* Mark the path with the correct row estimate */
+ /*
+ * We compute the rowcount estimate as the subplan's estimate times the
+ * selectivity of relevant restriction clauses. In simple cases this will
+ * come out the same as baserel->rows; but when dealing with parallelized
+ * paths we must do it like this to get the right answer.
+ */
if (param_info)
- path->path.rows = param_info->ppi_rows;
+ qpquals = list_concat_copy(param_info->ppi_clauses,
+ baserel->baserestrictinfo);
else
- path->path.rows = baserel->rows;
+ qpquals = baserel->baserestrictinfo;
+
+ path->path.rows = clamp_row_est(path->subpath->rows *
+ clauselist_selectivity(root,
+ qpquals,
+ 0,
+ JOIN_INNER,
+ NULL));
/*
* Cost of path is cost of evaluating the subplan, plus cost of evaluating
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 21c429226f..0d8d77140a 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -1487,14 +1487,12 @@ explain (costs off) select * from t union select * from t order by 1,3;
-> Unique
-> Sort
Sort Key: t.a, t.b, t.c
- -> Append
- -> Gather
- Workers Planned: 2
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Append
-> Parallel Seq Scan on t
- -> Gather
- Workers Planned: 2
-> Parallel Seq Scan on t t_1
-(13 rows)
+(11 rows)
-- Full sort, not just incremental sort can be pushed below a gather merge path
-- by generate_useful_gather_paths.
On Fri, Apr 29, 2022 at 12:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
The fact that (baserel.rows > path->subpath->rows) here seems like a
straight bug: there are no filters involved in this case but in the
presence of filters baserel->rows should be strictly (<=
path->subpath->rows), right?
No, because the subpath's rowcount has been derated to represent the
number of rows any one worker is expected to process. Since the
SubqueryScan is below the Gather, it has to represent that number
of rows too. Like I said, this design is pretty confusing; though
I do not have a better alternative to suggest.
Anyway, after chewing on this for awhile, it strikes me that the
sanest way to proceed is for cost_subqueryscan to compute its rows
estimate as the subpath's rows estimate times the selectivity of
the subqueryscan's quals (if any).
This is what Robert was getting at, and what I followed up on.
The question I ended up at is why baserel->rows doesn't already produce the
value you now propose to calculate directly within cost_subquery:
set_baserel_size_estimates (multiplies rel->tuples - which is derated - by
selectivity, sets rel->rows)
set_subquery_size_estimates
rel->subroot = subquery_planner(...)
// my expectation is that
sub_final_rel->cheapest_total_path->rows is the derated number of rows;
// the fact you can reach the derated amount later by using
path->subpath->rows seems to affirm this.
sets rel->tuples from sub_final_rel->cheapest_total_path->rows)
set_subquery_pathlist (executes the sizing call stack above, then proceeds
to create_subqueryscan_path which in turn calls cost_subquery)
set_rel_size
...
* By analogy to other sorts of relation-scan nodes. But those don't
have any subpath that they could consult instead. This analogy is
really a bit faulty, since SubqueryScan isn't a relation scan node
in any ordinary meaning of that term.
I did observe this copy-and-paste dynamic; I take it this is why we cannot
or would not just change all of the baserel->rows usages to
path->subpath->rows.
So perhaps we should do it more like the attached, which produces
this plan for the UNION case:
The fact that this changes row counts in a costing function is bothersome - it
would be nice to go the other direction and remove the if block. More
generally, path->path.rows should be read-only by the time we get to
costing. But I'm not out to start a revolution here either. Still, it
feels like we are just papering over a bug in how baserel->rows is
computed, per my analysis above.
In short, the solution seems like it should, and in fact does here, fix the
observed problem. I'm fine with that.
David J.
On Fri, Apr 29, 2022 at 3:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
So perhaps we should do it more like the attached, which produces
this plan for the UNION case:
sigh ... actually attached this time.
I am not sure whether this is actually correct, but it seems a lot
more believable than the previous proposals. The problem might be more
general, though. I think when I developed this parallel query stuff I
modeled a lot of it on what you did for parameterized paths. Both
parameterized paths and parallelism can create situations where
executing a path to completion produces fewer rows than you would
otherwise get. In the case of parameterized paths, this happens
because we enforce the parameterization we've chosen on top of the
user-supplied quals. In the case of parallelism, it happens because
the rows are split up across the different workers. I think I intended
that the "rows" field of RelOptInfo should be the row count for the
relation in total, and that the "rows" field of the Path should be the
number of rows we expect to get for one execution of the path. But it
seems like this problem is good evidence that I didn't find all the
places that need to be adjusted for parallelism, and I wouldn't be
very surprised if there are a bunch of others that I overlooked.
It's not actually very nice that we end up having to call
clauselist_selectivity() here. We've already called
set_baserel_size_estimates() to figure out how many rows we expect to
have been filtered out by the quals, and it sucks to have to do it
again. Brainstorming wildly and maybe stupidly, I wonder if the whole
model is wrong here. Maybe a path shouldn't have a row count; instead,
maybe it should have a multiplier that it applies to the relation's
row count. Then, if X is parameterized in the same way as its subpath
Y, we can just copy the multiplier up, but now it will be applied to
the new rel's "rows" value, which will have already been adjusted
appropriately by set_baserel_size_estimates().
And having thrown out that wild and crazy idea, I will now run away
quickly and hide someplace.
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
I am not sure whether this is actually correct, but it seems a lot
more believable than the previous proposals. The problem might be more
general, though. I think when I developed this parallel query stuff I
modeled a lot of it on what you did for parameterized paths. Both
parameterized paths and parallelism can create situations where
executing a path to completion produces fewer rows than you would
otherwise get. In the case of parameterized paths, this happens
because we enforce the parameterization we've chosen on top of the
user-supplied quals. In the case of parallelism, it happens because
the rows are split up across the different workers. I think I intended
that the "rows" field of RelOptInfo should be the row count for the
relation in total, and that the "rows" field of the Path should be the
number of rows we expect to get for one execution of the path. But it
seems like this problem is good evidence that I didn't find all the
places that need to be adjusted for parallelism, and I wouldn't be
very surprised if there are a bunch of others that I overlooked.
I did look at the rest of costsize.c for similar instances, and didn't
find any. In any case, I think we have two options:
1. Apply this fix, and in future fix any other places that we identify
later.
2. Invent some entirely new scheme that we hope is less mistake-prone.
Option #2 is unlikely to lead to any near-term fix, and we certainly
wouldn't dare back-patch it.
It's not actually very nice that we end up having to call
clauselist_selectivity() here. We've already called
set_baserel_size_estimates() to figure out how many rows we expect to
have been filtered out by the quals, and it sucks to have to do it
again. Brainstorming wildly and maybe stupidly, I wonder if the whole
model is wrong here. Maybe a path shouldn't have a row count; instead,
maybe it should have a multiplier that it applies to the relation's
row count. Then, if X is parameterized in the same way as its subpath
Y, we can just copy the multiplier up, but now it will be applied to
the new rel's "rows" value, which will have already been adjusted
appropriately by set_baserel_size_estimates().
I've wondered about that too, but it seems to depend on the assumption
that clauses are estimated independently by clauselist_selectivity, which
has not been true for a long time (and is getting less true not more so).
So we could possibly apply something like this for parallelism, but not
for parameterized paths, and that makes it less appealing ... IMO anyway.
I have thought it might be good to explicitly mark partial paths with the
estimated number of workers, which would be effectively the same thing
as what you're talking about. But I wonder if we'd not still be better off
keeping the path rowcount as being number-of-rows-in-each-worker, and
just scale it up by the multiplier for EXPLAIN output. (And then also
print the true total number of rows in EXPLAIN ANALYZE.) If we do the
inverse of that, then we risk bugs from failing to correct the rowcount
during cost-estimation calculations.
I kinda feel that the bottom line here is that cost estimation is
hard, and we're not going to find a magic bullet that removes bugs.
regards, tom lane
On Mon, May 2, 2022 at 5:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I did look at the rest of costsize.c for similar instances, and didn't
find any. In any case, I think we have two options:
1. Apply this fix, and in future fix any other places that we identify
later.
2. Invent some entirely new scheme that we hope is less mistake-prone.
Option #2 is unlikely to lead to any near-term fix, and we certainly
wouldn't dare back-patch it.
Sure, although I think it's questionable whether we should back-patch
anyway, since there's no guarantee that every plan change anybody gets
will be a desirable one.
I've wondered about that too, but it seems to depend on the assumption
that clauses are estimated independently by clauselist_selectivity, which
has not been true for a long time (and is getting less true not more so).
So we could possibly apply something like this for parallelism, but not
for parameterized paths, and that makes it less appealing ... IMO anyway.
I agree. We'd have to correct for that somehow, and that might be awkward.
I have thought it might be good to explicitly mark partial paths with the
estimated number of workers, which would be effectively the same thing
as what you're talking about. But I wonder if we'd not still be better off
keeping the path rowcount as being number-of-rows-in-each-worker, and
just scale it up by the multiplier for EXPLAIN output. (And then also
print the true total number of rows in EXPLAIN ANALYZE.) If we do the
inverse of that, then we risk bugs from failing to correct the rowcount
during cost-estimation calculations.
That I don't like at all. I'm still of the opinion that it's a huge
mistake for EXPLAIN to print int(rowcount/loops) instead of just
rowcount. The division is never what I want and in my experience is
also not what other people want and often causes confusion. Both the
division and the rounding lose information about precisely what row
count was estimated, which makes it harder to figure out where in the
plan things went wrong. I am not at all keen on adding more ways for
what we print out to be different from the information actually stored
in the plan tree. I don't know for sure what we ought to be storing in
the plan tree, but I think whatever we store should also be what we
print. I think the fact that we've chosen to store something in the
plan tree is strong evidence that that exact value, and not some
quantity derived therefrom, is what's interesting.
I kinda feel that the bottom line here is that cost estimation is
hard, and we're not going to find a magic bullet that removes bugs.
Well that much is certainly true.
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
That I don't like at all. I'm still of the opinion that it's a huge
mistake for EXPLAIN to print int(rowcount/loops) instead of just
rowcount. The division is never what I want and in my experience is
also not what other people want and often causes confusion. Both the
division and the rounding lose information about precisely what row
count was estimated, which makes it harder to figure out where in the
plan things went wrong.
I'm inclined to look at it a bit differently: it was a mistake to
use the same "loops" notion for parallelism as for repeated node
execution. But I think we are saying the same thing in one respect,
namely it'd be better if what EXPLAIN shows for parallelism were totals
across all workers rather than per-worker numbers. (I'm unconvinced
about whether repeated node execution ought to work like that.)
> I am not at all keen on adding more ways for
> what we print out to be different from the information actually stored
> in the plan tree.
I think the cost estimation functions want to work with per-worker
rowcounts. We could scale that up to totals when we create the
finished plan tree, perhaps.
> I don't know for sure what we ought to be storing in
> the plan tree, but I think whatever we store should also be what we
> print. I think the fact that we've chosen to store something in the
> plan tree is strong evidence that that exact value, and not some
> quantity derived therefrom, is what's interesting.
The only reason we store any of this in the plan tree is for
EXPLAIN to print it out. On the other hand, I don't want the
planner expending any large number of cycles modifying the numbers
it works with before putting them in the plan tree, because most
of the time we're not doing EXPLAIN so it'd be wasted effort.
In any case, fundamental redesign of what EXPLAIN prints is a job
for v16 or later. Are you okay with the proposed patch as a v15 fix?
regards, tom lane
On Tue, May 3, 2022 at 2:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In any case, fundamental redesign of what EXPLAIN prints is a job
> for v16 or later. Are you okay with the proposed patch as a v15 fix?
Yes. I can't really vouch for it, but I don't object to it.
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, May 3, 2022 at 2:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In any case, fundamental redesign of what EXPLAIN prints is a job
>> for v16 or later. Are you okay with the proposed patch as a v15 fix?
> Yes. I can't really vouch for it, but I don't object to it.
I re-read the patch and noticed that I'd missed one additional change
needed:
- run_cost = cpu_per_tuple * baserel->tuples;
+ run_cost = cpu_per_tuple * path->subpath->rows;
That is, we should estimate the number of qual evaluations and
cpu_tuple_cost charges based on the subpath row count not the
non-parallel-aware relation size. So this reduces the cost of
the SubqueryScan itself, as well as its row count, in partial paths.
I don't see any further change in regression test results though.
Pushed with that change.
regards, tom lane