Incorrect cost for MergeAppend
Hello hackers,
While investigating some query plans, I noticed some code that seems
to be wrong: when create_merge_append_path() estimates the cost of
sorting an input, it calls cost_sort() passing subpath->parent->tuples
as the number of tuples. Shouldn't it use subpath->parent->rows or
even subpath->rows instead? The `tuples` variable doesn't account for
the filters on the relation, so this leads to incorrect cost estimates
when there are highly selective filters, and Sort + Append is chosen
instead of MergeAppend.
I don't have a good repro yet because the plans I've been studying
involve a third-party extension, but the code looks incorrect so I
thought I should ask.
Here is a link to this code on GitHub:
https://github.com/postgres/postgres/blob/6a1ea02c491d16474a6214603dce40b5b122d4d1/src/backend/optimizer/util/pathnode.c#L1469-L1477
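For reference, the call in question looks roughly like this (paraphrased from the linked revision; the first and last arguments are reconstructed from memory, so the exact context may differ slightly):

```
/* In create_merge_append_path(), costing an explicit sort of a subpath
 * that is not already ordered by the required pathkeys: */
cost_sort(&sort_path,
          root,
          pathkeys,
          subpath->total_cost,
          subpath->parent->tuples,  /* raw relation tuple count; ignores
                                     * the selectivity of any filters */
          subpath->pathtarget->width,
          0.0,
          work_mem,
          pathnode->limit_tuples);
```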
---
Alexander Kuzmenkov
Timescale
On Mon, Jan 29, 2024 at 6:11 PM Alexander Kuzmenkov
<akuzmenkov@timescale.com> wrote:
Hello hackers,
While investigating some query plans, I noticed some code that seems
to be wrong: when create_merge_append_path() estimates the cost of
sorting an input, it calls cost_sort() passing subpath->parent->tuples
as the number of tuples. Shouldn't it use subpath->parent->rows or
even subpath->rows instead? The `tuples` variable doesn't account for
the filters on the relation, so this leads to incorrect cost estimates
when there are highly selective filters, and Sort + Append is chosen
instead of MergeAppend.
All other callers of cost_sort() except plan_cluster_use_sort() are
using rows instead of tuples. Even plan_cluster_use_sort() has
rel->rows = rel->tuples, so it's actually passing rows. So I agree with
your suggestion. However, a test would be good since this code is quite
old.
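For comparison, a typical caller such as create_sort_path() uses the subpath's filtered row estimate, roughly like this (paraphrased from memory, so the exact context may differ):

```
/* In create_sort_path(): the row estimate, not the raw relation
 * tuple count, drives the sort cost. */
cost_sort(&pathnode->path, root, pathkeys,
          subpath->total_cost,
          subpath->rows,
          subpath->pathtarget->width,
          0.0,
          work_mem,
          limit_tuples);
```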
--
Best Wishes,
Ashutosh Bapat
Here is a small patch that reproduces the problem on two tables with
inheritance, and fixes it. I'll add it to the Commitfest.
On Tue, Jan 30, 2024 at 8:20 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
On Mon, Jan 29, 2024 at 6:11 PM Alexander Kuzmenkov
<akuzmenkov@timescale.com> wrote:
Hello hackers,
While investigating some query plans, I noticed some code that seems
to be wrong: when create_merge_append_path() estimates the cost of
sorting an input, it calls cost_sort() passing subpath->parent->tuples
as the number of tuples. Shouldn't it use subpath->parent->rows or
even subpath->rows instead? The `tuples` variable doesn't account for
the filters on the relation, so this leads to incorrect cost estimates
when there are highly selective filters, and Sort + Append is chosen
instead of MergeAppend.
All other callers of cost_sort() except plan_cluster_use_sort() are
using rows instead of tuples. Even plan_cluster_use_sort() has
rel->rows = rel->tuples, so it's actually passing rows. So I agree with
your suggestion. However, a test would be good since this code is quite
old.
--
Best Wishes,
Ashutosh Bapat
Attachments:
merge-append-cost-v1.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8dbf790e89..2e1ec41a54 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1470,7 +1470,7 @@ create_merge_append_path(PlannerInfo *root,
root,
pathkeys,
subpath->total_cost,
- subpath->parent->tuples,
+ subpath->rows,
subpath->pathtarget->width,
0.0,
work_mem,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 56a40d50f9..22e9ca0b10 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1716,6 +1716,28 @@ order by t1.b limit 10;
reset enable_nestloop;
drop table matest0 cascade;
NOTICE: drop cascades to table matest1
+-- Check that merge append is chosen in presence of filters.
+create table ma0(a int primary key);
+create table ma1() inherits (ma0);
+insert into ma0 select generate_series(1, 10000);
+insert into ma1 select generate_series(1, 10000);
+analyze ma0;
+analyze ma1;
+explain (costs off) select * from ma0 where a < 1000 order by a;
+ QUERY PLAN
+---------------------------------------------------
+ Merge Append
+ Sort Key: ma0.a
+ -> Index Only Scan using ma0_pkey on ma0 ma0_1
+ Index Cond: (a < 1000)
+ -> Sort
+ Sort Key: ma0_2.a
+ -> Seq Scan on ma1 ma0_2
+ Filter: (a < 1000)
+(8 rows)
+
+drop table ma0 cascade;
+NOTICE: drop cascades to table ma1
--
-- Test merge-append for UNION ALL append relations
--
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 971aac54fd..811da462b7 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -624,6 +624,19 @@ reset enable_nestloop;
drop table matest0 cascade;
+-- Check that merge append is chosen in presence of filters.
+create table ma0(a int primary key);
+create table ma1() inherits (ma0);
+insert into ma0 select generate_series(1, 10000);
+insert into ma1 select generate_series(1, 10000);
+analyze ma0;
+analyze ma1;
+
+explain (costs off) select * from ma0 where a < 1000 order by a;
+
+drop table ma0 cascade;
+
+
--
-- Test merge-append for UNION ALL append relations
--
Hi,
Here is a small patch that reproduces the problem on two tables with
inheritance, and fixes it. I'll add it to the Commitfest.
Thanks for the patch.
I can confirm that it changes the plan from Sort + Append to MergeAppend.
Before:
```
explain (costs off) select * from ma0 where a < 1000 order by a;
QUERY PLAN
---------------------------------------------------------
Sort
Sort Key: ma0.a
-> Append
-> Index Only Scan using ma0_pkey on ma0 ma0_1
Index Cond: (a < 1000)
-> Seq Scan on ma1 ma0_2
Filter: (a < 1000)
```
After:
```
=# explain (costs off) select * from ma0 where a < 1000 order by a;
QUERY PLAN
---------------------------------------------------
Merge Append
Sort Key: ma0.a
-> Index Only Scan using ma0_pkey on ma0 ma0_1
Index Cond: (a < 1000)
-> Sort
Sort Key: ma0_2.a
-> Seq Scan on ma1 ma0_2
Filter: (a < 1000)
```
The rest of the tests pass.
--
Best regards,
Aleksander Alekseev
On Wed, 31 Jan 2024 at 00:06, Alexander Kuzmenkov
<akuzmenkov@timescale.com> wrote:
Here is a small patch that reproduces the problem on two tables with
inheritance, and fixes it. I'll add it to the Commitfest.
Good catch.
It seems to have been broken since MergeAppends were added in 11cad29c9.
The code fix looks good to me.
The tests look a bit heavy. I wondered if you could have used a UNION
ALL query with the tenk1 table so you didn't have to create tables.
However, I see that case works ok since the parent->tuples is set in
set_subquery_size_estimates() from "rel->tuples =
sub_final_rel->cheapest_total_path->rows;".
I played around with the following and can see your patch changes the
plan from an Append -> Sort plan to a Merge Append.
create table tenk_parent (like tenk1);
alter table tenk1 inherit tenk_parent;
alter table tenk2 inherit tenk_parent;
explain (costs off) select * from tenk_parent where thousand = 0 order
by tenthous;
alter table tenk1 no inherit tenk_parent;
alter table tenk2 no inherit tenk_parent;
I'm just not sure if we should be messing with tenk1 and tenk2 like that.
You should likely focus on trying to find a test that does not require
making 2 tables with 10k rows.
David
On Tue, Jan 30, 2024 at 1:20 PM David Rowley <dgrowleyml@gmail.com> wrote:
You should likely focus on trying to find a test that does not require
making 2 tables with 10k rows.
Is 1k smallint OK? It should fit in one page. Still reproduces the
error, and the entire test case runs in under 10 ms.
Attachments:
merge-append-cost-v2.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8dbf790e89..2e1ec41a54 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1470,7 +1470,7 @@ create_merge_append_path(PlannerInfo *root,
root,
pathkeys,
subpath->total_cost,
- subpath->parent->tuples,
+ subpath->rows,
subpath->pathtarget->width,
0.0,
work_mem,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 56a40d50f9..7804333399 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1716,6 +1716,28 @@ order by t1.b limit 10;
reset enable_nestloop;
drop table matest0 cascade;
NOTICE: drop cascades to table matest1
+-- Check that merge append is chosen in presence of filters.
+create table ma0(a smallint primary key);
+create table ma1() inherits (ma0);
+insert into ma0 select generate_series(1, 1000);
+insert into ma1 select generate_series(1, 1000);
+analyze ma0;
+analyze ma1;
+explain (costs off) select * from ma0 where a < 100 order by a;
+ QUERY PLAN
+---------------------------------------------------
+ Merge Append
+ Sort Key: ma0.a
+ -> Index Only Scan using ma0_pkey on ma0 ma0_1
+ Index Cond: (a < 100)
+ -> Sort
+ Sort Key: ma0_2.a
+ -> Seq Scan on ma1 ma0_2
+ Filter: (a < 100)
+(8 rows)
+
+drop table ma0 cascade;
+NOTICE: drop cascades to table ma1
--
-- Test merge-append for UNION ALL append relations
--
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 971aac54fd..dae8cc7dbe 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -624,6 +624,19 @@ reset enable_nestloop;
drop table matest0 cascade;
+-- Check that merge append is chosen in presence of filters.
+create table ma0(a smallint primary key);
+create table ma1() inherits (ma0);
+insert into ma0 select generate_series(1, 1000);
+insert into ma1 select generate_series(1, 1000);
+analyze ma0;
+analyze ma1;
+
+explain (costs off) select * from ma0 where a < 100 order by a;
+
+drop table ma0 cascade;
+
+
--
-- Test merge-append for UNION ALL append relations
--
On Wed, 31 Jan 2024 at 02:23, Alexander Kuzmenkov
<akuzmenkov@timescale.com> wrote:
On Tue, Jan 30, 2024 at 1:20 PM David Rowley <dgrowleyml@gmail.com> wrote:
You should likely focus on trying to find a test that does not require
making 2 tables with 10k rows.
Is 1k smallint OK? It should fit in one page. Still reproduces the
error, and the entire test case runs in under 10 ms.
I had a go at making it a bit smaller without going dangerously close
to where the plan might change. The following seems to work.
create table ma0(a int primary key);
create table ma1() inherits (ma0);
insert into ma0 select generate_series(1, 400);
insert into ma1 select generate_series(1, 200);
analyze ma0;
analyze ma1;
explain (costs off) select * from ma0 where a < 100 order by a;
drop table ma0 cascade;
As for backpatching this: it does risk plans changing in stable
versions of PostgreSQL, which we normally try to avoid. However, it
seems quite similar to 1e731ed12a, albeit far more long-standing.
I'm currently thinking we should backpatch this back to 12.
Does anyone disagree?
David
On Wed, Jan 31, 2024 at 4:33 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 31 Jan 2024 at 02:23, Alexander Kuzmenkov
<akuzmenkov@timescale.com> wrote:
On Tue, Jan 30, 2024 at 1:20 PM David Rowley <dgrowleyml@gmail.com> wrote:
You should likely focus on trying to find a test that does not require
making 2 tables with 10k rows.
Is 1k smallint OK? It should fit in one page. Still reproduces the
error, and the entire test case runs in under 10 ms.
I had a go at making it a bit smaller without going dangerously close
to where the plan might change. The following seems to work.
create table ma0(a int primary key);
create table ma1() inherits (ma0);
insert into ma0 select generate_series(1, 400);
insert into ma1 select generate_series(1, 200);
analyze ma0;
analyze ma1;
explain (costs off) select * from ma0 where a < 100 order by a;
drop table ma0 cascade;
On my laptop with all default settings
with patch
postgres@231714=#explain select * from ma0 where a < 100 order by a;
QUERY PLAN
--------------------------------------------------------------------------------------
Merge Append (cost=6.94..18.90 rows=198 width=4)
Sort Key: ma0.a
-> Index Only Scan using ma0_pkey on ma0 ma0_1 (cost=0.15..9.88
rows=99 width=4)
Index Cond: (a < 100)
-> Sort (cost=6.78..7.03 rows=99 width=4)
Sort Key: ma0_2.a
-> Seq Scan on ma1 ma0_2 (cost=0.00..3.50 rows=99 width=4)
Filter: (a < 100)
(8 rows)
without patch
#explain select * from ma0 where a < 100 order by a;
QUERY PLAN
----------------------------------------------------------------------
Sort (cost=19.04..19.54 rows=198 width=4)
Sort Key: ma0.a
-> Append (cost=0.00..11.49 rows=198 width=4)
-> Seq Scan on ma0 ma0_1 (cost=0.00..7.00 rows=99 width=4)
Filter: (a < 100)
-> Seq Scan on ma1 ma0_2 (cost=0.00..3.50 rows=99 width=4)
Filter: (a < 100)
(7 rows)
postgres@233864=#select (19.54 - 18.90)/19.54, (19.04 - 18.09)/19.04;
?column? | ?column?
------------------------+------------------------
0.03275332650972364381 | 0.04989495798319327731
(1 row)
Those numbers are higher than 1% (#define STD_FUZZ_FACTOR 1.01), but
a slight variation in the GUCs that affect cost might bring the
difference closer to STD_FUZZ_FACTOR.
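To illustrate what that fuzz factor means in practice, here is a simplified sketch of the idea; it is not the actual compare_path_costs_fuzzily() code, just an illustration of how a roughly 1% margin decides which path add_path() keeps:

```
#include <stdbool.h>

#define STD_FUZZ_FACTOR 1.01    /* same value as the define cited above */

/*
 * Illustrative only: one cost counts as fuzzily greater than another
 * only if it exceeds the other by more than the fuzz factor, so cost
 * differences of a few percent are enough to change the chosen plan.
 */
bool
cost_fuzzily_greater(double cost1, double cost2)
{
    return cost1 > cost2 * STD_FUZZ_FACTOR;
}
```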
Given how close they are, maybe it's not such a good idea to
backpatch. If the plans change for the better, users won't complain, but
otherwise they will complain and will have no way to go back to their
good plans in production.
--
Best Wishes,
Ashutosh Bapat
On Wed, 31 Jan 2024 at 18:58, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
with patch
Merge Append (cost=6.94..18.90 rows=198 width=4)
without patch
Sort (cost=19.04..19.54 rows=198 width=4)
Those numbers are higher than 1% (#define STD_FUZZ_FACTOR 1.01), but
a slight variation in the GUCs that affect cost might bring the
difference closer to STD_FUZZ_FACTOR.
Given how close they are, maybe it's not such a good idea to
backpatch.
The reason those numbers are close is that I reduced the row count
on the test tables to a point where we only just get the Merge Append
plan with a small margin. I don't see the test case costs as a
relevant factor in whether we backpatch or not.
What is relevant are things like:
For:
* It's a clear bug and what's happening now is clearly wrong.
* inheritance/partitioned table plan changes for the better in minor versions
Against:
* Nobody has complained for 13 years, so maybe it's unlikely anyone is
suffering too much.
* Possibility of inheritance/partitioned table plans changing for the
worse in minor versions
For now, I'm on the fence on this one.
David
On Wed, Jan 31, 2024 at 12:12 PM David Rowley <dgrowleyml@gmail.com> wrote:
What is relevant are things like:
For:
* It's a clear bug and what's happening now is clearly wrong.
* inheritance/partitioned table plan changes for the better in minor versions
Against:
* Nobody has complained for 13 years, so maybe it's unlikely anyone is
suffering too much.
* Possibility of inheritance/partitioned table plans changing for the
worse in minor versions
That's what I am thinking as well. And the plans that may change for
the worse are the ones where the costs with and without the patch are
close.
Just to be clear, the change is a good one and should be committed to
master. It's the backpatching I am worried about.
--
Best Wishes,
Ashutosh Bapat
I'd be happy to see this backpatched. What kind of regressions are we
worried about? I'd say partition-wise sort + merge should be faster
than append + sort for reasonably sized tables. That's basically what
tuplesort does inside. Moreover, this can enable index scans on
partitions, which is an even better plan.
On Wed, Jan 31, 2024 at 7:46 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
On Wed, Jan 31, 2024 at 12:12 PM David Rowley <dgrowleyml@gmail.com> wrote:
What is relevant are things like:
For:
* It's a clear bug and what's happening now is clearly wrong.
* inheritance/partitioned table plan changes for the better in minor versions
Against:
* Nobody has complained for 13 years, so maybe it's unlikely anyone is
suffering too much.
* Possibility of inheritance/partitioned table plans changing for the
worse in minor versions
That's what I am thinking as well. And the plans that may change for
the worse are the ones where the costs with and without the patch are
close.
Just to be clear, the change is a good one and should be committed to
master. It's the backpatching I am worried about.
--
Best Wishes,
Ashutosh Bapat
On Wed, Jan 31, 2024 at 1:33 PM Alexander Kuzmenkov
<akuzmenkov@timescale.com> wrote:
I'd be happy to see this backpatched. What kind of regressions are we
worried about? I'd say partition-wise sort + merge should be faster
than append + sort for reasonably sized tables. That's basically what
tuplesort does inside. Moreover, this can enable index scans on
partitions, which is an even better plan.
To put it another way, this change enables our usual cost model for
MergeAppend to work correctly in the presence of filters. I think we
can consider this model to be reasonably correct, and we don't
currently have major problems with MergeAppend being chosen instead of
Sort + Append in cases where it's suboptimal, right? So applying it
properly in cases with filters is not likely to introduce problems.
On 2024-Jan-31, Alexander Kuzmenkov wrote:
To put it another way, this change enables our usual cost model for
MergeAppend to work correctly in the presence of filters. I think we
can consider this model to be reasonably correct, and we don't
currently have major problems with MergeAppend being chosen instead of
Sort + Append in cases where it's suboptimal, right? So applying it
properly in cases with filters is not likely to introduce problems.
Since we have a minor coming up very soon, I think it's not a good idea
to backpatch right now. Maybe you can push to master now, and consider
whether to backpatch later.
The problem is -- if somebody has an application that gets good plans
with the current cost model, and you change the cost model and the plans
become worse, what do they do? If you change this in a major release,
this is not an issue because they must test their queries before
upgrading and if they fail to realize a problem exists then it's their
fault. If you change it in a minor release, then those people will be
very upset that things were changed suddenly, and they may get wary of
future minor upgrades, which we don't want.
Plus, they would need to do careful testing before doing the minor
upgrade.
Maybe plans can only go from bad to good and never from good to bad.
But are we 100% certain that that is the case?
People who are **desperate** to get this improvement can perhaps run a
patched version in the meantime.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Hi,
Since we have a minor coming up very soon, I think it's not a good idea
to backpatch right now. Maybe you can push to master now, and consider
whether to backpatch later.
The problem is -- if somebody has an application that gets good plans
with the current cost model, and you change the cost model and the plans
become worse, what do they do? If you change this in a major release,
this is not an issue because they must test their queries before
upgrading and if they fail to realize a problem exists then it's their
fault. If you change it in a minor release, then those people will be
very upset that things were changed suddenly, and they may get wary of
future minor upgrades, which we don't want.
I agree with this, especially as we tell our customers that such changes do not happen from one minor release to another.
Regards
Daniel
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Since we have a minor coming up very soon, I think it's not a good idea
to backpatch right now. Maybe you can push to master now, and consider
whether to backpatch later.
As a rule, we don't back-patch changes like this ever. People don't
appreciate plans changing in minor releases.
regards, tom lane
On Thu, 1 Feb 2024 at 04:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Since we have a minor coming up very soon, I think it's not a good idea
to backpatch right now. Maybe you can push to master now, and consider
whether to backpatch later.
As a rule, we don't back-patch changes like this ever. People don't
appreciate plans changing in minor releases.
Pushed to master.
Thanks for the report and the fix, Alexander.