MergeAppend could consider sorting cheapest child path

Started by Alexander Pyhalov over 1 year ago, 44 messages

a.pyhalov@postgrespro.ru

over 1 year ago

1 attachment(s)

Hi.

Currently, when the planner finds suitable pathkeys in
generate_orderedappend_paths(), it uses them, even if an explicit sort of
the cheapest child path could be more efficient.

We encountered this issue on a partitioned table with two indexes, where
one is suitable for sorting and the other is good for selecting data.
MergeAppend was generated with subpaths doing an index scan on the
suitably ordered index and filtering out a lot of data.
The suggested fix allows MergeAppend to consider sorting the cheapest
childrel total path as an alternative.
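
As a rough illustration of the per-child comparison described above, here is
a standalone toy model. It is not the planner's code: ToyPath, toy_log2(),
toy_sort_total_cost(), toy_choose_child() and the comparison-cost constant
are all invented for this sketch, and the sort cost formula is only a crude
N * log2(N) approximation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a planner path; NOT PostgreSQL's Path struct. */
typedef struct ToyPath
{
	double		total_cost;		/* estimated cost to return all rows */
	double		rows;			/* estimated number of rows returned */
} ToyPath;

/* Assumed cost of one comparison; an arbitrary value for illustration. */
#define TOY_COMPARISON_COST 0.005

/* Number of halvings needed to bring n down to 1 (ceil(log2(n)) for n > 1). */
static double
toy_log2(double n)
{
	double		result = 0.0;

	while (n > 1.0)
	{
		n /= 2.0;
		result += 1.0;
	}
	return result;
}

/* Toy sort cost: input cost plus roughly N * log2(N) comparisons. */
static double
toy_sort_total_cost(const ToyPath *input)
{
	assert(input != NULL && input->rows >= 0.0);
	return input->total_cost +
		TOY_COMPARISON_COST * input->rows * toy_log2(input->rows);
}

/*
 * Pick the child input for a MergeAppend-like node.  "ordered" may be NULL
 * when no pre-sorted path exists.  Sets *needs_sort when the chosen path
 * has to be sorted explicitly before merging.
 */
static const ToyPath *
toy_choose_child(const ToyPath *ordered, const ToyPath *cheapest,
				 bool *needs_sort)
{
	if (ordered == NULL ||
		toy_sort_total_cost(cheapest) < ordered->total_cost)
	{
		*needs_sort = true;
		return cheapest;
	}
	*needs_sort = false;
	return ordered;
}
```

With made-up numbers, an ordered scan costing 400 loses to a selective scan
costing 9 plus a 25-row sort, while an ordered scan costing 9.5 still wins,
so the decision is made independently for each child.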

--
Best regards,
Alexander Pyhalov,
Postgres Professional

Attachments:

v1-0001-MergeAppend-could-consider-using-sorted-best-path.patch (text/x-diff)

From d5eb3d326d83e2ca47c17552fcc6fd27b6b98d4e Mon Sep 17 00:00:00 2001
From: Alexander Pyhalov <a.pyhalov@postgrespro.ru>
Date: Tue, 18 Jun 2024 15:56:18 +0300
Subject: [PATCH] MergeAppend could consider using sorted best path.

This helps when index with suitable pathkeys is not
good for filtering data.
---
 .../postgres_fdw/expected/postgres_fdw.out    |  6 +-
 src/backend/optimizer/path/allpaths.c         | 21 +++++
 src/test/regress/expected/inherit.out         | 45 +++++++++-
 src/test/regress/expected/partition_join.out  | 87 +++++++++++--------
 src/test/regress/expected/union.out           |  6 +-
 src/test/regress/sql/inherit.sql              | 17 ++++
 6 files changed, 141 insertions(+), 41 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index ea566d50341..687591e4a97 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10074,13 +10074,15 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
    ->  Nested Loop
          Join Filter: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t1_1.a
+                     ->  Foreign Scan on ftprt1_p1 t1_1
                ->  Foreign Scan on ftprt1_p2 t1_2
          ->  Materialize
                ->  Append
                      ->  Foreign Scan on ftprt2_p1 t2_1
                      ->  Foreign Scan on ftprt2_p2 t2_2
-(10 rows)
+(12 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
   a  |  b  
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 4895cee9944..827bc469269 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1845,6 +1845,27 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 				/* Assert we do have an unparameterized path for this child */
 				Assert(cheapest_total->param_info == NULL);
 			}
+			else
+			{
+				/*
+				 * Even if we found necessary pathkeys, using unsorted path
+				 * can be more efficient.
+				 */
+				Path		sort_path;
+
+				cost_sort(&sort_path,
+						  root,
+						  pathkeys,
+						  childrel->cheapest_total_path->total_cost,
+						  childrel->cheapest_total_path->rows,
+						  childrel->cheapest_total_path->pathtarget->width,
+						  0.0,
+						  work_mem,
+						  -1.0 /* need all tuples to sort them */ );
+
+				if (compare_path_costs(&sort_path, cheapest_total, TOTAL_COST) < 0)
+					cheapest_total = childrel->cheapest_total_path;
+			}
 
 			/*
 			 * When building a fractional path, determine a cheapest
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index ad732134148..16e78c8d2ff 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1555,6 +1555,7 @@ insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
 set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_sort = off; -- avoid sorting below MergeAppend
 explain (verbose, costs off) select * from matest0 order by 1-id;
                          QUERY PLAN                         
 ------------------------------------------------------------
@@ -1608,6 +1609,7 @@ select min(1-id) from matest0;
 (1 row)
 
 reset enable_indexscan;
+reset enable_sort;
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
 explain (verbose, costs off) select * from matest0 order by 1-id;
@@ -1702,7 +1704,9 @@ order by t1.b limit 10;
          Merge Cond: (t1.b = t2.b)
          ->  Merge Append
                Sort Key: t1.b
-               ->  Index Scan using matest0i on matest0 t1_1
+               ->  Sort
+                     Sort Key: t1_1.b
+                     ->  Seq Scan on matest0 t1_1
                ->  Index Scan using matest1i on matest1 t1_2
          ->  Materialize
                ->  Merge Append
@@ -1711,7 +1715,7 @@ order by t1.b limit 10;
                            Filter: (c = d)
                      ->  Index Scan using matest1i on matest1 t2_2
                            Filter: (c = d)
-(14 rows)
+(16 rows)
 
 reset enable_nestloop;
 drop table matest0 cascade;
@@ -2663,6 +2667,43 @@ explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
 
 reset enable_bitmapscan;
 drop table mcrparted;
+-- Check that sort path can be used by MergeAppend even when there are suitable pathkeys
+create table hash_parted (i int, j int, k int) partition by hash(i);
+create table hash_parted_1 partition of hash_parted for values with (modulus 4, remainder 0);
+create table hash_parted_2 partition of hash_parted for values with (modulus 4, remainder 1);
+create table hash_parted_3 partition of hash_parted for values with (modulus 4, remainder 2);
+create table hash_parted_4 partition of hash_parted for values with (modulus 4, remainder 3);
+--create table hash_parted_5 partition of hash_parted for values with (modulus 6, remainder 4);
+--create table hash_parted_6 partition of hash_parted for values with (modulus 6, remainder 5);
+create index on hash_parted(j);
+create index on hash_parted(k);
+insert into hash_parted select i, i, i from generate_series(1,10000) i;
+analyze hash_parted;
+explain (costs off) select * from hash_parted where k<100 order by j limit 100;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Limit
+   ->  Merge Append
+         Sort Key: hash_parted.j
+         ->  Sort
+               Sort Key: hash_parted_1.j
+               ->  Index Scan using hash_parted_1_k_idx on hash_parted_1
+                     Index Cond: (k < 100)
+         ->  Sort
+               Sort Key: hash_parted_2.j
+               ->  Index Scan using hash_parted_2_k_idx on hash_parted_2
+                     Index Cond: (k < 100)
+         ->  Sort
+               Sort Key: hash_parted_3.j
+               ->  Index Scan using hash_parted_3_k_idx on hash_parted_3
+                     Index Cond: (k < 100)
+         ->  Sort
+               Sort Key: hash_parted_4.j
+               ->  Index Scan using hash_parted_4_k_idx on hash_parted_4
+                     Index Cond: (k < 100)
+(19 rows)
+
+drop table hash_parted;
 -- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
 create table bool_lp (b bool) partition by list(b);
 create table bool_lp_true partition of bool_lp for values in(true);
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 6d07f86b9bc..80d480d33d5 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -1309,28 +1309,32 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
 -- This should generate a partitionwise join, but currently fails to
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-                        QUERY PLAN                         
------------------------------------------------------------
- Incremental Sort
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Sort
    Sort Key: prt1.a, prt2.b
-   Presorted Key: prt1.a
-   ->  Merge Left Join
-         Merge Cond: (prt1.a = prt2.b)
-         ->  Sort
-               Sort Key: prt1.a
-               ->  Append
-                     ->  Seq Scan on prt1_p1 prt1_1
-                           Filter: ((a < 450) AND (b = 0))
-                     ->  Seq Scan on prt1_p2 prt1_2
-                           Filter: ((a < 450) AND (b = 0))
-         ->  Sort
-               Sort Key: prt2.b
-               ->  Append
+   ->  Merge Right Join
+         Merge Cond: (prt2.b = prt1.a)
+         ->  Append
+               ->  Sort
+                     Sort Key: prt2_1.b
                      ->  Seq Scan on prt2_p2 prt2_1
                            Filter: (b > 250)
+               ->  Sort
+                     Sort Key: prt2_2.b
                      ->  Seq Scan on prt2_p3 prt2_2
                            Filter: (b > 250)
-(19 rows)
+         ->  Materialize
+               ->  Append
+                     ->  Sort
+                           Sort Key: prt1_1.a
+                           ->  Seq Scan on prt1_p1 prt1_1
+                                 Filter: ((a < 450) AND (b = 0))
+                     ->  Sort
+                           Sort Key: prt1_2.a
+                           ->  Seq Scan on prt1_p2 prt1_2
+                                 Filter: ((a < 450) AND (b = 0))
+(23 rows)
 
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
   a  |  b  
@@ -1350,25 +1354,33 @@ SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT *
 -- partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
-                                       QUERY PLAN                                        
------------------------------------------------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Merge Join
-   Merge Cond: ((t1.a = t2.b) AND (((((t1.*)::prt1))::text) = ((((t2.*)::prt2))::text)))
-   ->  Sort
-         Sort Key: t1.a, ((((t1.*)::prt1))::text)
-         ->  Result
-               ->  Append
-                     ->  Seq Scan on prt1_p1 t1_1
-                     ->  Seq Scan on prt1_p2 t1_2
-                     ->  Seq Scan on prt1_p3 t1_3
-   ->  Sort
-         Sort Key: t2.b, ((((t2.*)::prt2))::text)
-         ->  Result
-               ->  Append
+   Merge Cond: (t1.a = t2.b)
+   Join Filter: ((((t2.*)::prt2))::text = (((t1.*)::prt1))::text)
+   ->  Append
+         ->  Sort
+               Sort Key: t1_1.a
+               ->  Seq Scan on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t1_2.a
+               ->  Seq Scan on prt1_p2 t1_2
+         ->  Sort
+               Sort Key: t1_3.a
+               ->  Seq Scan on prt1_p3 t1_3
+   ->  Materialize
+         ->  Append
+               ->  Sort
+                     Sort Key: t2_1.b
                      ->  Seq Scan on prt2_p1 t2_1
+               ->  Sort
+                     Sort Key: t2_2.b
                      ->  Seq Scan on prt2_p2 t2_2
+               ->  Sort
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3
-(16 rows)
+(24 rows)
 
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
  a  | b  
@@ -4906,21 +4918,26 @@ EXPLAIN (COSTS OFF)
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
                              QUERY PLAN                             
 --------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a, t1.b
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a, t1_1.b
          ->  Hash Join
                Hash Cond: ((t1_1.a = t2_1.a) AND (t1_1.b = t2_1.b))
                ->  Seq Scan on alpha_neg_p1 t1_1
                      Filter: ((b >= 125) AND (b < 225))
                ->  Hash
                      ->  Seq Scan on beta_neg_p1 t2_1
+   ->  Sort
+         Sort Key: t1_2.a, t1_2.b
          ->  Hash Join
                Hash Cond: ((t2_2.a = t1_2.a) AND (t2_2.b = t1_2.b))
                ->  Seq Scan on beta_neg_p2 t2_2
                ->  Hash
                      ->  Seq Scan on alpha_neg_p2 t1_2
                            Filter: ((b >= 125) AND (b < 225))
+   ->  Sort
+         Sort Key: t1_4.a, t1_4.b
          ->  Hash Join
                Hash Cond: ((t2_4.a = t1_4.a) AND (t2_4.b = t1_4.b))
                ->  Append
@@ -4935,7 +4952,7 @@ SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2
                                  Filter: ((b >= 125) AND (b < 225))
                            ->  Seq Scan on alpha_pos_p3 t1_6
                                  Filter: ((b >= 125) AND (b < 225))
-(29 rows)
+(34 rows)
 
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
  a  |  b  |  c   | a  |  b  |  c   
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 0fd0e1c38b3..4c1c173d8e6 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1195,12 +1195,14 @@ select event_id
 ----------------------------------------------------------
  Merge Append
    Sort Key: events.event_id
-   ->  Index Scan using events_pkey on events
+   ->  Sort
+         Sort Key: events.event_id
+         ->  Seq Scan on events
    ->  Sort
          Sort Key: events_1.event_id
          ->  Seq Scan on events_child events_1
    ->  Index Scan using other_events_pkey on other_events
-(7 rows)
+(9 rows)
 
 drop table events_child, events, other_events;
 reset enable_indexonlyscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index e3bcfdb181e..5331e49283f 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -586,11 +586,13 @@ insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
 
 set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_sort = off; -- avoid sorting below MergeAppend
 explain (verbose, costs off) select * from matest0 order by 1-id;
 select * from matest0 order by 1-id;
 explain (verbose, costs off) select min(1-id) from matest0;
 select min(1-id) from matest0;
 reset enable_indexscan;
+reset enable_sort;
 
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
@@ -976,6 +978,21 @@ reset enable_bitmapscan;
 
 drop table mcrparted;
 
+-- Check that sort path can be used by MergeAppend even when there are suitable pathkeys
+create table hash_parted (i int, j int, k int) partition by hash(i);
+create table hash_parted_1 partition of hash_parted for values with (modulus 4, remainder 0);
+create table hash_parted_2 partition of hash_parted for values with (modulus 4, remainder 1);
+create table hash_parted_3 partition of hash_parted for values with (modulus 4, remainder 2);
+create table hash_parted_4 partition of hash_parted for values with (modulus 4, remainder 3);
+--create table hash_parted_5 partition of hash_parted for values with (modulus 6, remainder 4);
+--create table hash_parted_6 partition of hash_parted for values with (modulus 6, remainder 5);
+create index on hash_parted(j);
+create index on hash_parted(k);
+insert into hash_parted select i, i, i from generate_series(1,10000) i;
+analyze hash_parted;
+explain (costs off) select * from hash_parted where k<100 order by j limit 100;
+drop table hash_parted;
+
 -- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
 create table bool_lp (b bool) partition by list(b);
 create table bool_lp_true partition of bool_lp for values in(true);
-- 
2.34.1

Bruce Momjian

bruce@momjian.us

about 1 year ago

In reply to: Alexander Pyhalov (#1)

Re: MergeAppend could consider sorting cheapest child path

Is this still being considered?

---------------------------------------------------------------------------

On Tue, Jun 18, 2024 at 07:45:09PM +0300, Alexander Pyhalov wrote:

> Hi.
>
> Now when planner finds suitable pathkeys in generate_orderedappend_paths(),
> it uses them, even if explicit sort of the cheapest child path could be more
> efficient.
>
> We encountered this issue on partitioned table with two indexes, where one
> is suitable for sorting, and another is good for selecting data. MergeAppend
> was generated
> with subpaths doing index scan on suitably ordered index and filtering a lot
> of data.
> The suggested fix allows MergeAppend to consider sorting on cheapest
> childrel total path as an alternative.
>
> --
> Best regards,
> Alexander Pyhalov,
> Postgres Professional

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"

Andy Fan

zhihuifan1213@163.com

about 1 year ago

In reply to: Bruce Momjian (#2)

Re: MergeAppend could consider sorting cheapest child path

Bruce Momjian <bruce@momjian.us> writes:

Is this still being considered?

I'd +1 on this feature. I guess this would be more useful in the parallel
case, where the Sort can be pushed down to the parallel workers, and in the
distributed database case, where the Sort can be pushed down to multiple
nodes; as a result, the leader only has to do the merge work.

At a high level, sorting only the *cheapest* child path does not look like
it adds much overhead to the planning effort.

--
Best Regards
Andy Fan

Nikita Malakhov

hukutoc@gmail.com

about 1 year ago

In reply to: Andy Fan (#3)

Re: MergeAppend could consider sorting cheapest child path

Hi!

I've checked this thread and the examples in it, and I do not see stable
improvements in the base tests. Sometimes the base tests are considerably
slower with the patch, for example:

(patched)
explain analyze
select t1.* from matest0 t1, matest0 t2
where t1.b = t2.b and t2.c = t2.d
order by t1.b limit 10;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.46..19.90 rows=10 width=16) (actual time=0.007..0.008 rows=0 loops=1)
   ->  Merge Join  (cost=0.46..181.24 rows=93 width=16) (actual time=0.007..0.007 rows=0 loops=1)
         Merge Cond: (t1.b = t2.b)
         ->  Merge Append  (cost=0.17..90.44 rows=1851 width=16) (actual time=0.006..0.007 rows=0 loops=1)
               Sort Key: t1.b
               ->  Sort  (cost=0.01..0.02 rows=1 width=16) (actual time=0.004..0.004 rows=0 loops=1)
                     Sort Key: t1_1.b
                     Sort Method: quicksort  Memory: 25kB
                     ->  Seq Scan on matest0 t1_1  (cost=0.00..0.00 rows=1 width=16) (actual time=0.002..0.002 rows=0 loops=1)
               ->  Index Scan using matest1i on matest1 t1_2  (cost=0.15..71.90 rows=1850 width=16) (actual time=0.002..0.002 rows=0 loops=1)
         ->  Materialize  (cost=0.29..84.81 rows=10 width=4) (never executed)
               ->  Merge Append  (cost=0.29..84.78 rows=10 width=4) (never executed)
                     Sort Key: t2.b
                     ->  Index Scan using matest0i on matest0 t2_1  (cost=0.12..8.14 rows=1 width=4) (never executed)
                           Filter: (c = d)
                     ->  Index Scan using matest1i on matest1 t2_2  (cost=0.15..76.53 rows=9 width=4) (never executed)
                           Filter: (c = d)
 Planning Time: 0.252 ms
 Execution Time: 0.048 ms
(19 rows)

(vanilla)
explain analyze
select t1.* from matest0 t1, matest0 t2
where t1.b = t2.b and t2.c = t2.d
order by t1.b limit 10;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..20.88 rows=10 width=16) (actual time=0.004..0.004 rows=0 loops=1)
   ->  Merge Join  (cost=0.57..189.37 rows=93 width=16) (actual time=0.003..0.004 rows=0 loops=1)
         Merge Cond: (t1.b = t2.b)
         ->  Merge Append  (cost=0.29..98.56 rows=1851 width=16) (actual time=0.002..0.003 rows=0 loops=1)
               Sort Key: t1.b
               ->  Index Scan using matest0i on matest0 t1_1  (cost=0.12..8.14 rows=1 width=16) (actual time=0.002..0.002 rows=0 loops=1)
               ->  Index Scan using matest1i on matest1 t1_2  (cost=0.15..71.90 rows=1850 width=16) (actual time=0.001..0.001 rows=0 loops=1)
         ->  Materialize  (cost=0.29..84.81 rows=10 width=4) (never executed)
               ->  Merge Append  (cost=0.29..84.78 rows=10 width=4) (never executed)
                     Sort Key: t2.b
                     ->  Index Scan using matest0i on matest0 t2_1  (cost=0.12..8.14 rows=1 width=4) (never executed)
                           Filter: (c = d)
                     ->  Index Scan using matest1i on matest1 t2_2  (cost=0.15..76.53 rows=9 width=4) (never executed)
                           Filter: (c = d)
 Planning Time: 0.278 ms
 Execution Time: 0.025 ms
(16 rows)

--
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

Alexander Pyhalov

a.pyhalov@postgrespro.ru

10 months ago

In reply to: Andy Fan (#3)

1 attachment(s)

Re: MergeAppend could consider sorting cheapest child path

Andy Fan wrote on 2024-10-17 03:33:

Bruce Momjian <bruce@momjian.us> writes:

Is this still being considered?

I'd +1 on this feature. I guess this would be more useful on parallel
case, where the Sort can be pushed down to parallel worker, and in the
distributed database case, where the Sort can be pushed down to
multiple
nodes, at the result, the leader just do the merge works.

At the high level implementaion, sorting *cheapest* child path looks
doesn't add too much overhead on the planning effort.

Hi.

I've updated the patch. We found one more interesting case: when a
fractional path is selected, it can still be more expensive than the sorted
cheapest total path (because we look only at indexes with the necessary
pathkeys, not at indexes that allow fetching the data efficiently).
So far I couldn't construct an artificial example, but we've seen inadequate
index selection due to this issue: instead of the index suited to the
filters in the WHERE clause, the index suitable for sorting was selected
because it had the cheapest fractional cost.
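
To make the fractional case concrete, here is a minimal standalone sketch of the comparison, not the patch code itself. It assumes the usual linear interpolation between startup and total cost when only a fraction of the rows is fetched. The names SimplePath, fractional_cost, sorted_path and sort_is_cheaper are hypothetical, and the sort and emit costs are made-up lump sums rather than anything cost_sort() would compute.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the cost fields of a planner Path. */
typedef struct SimplePath
{
    double startup_cost;    /* cost spent before the first tuple is returned */
    double total_cost;      /* cost to return every tuple */
} SimplePath;

/* Estimated cost of fetching the given fraction (0 < fraction <= 1) of the output. */
double
fractional_cost(const SimplePath *path, double fraction)
{
    assert(fraction > 0.0 && fraction <= 1.0);
    return path->startup_cost +
        fraction * (path->total_cost - path->startup_cost);
}

/*
 * Cost of an explicit Sort on top of "input".  A sort has to read its whole
 * input before it can emit anything, so the input's total cost becomes part
 * of the startup cost.  sort_cost and emit_cost are assumed figures.
 */
SimplePath
sorted_path(const SimplePath *input, double sort_cost, double emit_cost)
{
    SimplePath result;

    result.startup_cost = input->total_cost + sort_cost;
    result.total_cost = result.startup_cost + emit_cost;
    return result;
}

/* Returns 1 when the sorted path is cheaper than the presorted one. */
int
sort_is_cheaper(const SimplePath *ordered, const SimplePath *sorted,
                double fraction)
{
    return fractional_cost(sorted, fraction) <
        fractional_cost(ordered, fraction);
}
```

With an ordered index scan costed at 0.3..10000 (it filters out most rows) and a cheap unordered scan costed at 0..100, the sorted variant wins once a noticeable share of the rows is needed, while the ordered index scan still wins for a very small fraction.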
--
Best regards,
Alexander Pyhalov,
Postgres Professional

Attachments:

v2-0001-MergeAppend-could-consider-using-sorted-best-path.patch (text/x-diff)

From 268e09beb85fb5f7ce01367cdacc846ab7af471f Mon Sep 17 00:00:00 2001
From: Alexander Pyhalov <a.pyhalov@postgrespro.ru>
Date: Tue, 18 Jun 2024 15:56:18 +0300
Subject: [PATCH] MergeAppend could consider using sorted best path.

It also can be considered when looking at the
cheapest fractional paths.

This helps when index with suitable pathkeys is not
good for filtering data.
---
 .../postgres_fdw/expected/postgres_fdw.out    |   6 +-
 src/backend/optimizer/path/allpaths.c         |  33 +++++
 .../regress/expected/collate.icu.utf8.out     |   9 +-
 src/test/regress/expected/inherit.out         |  60 ++++++++-
 src/test/regress/expected/partition_join.out  | 114 ++++++++++--------
 src/test/regress/expected/partition_prune.out |  16 +--
 src/test/regress/expected/union.out           |   6 +-
 src/test/regress/sql/inherit.sql              |  16 +++
 8 files changed, 194 insertions(+), 66 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index d1acee5a5fa..11b42f18cb6 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10277,13 +10277,15 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
    ->  Nested Loop
          Join Filter: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t1_1.a
+                     ->  Foreign Scan on ftprt1_p1 t1_1
                ->  Foreign Scan on ftprt1_p2 t1_2
          ->  Materialize
                ->  Append
                      ->  Foreign Scan on ftprt2_p1 t2_1
                      ->  Foreign Scan on ftprt2_p2 t2_2
-(10 rows)
+(12 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
   a  |  b  
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index df3453f99f0..7de62798945 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1852,6 +1852,9 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			Path	   *cheapest_startup,
 					   *cheapest_total,
 					   *cheapest_fractional = NULL;
+			bool		created_sort_path = false;
+			Path		sort_path;
+
 
 			/* Locate the right paths, if they are available. */
 			cheapest_startup =
@@ -1878,6 +1881,28 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 				/* Assert we do have an unparameterized path for this child */
 				Assert(cheapest_total->param_info == NULL);
 			}
+			else
+			{
+				/*
+				 * Even if we found necessary pathkeys, using unsorted path
+				 * can be more efficient.
+				 */
+				cost_sort(&sort_path,
+						  root,
+						  pathkeys,
+						  childrel->cheapest_total_path->disabled_nodes,
+						  childrel->cheapest_total_path->total_cost,
+						  childrel->cheapest_total_path->rows,
+						  childrel->cheapest_total_path->pathtarget->width,
+						  0.0,
+						  work_mem,
+						  -1.0 /* need all tuples to sort them */ );
+
+				created_sort_path = true;
+
+				if (compare_path_costs(&sort_path, cheapest_total, TOTAL_COST) < 0)
+					cheapest_total = childrel->cheapest_total_path;
+			}
 
 			/*
 			 * When building a fractional path, determine a cheapest
@@ -1909,6 +1934,14 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 				 */
 				if (!cheapest_fractional)
 					cheapest_fractional = cheapest_total;
+
+				/*
+				 * Even if we found necessary pathkeys, using sorted cheapest
+				 * total path can be more efficient.
+				 */
+				if (created_sort_path &&
+					compare_fractional_path_costs(&sort_path, cheapest_fractional, path_fraction) < 0)
+					cheapest_fractional = childrel->cheapest_total_path;
 			}
 
 			/*
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index aee4755c083..63c408961e1 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -2399,16 +2399,19 @@ EXPLAIN (COSTS OFF)
 SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" ORDER BY 1;
                        QUERY PLAN                       
 --------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: ((pagg_tab3.c)::text) COLLATE "C"
-   ->  Append
+   ->  Sort
+         Sort Key: ((pagg_tab3.c)::text) COLLATE "C"
          ->  HashAggregate
                Group Key: (pagg_tab3.c)::text
                ->  Seq Scan on pagg_tab3_p2 pagg_tab3
+   ->  Sort
+         Sort Key: ((pagg_tab3_1.c)::text) COLLATE "C"
          ->  HashAggregate
                Group Key: (pagg_tab3_1.c)::text
                ->  Seq Scan on pagg_tab3_p1 pagg_tab3_1
-(9 rows)
+(12 rows)
 
 SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" ORDER BY 1;
  c | count 
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index e671975a281..f37e53df844 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1638,10 +1638,12 @@ insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
 set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_sort = off; -- avoid sorting below MergeAppend
 explain (verbose, costs off) select * from matest0 order by 1-id;
                          QUERY PLAN                         
 ------------------------------------------------------------
  Sort
+   Disabled: true
    Output: matest0.id, matest0.name, ((1 - matest0.id))
    Sort Key: ((1 - matest0.id))
    ->  Result
@@ -1655,7 +1657,7 @@ explain (verbose, costs off) select * from matest0 order by 1-id;
                      Output: matest0_3.id, matest0_3.name
                ->  Seq Scan on public.matest3 matest0_4
                      Output: matest0_4.id, matest0_4.name
-(14 rows)
+(15 rows)
 
 select * from matest0 order by 1-id;
  id |  name  
@@ -1691,6 +1693,7 @@ select min(1-id) from matest0;
 (1 row)
 
 reset enable_indexscan;
+reset enable_sort;
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
 explain (verbose, costs off) select * from matest0 order by 1-id;
@@ -1816,16 +1819,20 @@ order by t1.b limit 10;
          Merge Cond: (t1.b = t2.b)
          ->  Merge Append
                Sort Key: t1.b
-               ->  Index Scan using matest0i on matest0 t1_1
+               ->  Sort
+                     Sort Key: t1_1.b
+                     ->  Seq Scan on matest0 t1_1
                ->  Index Scan using matest1i on matest1 t1_2
          ->  Materialize
                ->  Merge Append
                      Sort Key: t2.b
-                     ->  Index Scan using matest0i on matest0 t2_1
-                           Filter: (c = d)
+                     ->  Sort
+                           Sort Key: t2_1.b
+                           ->  Seq Scan on matest0 t2_1
+                                 Filter: (c = d)
                      ->  Index Scan using matest1i on matest1 t2_2
                            Filter: (c = d)
-(14 rows)
+(18 rows)
 
 reset enable_nestloop;
 drop table matest0 cascade;
@@ -3511,6 +3518,49 @@ explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
 
 reset enable_bitmapscan;
 drop table mcrparted;
+-- Check that sort path can be used by MergeAppend even when there are suitable pathkeys
+create table hash_parted (i int, j int, k int) partition by hash(i);
+create table hash_parted_1 partition of hash_parted for values with (modulus 4, remainder 0);
+create table hash_parted_2 partition of hash_parted for values with (modulus 4, remainder 1);
+create table hash_parted_3 partition of hash_parted for values with (modulus 4, remainder 2);
+create table hash_parted_4 partition of hash_parted for values with (modulus 4, remainder 3);
+create index on hash_parted(i, j);
+create index on hash_parted(k);
+create index on hash_parted(i);
+insert into hash_parted select i, i%1000, i%100  from generate_series(1,10000) i;
+analyze hash_parted;
+explain (costs off) select * from hash_parted where k < 5 order by i,j;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Merge Append
+   Sort Key: hash_parted.i, hash_parted.j
+   ->  Sort
+         Sort Key: hash_parted_1.i, hash_parted_1.j
+         ->  Bitmap Heap Scan on hash_parted_1
+               Recheck Cond: (k < 5)
+               ->  Bitmap Index Scan on hash_parted_1_k_idx
+                     Index Cond: (k < 5)
+   ->  Sort
+         Sort Key: hash_parted_2.i, hash_parted_2.j
+         ->  Bitmap Heap Scan on hash_parted_2
+               Recheck Cond: (k < 5)
+               ->  Bitmap Index Scan on hash_parted_2_k_idx
+                     Index Cond: (k < 5)
+   ->  Sort
+         Sort Key: hash_parted_3.i, hash_parted_3.j
+         ->  Bitmap Heap Scan on hash_parted_3
+               Recheck Cond: (k < 5)
+               ->  Bitmap Index Scan on hash_parted_3_k_idx
+                     Index Cond: (k < 5)
+   ->  Sort
+         Sort Key: hash_parted_4.i, hash_parted_4.j
+         ->  Bitmap Heap Scan on hash_parted_4
+               Recheck Cond: (k < 5)
+               ->  Bitmap Index Scan on hash_parted_4_k_idx
+                     Index Cond: (k < 5)
+(26 rows)
+
+drop table hash_parted;
 -- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
 create table bool_lp (b bool) partition by list(b);
 create table bool_lp_true partition of bool_lp for values in(true);
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 938cedd79ad..2b8394e1647 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -65,31 +65,34 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
 -- inner join with partially-redundant join clauses
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
-                          QUERY PLAN                           
----------------------------------------------------------------
- Sort
+                       QUERY PLAN                        
+---------------------------------------------------------
+ Merge Append
    Sort Key: t1.a
-   ->  Append
-         ->  Merge Join
-               Merge Cond: (t1_1.a = t2_1.a)
-               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
-               ->  Sort
-                     Sort Key: t2_1.b
-                     ->  Seq Scan on prt2_p1 t2_1
-                           Filter: (a = b)
+   ->  Merge Join
+         Merge Cond: (t1_1.a = t2_1.a)
+         ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t2_1.b
+               ->  Seq Scan on prt2_p1 t2_1
+                     Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Join
                Hash Cond: (t1_2.a = t2_2.a)
                ->  Seq Scan on prt1_p2 t1_2
                ->  Hash
                      ->  Seq Scan on prt2_p2 t2_2
                            Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Join
                Hash Cond: (t1_3.a = t2_3.a)
                ->  Seq Scan on prt1_p3 t1_3
                ->  Hash
                      ->  Seq Scan on prt2_p3 t2_3
                            Filter: (a = b)
-(22 rows)
+(25 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
  a  |  c   | b  |  c   
@@ -1383,28 +1386,32 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
 -- This should generate a partitionwise join, but currently fails to
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-                        QUERY PLAN                         
------------------------------------------------------------
- Incremental Sort
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Sort
    Sort Key: prt1.a, prt2.b
-   Presorted Key: prt1.a
-   ->  Merge Left Join
-         Merge Cond: (prt1.a = prt2.b)
-         ->  Sort
-               Sort Key: prt1.a
-               ->  Append
-                     ->  Seq Scan on prt1_p1 prt1_1
-                           Filter: ((a < 450) AND (b = 0))
-                     ->  Seq Scan on prt1_p2 prt1_2
-                           Filter: ((a < 450) AND (b = 0))
-         ->  Sort
-               Sort Key: prt2.b
-               ->  Append
+   ->  Merge Right Join
+         Merge Cond: (prt2.b = prt1.a)
+         ->  Append
+               ->  Sort
+                     Sort Key: prt2_1.b
                      ->  Seq Scan on prt2_p2 prt2_1
                            Filter: (b > 250)
+               ->  Sort
+                     Sort Key: prt2_2.b
                      ->  Seq Scan on prt2_p3 prt2_2
                            Filter: (b > 250)
-(19 rows)
+         ->  Materialize
+               ->  Append
+                     ->  Sort
+                           Sort Key: prt1_1.a
+                           ->  Seq Scan on prt1_p1 prt1_1
+                                 Filter: ((a < 450) AND (b = 0))
+                     ->  Sort
+                           Sort Key: prt1_2.a
+                           ->  Seq Scan on prt1_p2 prt1_2
+                                 Filter: ((a < 450) AND (b = 0))
+(23 rows)
 
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
   a  |  b  
@@ -1424,25 +1431,33 @@ SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT *
 -- partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
-                                       QUERY PLAN                                        
------------------------------------------------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Merge Join
-   Merge Cond: ((t1.a = t2.b) AND (((((t1.*)::prt1))::text) = ((((t2.*)::prt2))::text)))
-   ->  Sort
-         Sort Key: t1.a, ((((t1.*)::prt1))::text)
-         ->  Result
-               ->  Append
-                     ->  Seq Scan on prt1_p1 t1_1
-                     ->  Seq Scan on prt1_p2 t1_2
-                     ->  Seq Scan on prt1_p3 t1_3
-   ->  Sort
-         Sort Key: t2.b, ((((t2.*)::prt2))::text)
-         ->  Result
-               ->  Append
+   Merge Cond: (t1.a = t2.b)
+   Join Filter: ((((t2.*)::prt2))::text = (((t1.*)::prt1))::text)
+   ->  Append
+         ->  Sort
+               Sort Key: t1_1.a
+               ->  Seq Scan on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t1_2.a
+               ->  Seq Scan on prt1_p2 t1_2
+         ->  Sort
+               Sort Key: t1_3.a
+               ->  Seq Scan on prt1_p3 t1_3
+   ->  Materialize
+         ->  Append
+               ->  Sort
+                     Sort Key: t2_1.b
                      ->  Seq Scan on prt2_p1 t2_1
+               ->  Sort
+                     Sort Key: t2_2.b
                      ->  Seq Scan on prt2_p2 t2_2
+               ->  Sort
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3
-(16 rows)
+(24 rows)
 
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
  a  | b  
@@ -5037,21 +5052,26 @@ EXPLAIN (COSTS OFF)
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
                              QUERY PLAN                             
 --------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a, t1.b
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a, t1_1.b
          ->  Hash Join
                Hash Cond: ((t1_1.a = t2_1.a) AND (t1_1.b = t2_1.b))
                ->  Seq Scan on alpha_neg_p1 t1_1
                      Filter: ((b >= 125) AND (b < 225))
                ->  Hash
                      ->  Seq Scan on beta_neg_p1 t2_1
+   ->  Sort
+         Sort Key: t1_2.a, t1_2.b
          ->  Hash Join
                Hash Cond: ((t2_2.a = t1_2.a) AND (t2_2.b = t1_2.b))
                ->  Seq Scan on beta_neg_p2 t2_2
                ->  Hash
                      ->  Seq Scan on alpha_neg_p2 t1_2
                            Filter: ((b >= 125) AND (b < 225))
+   ->  Sort
+         Sort Key: t1_4.a, t1_4.b
          ->  Hash Join
                Hash Cond: ((t2_4.a = t1_4.a) AND (t2_4.b = t1_4.b))
                ->  Append
@@ -5066,7 +5086,7 @@ SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2
                                  Filter: ((b >= 125) AND (b < 225))
                            ->  Seq Scan on alpha_pos_p3 t1_6
                                  Filter: ((b >= 125) AND (b < 225))
-(29 rows)
+(34 rows)
 
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
  a  |  b  |  c   | a  |  b  |  c   
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 0bf35260b46..b25aa73e946 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -4763,9 +4763,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc.a ORDER BY part_abc.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_1
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d <= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_1.a
+                           ->  Seq Scan on part_abc_2 part_abc_1
+                                 Filter: ((d <= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: part_abc_3.a
                            Subplans Removed: 1
@@ -4780,9 +4781,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc_5.a ORDER BY part_abc_5.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_6
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d >= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_6.a
+                           ->  Seq Scan on part_abc_2 part_abc_6
+                                 Filter: ((d >= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: a
                            Subplans Removed: 1
@@ -4792,7 +4794,7 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                            ->  Index Scan using part_abc_3_3_a_idx on part_abc_3_3 part_abc_9
                                  Index Cond: (a >= (stable_one() + 1))
                                  Filter: (d >= stable_one())
-(35 rows)
+(37 rows)
 
 drop view part_abc_view;
 drop table part_abc;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 96962817ed4..0ccadea910c 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1207,12 +1207,14 @@ select event_id
 ----------------------------------------------------------
  Merge Append
    Sort Key: events.event_id
-   ->  Index Scan using events_pkey on events
+   ->  Sort
+         Sort Key: events.event_id
+         ->  Seq Scan on events
    ->  Sort
          Sort Key: events_1.event_id
          ->  Seq Scan on events_child events_1
    ->  Index Scan using other_events_pkey on other_events
-(7 rows)
+(9 rows)
 
 drop table events_child, events, other_events;
 reset enable_indexonlyscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 4e73c70495c..cb80f85b7fe 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -634,11 +634,13 @@ insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
 
 set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_sort = off; -- avoid sorting below MergeAppend
 explain (verbose, costs off) select * from matest0 order by 1-id;
 select * from matest0 order by 1-id;
 explain (verbose, costs off) select min(1-id) from matest0;
 select min(1-id) from matest0;
 reset enable_indexscan;
+reset enable_sort;
 
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
@@ -1384,6 +1386,20 @@ reset enable_bitmapscan;
 
 drop table mcrparted;
 
+-- Check that sort path can be used by MergeAppend even when there are suitable pathkeys
+create table hash_parted (i int, j int, k int) partition by hash(i);
+create table hash_parted_1 partition of hash_parted for values with (modulus 4, remainder 0);
+create table hash_parted_2 partition of hash_parted for values with (modulus 4, remainder 1);
+create table hash_parted_3 partition of hash_parted for values with (modulus 4, remainder 2);
+create table hash_parted_4 partition of hash_parted for values with (modulus 4, remainder 3);
+create index on hash_parted(i, j);
+create index on hash_parted(k);
+create index on hash_parted(i);
+insert into hash_parted select i, i%1000, i%100  from generate_series(1,10000) i;
+analyze hash_parted;
+explain (costs off) select * from hash_parted where k < 5 order by i,j;
+drop table hash_parted;
+
 -- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
 create table bool_lp (b bool) partition by list(b);
 create table bool_lp_true partition of bool_lp for values in(true);
-- 
2.43.0

Andrei Lepikhov

lepihov@gmail.com

9 months ago

In reply to: Alexander Pyhalov (#5)

1 attachment(s)

Re: MergeAppend could consider sorting cheapest child path

On 3/28/25 09:19, Alexander Pyhalov wrote:

Andy Fan wrote on 2024-10-17 03:33:
I've updated patch. One more interesting case which we found - when
fractional path is selected, it still can be more expensive than sorted
cheapest total path (as we look only on indexes whith necessary
pathkeys, not on indexes which allow efficiently fetch data).
So far couldn't find artificial example, but we've seen inadequate index
selection due to this issue - instead of using index suited for filters
in where, index, suitable for sorting was selected as one having the
cheapest fractional cost.

I think it is necessary to generalise the approach a little.

Each MergeAppend subpath candidate that fits the pathkeys should be compared
to the overall-optimal path plus an explicit Sort node. Let's label this
two-node composition as base_path. There are three cases:
startup-optimal, total-optimal and fractional-optimal.
In the startup case, base_path should use cheapest_startup_path; in the
total-optimal case, cheapest_total_path; and for the fractional case, we
may employ the get_cheapest_fractional_path routine to find the proper
base_path.
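
As a rough illustration of the three cases, here is a minimal, self-contained C sketch. It is not planner code: SketchPath, SketchCriterion and sketch_choose_base_path are hypothetical names made up for this example, and the fractional case assumes a linear interpolation between startup and total cost, which is how I understand the planner's fractional comparison to work.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the planner's Path: only the two cost fields. */
typedef struct SketchPath
{
    double      startup_cost;
    double      total_cost;
} SketchPath;

/* Hypothetical criterion; the real CostSelector has no fractional member. */
typedef enum SketchCriterion
{
    SKETCH_STARTUP,
    SKETCH_TOTAL,
    SKETCH_FRACTIONAL
} SketchCriterion;

/* Cost of fetching a fraction (0..1] of the output, assuming linear interpolation. */
static double
sketch_fractional_cost(const SketchPath *p, double fraction)
{
    return p->startup_cost + fraction * (p->total_cost - p->startup_cost);
}

/*
 * Pick the path that would sit under the explicit Sort for the given
 * criterion, ignoring pathkeys entirely. Returns NULL for an empty list.
 */
static const SketchPath *
sketch_choose_base_path(const SketchPath *paths, size_t npaths,
                        SketchCriterion criterion, double fraction)
{
    const SketchPath *best = NULL;
    double      best_cost = 0.0;
    size_t      i;

    assert(fraction > 0.0 && fraction <= 1.0);

    for (i = 0; i < npaths; i++)
    {
        double      cost;

        switch (criterion)
        {
            case SKETCH_STARTUP:
                cost = paths[i].startup_cost;
                break;
            case SKETCH_FRACTIONAL:
                cost = sketch_fractional_cost(&paths[i], fraction);
                break;
            case SKETCH_TOTAL:
            default:
                cost = paths[i].total_cost;
                break;
        }

        if (best == NULL || cost < best_cost)
        {
            best = &paths[i];
            best_cost = cost;
        }
    }
    return best;
}
```

The same list of paths can yield a different base path for each criterion, which is why the ordered candidate has to be compared separately in each of the three cases.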

It may provide a more effective plan in full, fractional and
partial scan cases:
1. The Sort node may be pushed down to subpaths under a parallel or
async Append.
2. When a small subset of subpaths has no suitable index, it can be
profitable to sort just those instead of switching to a plain Append.

In general, analysing the regression tests changed by this code, I see
that the cost model prefers a series of small sorts to a
single massive one. Maybe this will also pay off in execution time.

I don't expect any palpable planning overhead here because we just
do cheap cost estimation and comparison operations that don't need
additional memory allocations. The caller is responsible for building a
proper Sort node if this method is chosen as optimal.
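
To make the "estimate and compare only" step concrete, here is a minimal C sketch under simplifying assumptions. SketchCost, sketch_cost_sort and sketch_prefer_explicit_sort are hypothetical names; the sort estimate only loosely mirrors an in-memory sort (input fully consumed, then roughly N * log2(N) comparisons), and the constants 0.005 and 0.0025 are assumed to correspond to the default 2 * cpu_operator_cost and cpu_operator_cost. Nothing is allocated, and no Sort node is built.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cost triple; the real Path carries much more. */
typedef struct SketchCost
{
    double      startup_cost;
    double      total_cost;
    double      rows;
} SketchCost;

/* Number of halvings needed to bring n down to 1 or below (a ceil(log2(n))). */
static double
sketch_log2_ceil(double n)
{
    double      bits = 0.0;

    while (n > 1.0)
    {
        n /= 2.0;
        bits += 1.0;
    }
    return bits;
}

/*
 * Simplified in-memory sort estimate. The whole input must be read before
 * the first row can be returned, so the input's total cost becomes part of
 * the sort's startup cost.
 */
static SketchCost
sketch_cost_sort(SketchCost input, double cmp_cost, double per_row_cost)
{
    SketchCost  sorted;
    double      rows = input.rows < 2.0 ? 2.0 : input.rows;

    assert(input.rows >= 0.0);

    sorted.rows = input.rows;
    sorted.startup_cost = input.total_cost +
        cmp_cost * rows * sketch_log2_ceil(rows);
    sorted.total_cost = sorted.startup_cost + per_row_cost * input.rows;
    return sorted;
}

/*
 * Return true if sorting the unordered base path is estimated to be cheaper
 * than the already-ordered candidate under the chosen criterion.
 */
static bool
sketch_prefer_explicit_sort(SketchCost ordered, SketchCost base,
                            bool by_startup)
{
    SketchCost  sorted = sketch_cost_sort(base, 0.005, 0.0025);

    if (by_startup)
        return sorted.startup_cost < ordered.startup_cost;
    return sorted.total_cost < ordered.total_cost;
}
```

With an ordered index scan that filters out most rows (high total cost, low startup cost) and a cheap unordered scan, the sorted variant can win on total cost while still losing on startup cost, because its startup cost already contains the full cost of the input.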

In the attachment, see the patch written according to this idea. There,
I introduced two new routines:
get_cheapest_path_for_pathkeys_ext
get_cheapest_fractional_path_for_pathkeys_ext

I have designed the code that way to reduce duplicated code in the
generate_orderedappend_paths routine. But the main point is that I
envision these new routines may be reused elsewhere.

This feature looks close to the one we discussed before [1]. So, let me
CC the people from the previous conversation to the discussion.

[1]: /messages/by-id/CAN-LCVPxnWB39CUBTgOQ9O7Dd8DrA_tpT1EY3LNVnUuvAX1NjA@mail.gmail.com

--
regards, Andrei Lepikhov

Attachments:

v0-0001-Consider-explicit-sort-of-the-MergeAppend-subpath.patch (text/x-patch; charset=UTF-8)

From 98306f0e14c12b6dee92ef5977d85fc1dd324898 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Thu, 24 Apr 2025 14:03:02 +0200
Subject: [PATCH v0] Consider explicit sort of the MergeAppend subpaths.

Expand the optimiser search scope a little: fetching optimal subpath matching
pathkeys of the planning MergeAppend, consider the extra case of
overall-optimal path plus explicit Sort node at the top.

It may provide a more effective plan in both full and fractional scan cases:
1. The Sort node may be pushed down to subpaths under a parallel or
async Append.
2. The case is when a minor set of subpaths doesn't have a proper index, and it
is profitable to sort them instead of switching to plain Append.

In general, analysing the regression tests changed by this code, I see that
the cost model prefers a series of small sortings instead of a single massive
one.

Overhead:
It seems multiple subpaths may be encountered, as well as many pathkeys.
So, to be as careful as possible here, only cost estimation is performed.
---
 .../postgres_fdw/expected/postgres_fdw.out    |   6 +-
 src/backend/optimizer/path/allpaths.c         |  35 ++---
 src/backend/optimizer/path/pathkeys.c         | 112 ++++++++++++++
 src/include/optimizer/paths.h                 |  10 ++
 .../regress/expected/collate.icu.utf8.out     |   9 +-
 src/test/regress/expected/inherit.out         |  19 ++-
 src/test/regress/expected/partition_join.out  | 139 +++++++++++-------
 src/test/regress/expected/partition_prune.out |  16 +-
 src/test/regress/expected/union.out           |   6 +-
 src/test/regress/sql/inherit.sql              |   4 +-
 10 files changed, 255 insertions(+), 101 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index d1acee5a5fa..11b42f18cb6 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10277,13 +10277,15 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
    ->  Nested Loop
          Join Filter: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t1_1.a
+                     ->  Foreign Scan on ftprt1_p1 t1_1
                ->  Foreign Scan on ftprt1_p2 t1_2
          ->  Materialize
                ->  Append
                      ->  Foreign Scan on ftprt2_p1 t2_1
                      ->  Foreign Scan on ftprt2_p2 t2_2
-(10 rows)
+(12 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
   a  |  b  
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905250b3325..f48f5b94c0a 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1855,29 +1855,13 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 
 			/* Locate the right paths, if they are available. */
 			cheapest_startup =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   STARTUP_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, STARTUP_COST, false);
 			cheapest_total =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   TOTAL_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, TOTAL_COST, false);
 
-			/*
-			 * If we can't find any paths with the right order just use the
-			 * cheapest-total path; we'll have to sort it later.
-			 */
-			if (cheapest_startup == NULL || cheapest_total == NULL)
-			{
-				cheapest_startup = cheapest_total =
-					childrel->cheapest_total_path;
-				/* Assert we do have an unparameterized path for this child */
-				Assert(cheapest_total->param_info == NULL);
-			}
+			Assert(cheapest_total->param_info == NULL);
 
 			/*
 			 * When building a fractional path, determine a cheapest
@@ -1894,10 +1878,11 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 				double		path_fraction = (1.0 / root->tuple_fraction);
 
 				cheapest_fractional =
-					get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
-															  pathkeys,
-															  NULL,
-															  path_fraction);
+					get_cheapest_fractional_path_for_pathkeys_ext(root,
+																  childrel,
+																  pathkeys,
+																  NULL,
+																  path_fraction);
 
 				/*
 				 * If we found no path with matching pathkeys, use the
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 8b04d40d36d..4a5e37b493c 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -19,11 +19,13 @@
 
 #include "access/stratnum.h"
 #include "catalog/pg_opfamily.h"
+#include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "optimizer/planner.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -648,6 +650,72 @@ get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_path_for_pathkeys_ext
+ *	  Calls get_cheapest_path_for_pathkeys to obtain cheapest path that
+ *	  satisfies defined criterias and considers one more option: choose
+ *	  overall-optimal path (according the criterion) and explicitly sort its
+ *	  output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_path_for_pathkeys_ext(PlannerInfo *root, RelOptInfo *rel,
+								   List *pathkeys, Relids required_outer,
+								   CostSelector cost_criterion,
+								   bool require_parallel_safe)
+{
+	Path	sort_path;
+	Path   *base_path;
+	Path   *path;
+
+	path = get_cheapest_path_for_pathkeys(rel->pathlist, pathkeys,
+										  required_outer, cost_criterion,
+										  require_parallel_safe);
+
+
+	if (path == NULL)
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return rel->cheapest_total_path;
+
+	switch (cost_criterion)
+	{
+		case STARTUP_COST:
+			base_path = rel->cheapest_startup_path;
+			break;
+		case TOTAL_COST:
+			base_path = rel->cheapest_total_path;
+			break;
+		default:
+			/* In case of new criteries */
+			elog(ERROR, "unrecognized cost criterion: %d", cost_criterion);
+			break;
+	}
+
+	/* Stop here if the path doesn't satisfy necessary conditions */
+	if ((require_parallel_safe && !base_path->parallel_safe) ||
+		!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	/* Consider the most startup-optimal path with extra sort */
+	if (base_path && path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (compare_path_costs(&sort_path, path, cost_criterion) < 0)
+			return base_path;
+	}
+	return path;
+}
+
 /*
  * get_cheapest_fractional_path_for_pathkeys
  *	  Find the cheapest path (for retrieving a specified fraction of all
@@ -690,6 +758,50 @@ get_cheapest_fractional_path_for_pathkeys(List *paths,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_fractional_path_for_pathkeys_ext
+ *	  obtain cheapest fractional path that satisfies defined criterias excluding
+ *	  pathkeys and explicitly sort its output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  List *pathkeys,
+											  Relids required_outer,
+											  double fraction)
+{
+	Path	sort_path;
+	Path   *base_path;
+	Path   *path;
+
+	path = get_cheapest_fractional_path_for_pathkeys(rel->pathlist, pathkeys,
+													 required_outer, fraction);
+
+
+	base_path = get_cheapest_fractional_path(rel, root->tuple_fraction);
+
+	/* Stop here if the path doesn't satisfy necessary conditions */
+	if (!base_path || !bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (!path ||
+			compare_fractional_path_costs(&sort_path, path, fraction) <= 0)
+			return base_path;
+	}
+
+	return path;
+}
 
 /*
  * get_cheapest_parallel_safe_total_inner
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index a48c9721797..4bdc85afca9 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -224,10 +224,20 @@ extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
 											bool require_parallel_safe);
+extern Path *get_cheapest_path_for_pathkeys_ext(PlannerInfo *root,
+												RelOptInfo *rel, List *pathkeys,
+												Relids required_outer,
+												CostSelector cost_criterion,
+												bool require_parallel_safe);
 extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 													   List *pathkeys,
 													   Relids required_outer,
 													   double fraction);
+extern Path *get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+														   RelOptInfo *rel,
+														   List *pathkeys,
+														   Relids required_outer,
+														   double fraction);
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 								  ScanDirection scandir);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 69805d4b9ec..48d47bb7455 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -2412,16 +2412,19 @@ EXPLAIN (COSTS OFF)
 SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" ORDER BY 1;
                        QUERY PLAN                       
 --------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: ((pagg_tab3.c)::text) COLLATE "C"
-   ->  Append
+   ->  Sort
+         Sort Key: ((pagg_tab3.c)::text) COLLATE "C"
          ->  HashAggregate
                Group Key: (pagg_tab3.c)::text
                ->  Seq Scan on pagg_tab3_p2 pagg_tab3
+   ->  Sort
+         Sort Key: ((pagg_tab3_1.c)::text) COLLATE "C"
          ->  HashAggregate
                Group Key: (pagg_tab3_1.c)::text
                ->  Seq Scan on pagg_tab3_p1 pagg_tab3_1
-(9 rows)
+(12 rows)
 
 SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" ORDER BY 1;
  c | count 
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index f9b0c415cfd..71036dc938f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1665,11 +1665,13 @@ insert into matest2 (name) values ('Test 3');
 insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
                          QUERY PLAN                         
 ------------------------------------------------------------
  Sort
+   Disabled: true
    Output: matest0.id, matest0.name, ((1 - matest0.id))
    Sort Key: ((1 - matest0.id))
    ->  Result
@@ -1683,7 +1685,7 @@ explain (verbose, costs off) select * from matest0 order by 1-id;
                      Output: matest0_3.id, matest0_3.name
                ->  Seq Scan on public.matest3 matest0_4
                      Output: matest0_4.id, matest0_4.name
-(14 rows)
+(15 rows)
 
 select * from matest0 order by 1-id;
  id |  name  
@@ -1719,6 +1721,7 @@ select min(1-id) from matest0;
 (1 row)
 
 reset enable_indexscan;
+reset enable_sort;
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
 explain (verbose, costs off) select * from matest0 order by 1-id;
@@ -1844,16 +1847,20 @@ order by t1.b limit 10;
          Merge Cond: (t1.b = t2.b)
          ->  Merge Append
                Sort Key: t1.b
-               ->  Index Scan using matest0i on matest0 t1_1
+               ->  Sort
+                     Sort Key: t1_1.b
+                     ->  Seq Scan on matest0 t1_1
                ->  Index Scan using matest1i on matest1 t1_2
          ->  Materialize
                ->  Merge Append
                      Sort Key: t2.b
-                     ->  Index Scan using matest0i on matest0 t2_1
-                           Filter: (c = d)
+                     ->  Sort
+                           Sort Key: t2_1.b
+                           ->  Seq Scan on matest0 t2_1
+                                 Filter: (c = d)
                      ->  Index Scan using matest1i on matest1 t2_2
                            Filter: (c = d)
-(14 rows)
+(18 rows)
 
 reset enable_nestloop;
 drop table matest0 cascade;
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 6101c8c7cf1..d41979367c9 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -65,31 +65,34 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
 -- inner join with partially-redundant join clauses
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
-                          QUERY PLAN                           
----------------------------------------------------------------
- Sort
+                       QUERY PLAN                        
+---------------------------------------------------------
+ Merge Append
    Sort Key: t1.a
-   ->  Append
-         ->  Merge Join
-               Merge Cond: (t1_1.a = t2_1.a)
-               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
-               ->  Sort
-                     Sort Key: t2_1.b
-                     ->  Seq Scan on prt2_p1 t2_1
-                           Filter: (a = b)
+   ->  Merge Join
+         Merge Cond: (t1_1.a = t2_1.a)
+         ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t2_1.b
+               ->  Seq Scan on prt2_p1 t2_1
+                     Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Join
                Hash Cond: (t1_2.a = t2_2.a)
                ->  Seq Scan on prt1_p2 t1_2
                ->  Hash
                      ->  Seq Scan on prt2_p2 t2_2
                            Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Join
                Hash Cond: (t1_3.a = t2_3.a)
                ->  Seq Scan on prt1_p3 t1_3
                ->  Hash
                      ->  Seq Scan on prt2_p3 t2_3
                            Filter: (a = b)
-(22 rows)
+(25 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
  a  |  c   | b  |  c   
@@ -1383,28 +1386,32 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
 -- This should generate a partitionwise join, but currently fails to
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-                        QUERY PLAN                         
------------------------------------------------------------
- Incremental Sort
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Sort
    Sort Key: prt1.a, prt2.b
-   Presorted Key: prt1.a
-   ->  Merge Left Join
-         Merge Cond: (prt1.a = prt2.b)
-         ->  Sort
-               Sort Key: prt1.a
-               ->  Append
-                     ->  Seq Scan on prt1_p1 prt1_1
-                           Filter: ((a < 450) AND (b = 0))
-                     ->  Seq Scan on prt1_p2 prt1_2
-                           Filter: ((a < 450) AND (b = 0))
-         ->  Sort
-               Sort Key: prt2.b
-               ->  Append
+   ->  Merge Right Join
+         Merge Cond: (prt2.b = prt1.a)
+         ->  Append
+               ->  Sort
+                     Sort Key: prt2_1.b
                      ->  Seq Scan on prt2_p2 prt2_1
                            Filter: (b > 250)
+               ->  Sort
+                     Sort Key: prt2_2.b
                      ->  Seq Scan on prt2_p3 prt2_2
                            Filter: (b > 250)
-(19 rows)
+         ->  Materialize
+               ->  Append
+                     ->  Sort
+                           Sort Key: prt1_1.a
+                           ->  Seq Scan on prt1_p1 prt1_1
+                                 Filter: ((a < 450) AND (b = 0))
+                     ->  Sort
+                           Sort Key: prt1_2.a
+                           ->  Seq Scan on prt1_p2 prt1_2
+                                 Filter: ((a < 450) AND (b = 0))
+(23 rows)
 
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
   a  |  b  
@@ -1424,25 +1431,33 @@ SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT *
 -- partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
-                                       QUERY PLAN                                        
------------------------------------------------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Merge Join
-   Merge Cond: ((t1.a = t2.b) AND (((((t1.*)::prt1))::text) = ((((t2.*)::prt2))::text)))
-   ->  Sort
-         Sort Key: t1.a, ((((t1.*)::prt1))::text)
-         ->  Result
-               ->  Append
-                     ->  Seq Scan on prt1_p1 t1_1
-                     ->  Seq Scan on prt1_p2 t1_2
-                     ->  Seq Scan on prt1_p3 t1_3
-   ->  Sort
-         Sort Key: t2.b, ((((t2.*)::prt2))::text)
-         ->  Result
-               ->  Append
+   Merge Cond: (t1.a = t2.b)
+   Join Filter: ((((t2.*)::prt2))::text = (((t1.*)::prt1))::text)
+   ->  Append
+         ->  Sort
+               Sort Key: t1_1.a
+               ->  Seq Scan on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t1_2.a
+               ->  Seq Scan on prt1_p2 t1_2
+         ->  Sort
+               Sort Key: t1_3.a
+               ->  Seq Scan on prt1_p3 t1_3
+   ->  Materialize
+         ->  Append
+               ->  Sort
+                     Sort Key: t2_1.b
                      ->  Seq Scan on prt2_p1 t2_1
+               ->  Sort
+                     Sort Key: t2_2.b
                      ->  Seq Scan on prt2_p2 t2_2
+               ->  Sort
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3
-(16 rows)
+(24 rows)
 
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
  a  | b  
@@ -4508,9 +4523,10 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
                                    QUERY PLAN                                   
 --------------------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a
          ->  Hash Right Join
                Hash Cond: ((t3_1.a = t1_1.a) AND (t3_1.c = t1_1.c))
                ->  Seq Scan on plt1_adv_p1 t3_1
@@ -4521,6 +4537,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p1 t1_1
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Right Join
                Hash Cond: ((t3_2.a = t1_2.a) AND (t3_2.c = t1_2.c))
                ->  Seq Scan on plt1_adv_p2 t3_2
@@ -4531,6 +4549,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p2 t1_2
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Right Join
                Hash Cond: ((t3_3.a = t1_3.a) AND (t3_3.c = t1_3.c))
                ->  Seq Scan on plt1_adv_p3 t3_3
@@ -4541,15 +4561,19 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p3 t1_3
                                        Filter: (b < 10)
-         ->  Nested Loop Left Join
-               Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
-               ->  Nested Loop Left Join
-                     Join Filter: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+   ->  Nested Loop Left Join
+         Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
+         ->  Merge Left Join
+               Merge Cond: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+               ->  Sort
+                     Sort Key: t1_4.a, t1_4.c
                      ->  Seq Scan on plt1_adv_extra t1_4
                            Filter: (b < 10)
+               ->  Sort
+                     Sort Key: t2_4.a, t2_4.c
                      ->  Seq Scan on plt2_adv_extra t2_4
-               ->  Seq Scan on plt1_adv_extra t3_4
-(41 rows)
+         ->  Seq Scan on plt1_adv_extra t3_4
+(50 rows)
 
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
  a  |  c   | a |  c   | a |  c   
@@ -5037,21 +5061,26 @@ EXPLAIN (COSTS OFF)
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
                              QUERY PLAN                             
 --------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a, t1.b
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a, t1_1.b
          ->  Hash Join
                Hash Cond: ((t1_1.a = t2_1.a) AND (t1_1.b = t2_1.b))
                ->  Seq Scan on alpha_neg_p1 t1_1
                      Filter: ((b >= 125) AND (b < 225))
                ->  Hash
                      ->  Seq Scan on beta_neg_p1 t2_1
+   ->  Sort
+         Sort Key: t1_2.a, t1_2.b
          ->  Hash Join
                Hash Cond: ((t2_2.a = t1_2.a) AND (t2_2.b = t1_2.b))
                ->  Seq Scan on beta_neg_p2 t2_2
                ->  Hash
                      ->  Seq Scan on alpha_neg_p2 t1_2
                            Filter: ((b >= 125) AND (b < 225))
+   ->  Sort
+         Sort Key: t1_4.a, t1_4.b
          ->  Hash Join
                Hash Cond: ((t2_4.a = t1_4.a) AND (t2_4.b = t1_4.b))
                ->  Append
@@ -5066,7 +5095,7 @@ SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2
                                  Filter: ((b >= 125) AND (b < 225))
                            ->  Seq Scan on alpha_pos_p3 t1_6
                                  Filter: ((b >= 125) AND (b < 225))
-(29 rows)
+(34 rows)
 
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
  a  |  b  |  c   | a  |  b  |  c   
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 0bf35260b46..b25aa73e946 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -4763,9 +4763,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc.a ORDER BY part_abc.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_1
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d <= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_1.a
+                           ->  Seq Scan on part_abc_2 part_abc_1
+                                 Filter: ((d <= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: part_abc_3.a
                            Subplans Removed: 1
@@ -4780,9 +4781,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc_5.a ORDER BY part_abc_5.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_6
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d >= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_6.a
+                           ->  Seq Scan on part_abc_2 part_abc_6
+                                 Filter: ((d >= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: a
                            Subplans Removed: 1
@@ -4792,7 +4794,7 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                            ->  Index Scan using part_abc_3_3_a_idx on part_abc_3_3 part_abc_9
                                  Index Cond: (a >= (stable_one() + 1))
                                  Filter: (d >= stable_one())
-(35 rows)
+(37 rows)
 
 drop view part_abc_view;
 drop table part_abc;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 96962817ed4..0ccadea910c 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1207,12 +1207,14 @@ select event_id
 ----------------------------------------------------------
  Merge Append
    Sort Key: events.event_id
-   ->  Index Scan using events_pkey on events
+   ->  Sort
+         Sort Key: events.event_id
+         ->  Seq Scan on events
    ->  Sort
          Sort Key: events_1.event_id
          ->  Seq Scan on events_child events_1
    ->  Index Scan using other_events_pkey on other_events
-(7 rows)
+(9 rows)
 
 drop table events_child, events, other_events;
 reset enable_indexonlyscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 699e8ac09c8..c58beebbd1e 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -641,12 +641,14 @@ insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
 
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
 select * from matest0 order by 1-id;
 explain (verbose, costs off) select min(1-id) from matest0;
 select min(1-id) from matest0;
 reset enable_indexscan;
+reset enable_sort;
 
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
-- 
2.39.5

Alexander Pyhalov

a.pyhalov@postgrespro.ru

9 months ago

In reply to: Andrei Lepikhov (#6)

Re: MergeAppend could consider sorting cheapest child path

Andrei Lepikhov wrote on 2025-04-24 16:01:

On 3/28/25 09:19, Alexander Pyhalov wrote:

Andy Fan wrote on 2024-10-17 03:33:
I've updated the patch. One more interesting case we found: when a
fractional path is selected, it can still be more expensive than the
sorted cheapest total path (as we look only at indexes with the necessary
pathkeys, not at indexes that allow fetching data efficiently).
So far we couldn't find an artificial example, but we've seen inadequate
index selection due to this issue: instead of the index suited to the
filters in WHERE, the index suitable for sorting was selected as the one
with the cheapest fractional cost.

I think it is necessary to generalise the approach a little.

Each MergeAppend subpath candidate that fits the pathkeys should be
compared to the overall-optimal path + explicit Sort node. Let's label
this two-node composition as base_path. There are three cases:
startup-optimal, total-optimal and fractional-optimal.
In the startup case, base_path should use cheapest_startup_path; in the
total-optimal case, cheapest_total_path; and in the fractional case, we
may employ the get_cheapest_fractional_path routine to find the proper
base_path.

It may provide a more effective plan in the full, fractional and
partial scan cases:
1. The Sort node may be pushed down to subpaths under a parallel or
async Append.
2. When a small subset of subpaths doesn't have a proper index, it may be
profitable to sort them instead of switching to plain Append.

In general, analysing the regression tests changed by this code, I see
that the cost model prefers a series of small sorts over a single
massive one. Maybe it will show some benefit in execution time.

I am not afraid of any palpable planning overhead here because we just
do cheap cost estimation and comparison operations that don't need
additional memory allocations. The caller is responsible for building a
proper Sort node if this method is chosen as optimal.

In the attachment, see the patch written according to this idea. There
I introduced two new routines:
get_cheapest_path_for_pathkeys_ext
get_cheapest_fractional_path_for_pathkeys_ext

Hi. I'm a bit confused that
get_cheapest_fractional_path_for_pathkeys_ext() looks only at sorting
the cheapest fractional path, and get_cheapest_path_for_pathkeys_ext() in
the STARTUP_COST case looks only at sorting cheapest_startup_path.
Usually, the sorted cheapest_total_path will be cheaper than a sorted
fractional/startup path, at least by startup cost (as after sorting it
includes the total_cost of the input path). But we ignore this case when
selecting the cheapest_startup and cheapest_fractional subpaths. As a
result, the selected cheapest_startup and cheapest_fractional may not be
the cheapest for startup or for selecting a fraction of rows.

Consider the partition with the following access paths:

1) cheapest_startup without required pathkeys:
startup_cost: 0.42
total_cost: 4004

2) some index_path with required pathkeys:
startup_cost: 6.6
total_cost: 2000

3) cheapest_total_path:
startup_cost: 0.42
total_cost: 3.48

Here, when selecting the cheapest startup subpath, we'll compare the
costs of the index path (2) and the sorted cheapest_startup (1), but will
ignore the sorted cheapest_total_path (3), despite the fact that it can
really be the cheapest startup path providing the required sort order.
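
To make the arithmetic concrete, here is a minimal sketch in plain C (not
planner code) of the startup-cost comparison for the three paths above.
ToyPath, ordered_startup_cost, cheapest_ordered_startup and the
sort_overhead parameter are hypothetical names introduced only for this
illustration. The only planner fact assumed is that an explicit sort
cannot return its first row before reading all of its input, so its
startup cost is at least the input's total_cost.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the planner's Path; only the fields used below. */
typedef struct ToyPath
{
    const char *name;
    double      startup_cost;
    double      total_cost;
    int         sorted;     /* nonzero if the path already has the pathkeys */
} ToyPath;

/*
 * Startup cost of a path once it delivers rows in the required order.
 * An already-ordered path keeps its own startup cost. An unordered path
 * needs an explicit sort, whose startup cost is the input's total cost
 * plus the (assumed, caller-supplied) sort overhead.
 */
static double
ordered_startup_cost(const ToyPath *path, double sort_overhead)
{
    assert(sort_overhead >= 0.0);
    if (path->sorted)
        return path->startup_cost;
    return path->total_cost + sort_overhead;
}

/* Index of the candidate with the lowest ordered startup cost. */
static size_t
cheapest_ordered_startup(const ToyPath *paths, size_t npaths,
                         double sort_overhead)
{
    size_t      best = 0;

    for (size_t i = 1; i < npaths; i++)
    {
        if (ordered_startup_cost(&paths[i], sort_overhead) <
            ordered_startup_cost(&paths[best], sort_overhead))
            best = i;
    }
    return best;
}

/* The three access paths from the example above. */
static const ToyPath example_paths[] = {
    {"cheapest_startup (no pathkeys)", 0.42, 4004.0, 0},
    {"index_path (with pathkeys)", 6.6, 2000.0, 1},
    {"cheapest_total_path (no pathkeys)", 0.42, 3.48, 0},
};
```

With an assumed sort overhead of 1.0, cheapest_ordered_startup() picks
index 2, the sorted cheapest_total_path (3.48 + 1.0 = 4.48 against 6.6
for the index path). With an overhead of 10.0 the index path wins again,
so the outcome depends on the real cost_sort() estimate.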

--
Best regards,
Alexander Pyhalov,
Postgres Professional

Andrei Lepikhov

lepihov@gmail.com

9 months ago

In reply to: Alexander Pyhalov (#7)

Re: MergeAppend could consider sorting cheapest child path

On 4/25/25 11:16, Alexander Pyhalov wrote:

Andrei Lepikhov wrote on 2025-04-24 16:01:

On 3/28/25 09:19, Alexander Pyhalov wrote:
In the attachment, see the patch written according to this idea. There
I introduced two new routines:
get_cheapest_path_for_pathkeys_ext
get_cheapest_fractional_path_for_pathkeys_ext

Hi. I'm a bit confused that

Thanks for the participation!

get_cheapest_fractional_path_for_pathkeys_ext() looks only at sorting
the cheapest fractional path, and get_cheapest_path_for_pathkeys_ext() in
the STARTUP_COST case looks only at sorting cheapest_startup_path.

First, at the moment, I don't understand why we calculate the
cheapest_startup path at all. In my opinion, after commit 6b94e7a [1,
2], the min-fractional path totally covers the case. I began this
discussion in [3]; maybe we need to scrutinise that issue beforehand.

Looking at min-fractional-path + Sort, we propose a path for the
case when extracting a minor part of the tuples and sorting them later is
cheaper than doing the massive job of a non-selective index scan. You may
also imagine the case of a JOIN as a subpath: a non-sorted, highly
selective parameterised NestLoop may be way more optimal than a MergeJoin
that fits the pathkeys.

Usually, the sorted cheapest_total_path will be cheaper than a sorted
fractional/startup path, at least by startup cost (as after sorting it
includes the total_cost of the input path). But we ignore this case when
selecting the cheapest_startup and cheapest_fractional subpaths. As a
result, the selected cheapest_startup and cheapest_fractional may not be
the cheapest for startup or for selecting a fraction of rows.

I don't know what you mean by that. The cheapest_total_path is
considered when we choose the optimal cheapest_total path. The same works
for the fractional path: get_cheapest_fractional_path gives us the most
optimal fractional path and probes cheapest_total_path too.
As above, I'm not sure about the min-startup case for now. I can imagine
MergeAppend over a sophisticated subquery: the non-sorted alternative
includes highly parameterised JOINs and the alternative with pathkeys
includes a HashJoin, drastically increasing startup cost. It is only a
theory, of course. So, let's find out how min-startup works.

In the end, once the sorted path has been estimated, we always compare
it with the previously selected pathkeys-cheapest path. So, if the sorted
path is worse, we end up with the original path and don't lose anything.
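
The "compare and fall back" rule can be sketched in standalone C as
follows. This is only an illustration under stated assumptions, not the
patch itself: ToyPath, ToyCostSelector, toy_sorted_cost and
toy_choose_subpath are hypothetical names, the sort overheads are passed
in by the caller instead of being computed by cost_sort(), and the
tie-breaking and disabled-node handling of the real compare_path_costs()
are left out.

```c
#include <assert.h>
#include <stddef.h>

typedef enum
{
    TOY_STARTUP_COST,
    TOY_TOTAL_COST
} ToyCostSelector;

/* Toy stand-in for the planner's Path; costs only. */
typedef struct ToyPath
{
    double      startup_cost;
    double      total_cost;
} ToyPath;

/*
 * Estimated costs of "input + explicit Sort". The sort emits nothing
 * until it has consumed its whole input, hence startup >= input total.
 */
static ToyPath
toy_sorted_cost(const ToyPath *input, double sort_startup_overhead,
                double sort_run_cost)
{
    ToyPath     sorted;

    sorted.startup_cost = input->total_cost + sort_startup_overhead;
    sorted.total_cost = sorted.startup_cost + sort_run_cost;
    return sorted;
}

static double
toy_cost(const ToyPath *path, ToyCostSelector criterion)
{
    return (criterion == TOY_STARTUP_COST) ?
        path->startup_cost : path->total_cost;
}

/*
 * Return ordered_path unless sorting base_path is strictly cheaper under
 * the given criterion. ordered_path may be NULL (no path has the
 * pathkeys), in which case base_path is the only option. The caller is
 * expected to put the Sort on top if base_path is returned.
 */
static const ToyPath *
toy_choose_subpath(const ToyPath *ordered_path, const ToyPath *base_path,
                   double sort_startup_overhead, double sort_run_cost,
                   ToyCostSelector criterion)
{
    ToyPath     sorted;

    assert(base_path != NULL);
    if (ordered_path == NULL)
        return base_path;

    sorted = toy_sorted_cost(base_path, sort_startup_overhead,
                             sort_run_cost);
    if (toy_cost(&sorted, criterion) < toy_cost(ordered_path, criterion))
        return base_path;
    return ordered_path;
}
```

Because the ordered path is returned whenever the sorted alternative is
not strictly cheaper, the extra check can only replace a subpath with a
cheaper one under the chosen criterion.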

[1]: /messages/by-id/e8f9ec90-546d-e948-acce-0525f3e92773@enterprisedb.com
[2]: /messages/by-id/1581042da8044e71ada2d6e3a51bf7bb@index.de
[3]: /messages/by-id/f0206ef2-6b5a-4d07-8770-cfa7cd30f685@gmail.com

--
regards, Andrei Lepikhov

Andrei Lepikhov

lepihov@gmail.com

9 months ago

In reply to: Andrei Lepikhov (#8)

2 attachment(s)

Re: MergeAppend could consider sorting cheapest child path

On 4/25/25 17:13, Andrei Lepikhov wrote:

On 4/25/25 11:16, Alexander Pyhalov wrote:

Usually, the sorted cheapest_total_path will be cheaper than a sorted
fractional/startup path, at least by startup cost (as after sorting it
includes the total_cost of the input path). But we ignore this case when
selecting the cheapest_startup and cheapest_fractional subpaths. As a
result, the selected cheapest_startup and cheapest_fractional may not be
the cheapest for startup or for selecting a fraction of rows.

I don't know what you mean by that. The cheapest_total_path is
considered when we choose the optimal cheapest_total path. The same works
for the fractional path: get_cheapest_fractional_path gives us the most
optimal fractional path and probes cheapest_total_path too.
As above, I'm not sure about the min-startup case for now. I can imagine
MergeAppend over a sophisticated subquery: the non-sorted alternative
includes highly parameterised JOINs and the alternative with pathkeys
includes a HashJoin, drastically increasing startup cost. It is only a
theory, of course. So, let's find out how min-startup works.

On second thought, I have caught your idea. I agree that for a
fractional path it makes no sense to sort any path other than the
cheapest total one.
The modified patch is in the attachment.

Also, to be more objective, I propose using examples in the
argumentation, something like the attached test2.sql script.
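
As a small worked illustration of why only the cheapest total path is
worth sorting in the fractional case, here is a sketch in standalone C.
It assumes the usual linear interpolation for fractional cost,
startup + fraction * (total - startup), for a fraction strictly between
0 and 1. toy_fractional_cost and toy_sorted_fractional_cost are
hypothetical helper names, and the sort overheads are assumed inputs
rather than cost_sort() results.

```c
#include <assert.h>

/*
 * Cost of fetching the given fraction of a path's rows, interpolated
 * linearly between startup and total cost.
 */
static double
toy_fractional_cost(double startup_cost, double total_cost, double fraction)
{
    assert(fraction > 0.0 && fraction < 1.0);
    return startup_cost + fraction * (total_cost - startup_cost);
}

/*
 * Fractional cost of "input + explicit Sort". Note that the input's
 * startup cost does not appear at all: the sort must consume the whole
 * input first, so only the input's total cost matters.
 */
static double
toy_sorted_fractional_cost(double input_total_cost,
                           double sort_startup_overhead,
                           double sort_run_cost,
                           double fraction)
{
    double      startup = input_total_cost + sort_startup_overhead;
    double      total = startup + sort_run_cost;

    return toy_fractional_cost(startup, total, fraction);
}
```

For paths of the same relation producing the same number of rows, the
sort overheads are identical, so the sorted fractional cost grows
monotonically with the input's total cost. Using the numbers from the
earlier example with assumed overheads of 1.0 and 0.5 and a fraction of
0.1, sorting the cheapest total path costs about 4.53, while the index
path with pathkeys costs 6.6 + 0.1 * (2000 - 6.6), roughly 205.9.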

--
regards, Andrei Lepikhov

Attachments:

v1-0001-Consider-explicit-sort-of-the-MergeAppend-subpath.patch (text/x-patch; charset=UTF-8)

From efdea920593e232b9732e8f5288a6fbffd4ce4ce Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Thu, 24 Apr 2025 14:03:02 +0200
Subject: [PATCH v1] Consider explicit sort of the MergeAppend subpaths.

Expand the optimiser search scope a little: when fetching the optimal subpath
matching the pathkeys of the MergeAppend being planned, consider the extra
case of the overall-optimal path plus an explicit Sort node at the top.

It may provide a more effective plan in both full and fractional scan cases:
1. The Sort node may be pushed down to subpaths under a parallel or
async Append.
2. The case when a minor set of subpaths doesn't have a proper index, and it
is profitable to sort them instead of switching to plain Append.

In general, analysing the regression tests changed by this code, it seems that
the cost model prefers a series of small sorts over a single massive
one. This feature slightly increases the number of such paths.

Overhead:
Multiple subpaths may be encountered, as well as many pathkeys.
So, to be as careful as possible here, only cost estimation is performed.
---
 .../postgres_fdw/expected/postgres_fdw.out    |   6 +-
 src/backend/optimizer/path/allpaths.c         |  36 ++---
 src/backend/optimizer/path/pathkeys.c         |  97 ++++++++++++
 src/include/optimizer/paths.h                 |  10 ++
 .../regress/expected/collate.icu.utf8.out     |   9 +-
 src/test/regress/expected/inherit.out         |  19 ++-
 src/test/regress/expected/partition_join.out  | 139 +++++++++++-------
 src/test/regress/expected/partition_prune.out |  16 +-
 src/test/regress/expected/union.out           |   6 +-
 src/test/regress/sql/inherit.sql              |   4 +-
 10 files changed, 241 insertions(+), 101 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index d1acee5a5fa..11b42f18cb6 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10277,13 +10277,15 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
    ->  Nested Loop
          Join Filter: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t1_1.a
+                     ->  Foreign Scan on ftprt1_p1 t1_1
                ->  Foreign Scan on ftprt1_p2 t1_2
          ->  Materialize
                ->  Append
                      ->  Foreign Scan on ftprt2_p1 t2_1
                      ->  Foreign Scan on ftprt2_p2 t2_2
-(10 rows)
+(12 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
   a  |  b  
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905250b3325..0ffb13dffec 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1855,29 +1855,14 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 
 			/* Locate the right paths, if they are available. */
 			cheapest_startup =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   STARTUP_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, STARTUP_COST, false);
 			cheapest_total =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   TOTAL_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, TOTAL_COST, false);
 
-			/*
-			 * If we can't find any paths with the right order just use the
-			 * cheapest-total path; we'll have to sort it later.
-			 */
-			if (cheapest_startup == NULL || cheapest_total == NULL)
-			{
-				cheapest_startup = cheapest_total =
-					childrel->cheapest_total_path;
-				/* Assert we do have an unparameterized path for this child */
-				Assert(cheapest_total->param_info == NULL);
-			}
+			Assert(cheapest_startup != NULL && cheapest_total != NULL);
+			Assert(cheapest_total->param_info == NULL);
 
 			/*
 			 * When building a fractional path, determine a cheapest
@@ -1894,10 +1879,11 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 				double		path_fraction = (1.0 / root->tuple_fraction);
 
 				cheapest_fractional =
-					get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
-															  pathkeys,
-															  NULL,
-															  path_fraction);
+					get_cheapest_fractional_path_for_pathkeys_ext(root,
+																  childrel,
+																  pathkeys,
+																  NULL,
+																  path_fraction);
 
 				/*
 				 * If we found no path with matching pathkeys, use the
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 8b04d40d36d..a39059c5bc6 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -19,11 +19,13 @@
 
 #include "access/stratnum.h"
 #include "catalog/pg_opfamily.h"
+#include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "optimizer/planner.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -648,6 +650,58 @@ get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_path_for_pathkeys_ext
+ *	  Calls get_cheapest_path_for_pathkeys to obtain the cheapest path that
+ *	  satisfies the given criteria and considers one more option: choose the
+ *	  overall-optimal path (according to the criterion) and explicitly sort
+ *	  its output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_path_for_pathkeys_ext(PlannerInfo *root, RelOptInfo *rel,
+								   List *pathkeys, Relids required_outer,
+								   CostSelector cost_criterion,
+								   bool require_parallel_safe)
+{
+	Path	sort_path;
+	Path   *base_path = rel->cheapest_total_path;
+	Path   *path;
+
+	path = get_cheapest_path_for_pathkeys(rel->pathlist, pathkeys,
+										  required_outer, cost_criterion,
+										  require_parallel_safe);
+
+
+	if (path == NULL)
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return base_path;
+
+	/* Stop here if the path doesn't satisfy necessary conditions */
+	if ((require_parallel_safe && !base_path->parallel_safe) ||
+		!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	/* Consider the cheapest-total path with an extra sort on top */
+	if (base_path && path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (compare_path_costs(&sort_path, path, cost_criterion) < 0)
+			return base_path;
+	}
+	return path;
+}
+
 /*
  * get_cheapest_fractional_path_for_pathkeys
  *	  Find the cheapest path (for retrieving a specified fraction of all
@@ -690,6 +744,49 @@ get_cheapest_fractional_path_for_pathkeys(List *paths,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_fractional_path_for_pathkeys_ext
+ *	  Obtain the cheapest fractional path that satisfies the given criteria,
+ *	  excluding pathkeys, and explicitly sort its output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  List *pathkeys,
+											  Relids required_outer,
+											  double fraction)
+{
+	Path			sort_path;
+	Path		   *base_path;
+	Path		   *path;
+
+	path = get_cheapest_fractional_path_for_pathkeys(rel->pathlist, pathkeys,
+													 required_outer, fraction);
+
+	base_path = rel->cheapest_total_path;
+
+	/* Stop here if the path doesn't satisfy necessary conditions */
+	if (!base_path || !bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (!path ||
+			compare_fractional_path_costs(&sort_path, path, fraction) <= 0)
+			return base_path;
+	}
+
+	return path;
+}
 
 /*
  * get_cheapest_parallel_safe_total_inner
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index a48c9721797..4bdc85afca9 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -224,10 +224,20 @@ extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
 											bool require_parallel_safe);
+extern Path *get_cheapest_path_for_pathkeys_ext(PlannerInfo *root,
+												RelOptInfo *rel, List *pathkeys,
+												Relids required_outer,
+												CostSelector cost_criterion,
+												bool require_parallel_safe);
 extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 													   List *pathkeys,
 													   Relids required_outer,
 													   double fraction);
+extern Path *get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+														   RelOptInfo *rel,
+														   List *pathkeys,
+														   Relids required_outer,
+														   double fraction);
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 								  ScanDirection scandir);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 69805d4b9ec..48d47bb7455 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -2412,16 +2412,19 @@ EXPLAIN (COSTS OFF)
 SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" ORDER BY 1;
                        QUERY PLAN                       
 --------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: ((pagg_tab3.c)::text) COLLATE "C"
-   ->  Append
+   ->  Sort
+         Sort Key: ((pagg_tab3.c)::text) COLLATE "C"
          ->  HashAggregate
                Group Key: (pagg_tab3.c)::text
                ->  Seq Scan on pagg_tab3_p2 pagg_tab3
+   ->  Sort
+         Sort Key: ((pagg_tab3_1.c)::text) COLLATE "C"
          ->  HashAggregate
                Group Key: (pagg_tab3_1.c)::text
                ->  Seq Scan on pagg_tab3_p1 pagg_tab3_1
-(9 rows)
+(12 rows)
 
 SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" ORDER BY 1;
  c | count 
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index f9b0c415cfd..71036dc938f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1665,11 +1665,13 @@ insert into matest2 (name) values ('Test 3');
 insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
                          QUERY PLAN                         
 ------------------------------------------------------------
  Sort
+   Disabled: true
    Output: matest0.id, matest0.name, ((1 - matest0.id))
    Sort Key: ((1 - matest0.id))
    ->  Result
@@ -1683,7 +1685,7 @@ explain (verbose, costs off) select * from matest0 order by 1-id;
                      Output: matest0_3.id, matest0_3.name
                ->  Seq Scan on public.matest3 matest0_4
                      Output: matest0_4.id, matest0_4.name
-(14 rows)
+(15 rows)
 
 select * from matest0 order by 1-id;
  id |  name  
@@ -1719,6 +1721,7 @@ select min(1-id) from matest0;
 (1 row)
 
 reset enable_indexscan;
+reset enable_sort;
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
 explain (verbose, costs off) select * from matest0 order by 1-id;
@@ -1844,16 +1847,20 @@ order by t1.b limit 10;
          Merge Cond: (t1.b = t2.b)
          ->  Merge Append
                Sort Key: t1.b
-               ->  Index Scan using matest0i on matest0 t1_1
+               ->  Sort
+                     Sort Key: t1_1.b
+                     ->  Seq Scan on matest0 t1_1
                ->  Index Scan using matest1i on matest1 t1_2
          ->  Materialize
                ->  Merge Append
                      Sort Key: t2.b
-                     ->  Index Scan using matest0i on matest0 t2_1
-                           Filter: (c = d)
+                     ->  Sort
+                           Sort Key: t2_1.b
+                           ->  Seq Scan on matest0 t2_1
+                                 Filter: (c = d)
                      ->  Index Scan using matest1i on matest1 t2_2
                            Filter: (c = d)
-(14 rows)
+(18 rows)
 
 reset enable_nestloop;
 drop table matest0 cascade;
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 6101c8c7cf1..d41979367c9 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -65,31 +65,34 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
 -- inner join with partially-redundant join clauses
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
-                          QUERY PLAN                           
----------------------------------------------------------------
- Sort
+                       QUERY PLAN                        
+---------------------------------------------------------
+ Merge Append
    Sort Key: t1.a
-   ->  Append
-         ->  Merge Join
-               Merge Cond: (t1_1.a = t2_1.a)
-               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
-               ->  Sort
-                     Sort Key: t2_1.b
-                     ->  Seq Scan on prt2_p1 t2_1
-                           Filter: (a = b)
+   ->  Merge Join
+         Merge Cond: (t1_1.a = t2_1.a)
+         ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t2_1.b
+               ->  Seq Scan on prt2_p1 t2_1
+                     Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Join
                Hash Cond: (t1_2.a = t2_2.a)
                ->  Seq Scan on prt1_p2 t1_2
                ->  Hash
                      ->  Seq Scan on prt2_p2 t2_2
                            Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Join
                Hash Cond: (t1_3.a = t2_3.a)
                ->  Seq Scan on prt1_p3 t1_3
                ->  Hash
                      ->  Seq Scan on prt2_p3 t2_3
                            Filter: (a = b)
-(22 rows)
+(25 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
  a  |  c   | b  |  c   
@@ -1383,28 +1386,32 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
 -- This should generate a partitionwise join, but currently fails to
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-                        QUERY PLAN                         
------------------------------------------------------------
- Incremental Sort
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Sort
    Sort Key: prt1.a, prt2.b
-   Presorted Key: prt1.a
-   ->  Merge Left Join
-         Merge Cond: (prt1.a = prt2.b)
-         ->  Sort
-               Sort Key: prt1.a
-               ->  Append
-                     ->  Seq Scan on prt1_p1 prt1_1
-                           Filter: ((a < 450) AND (b = 0))
-                     ->  Seq Scan on prt1_p2 prt1_2
-                           Filter: ((a < 450) AND (b = 0))
-         ->  Sort
-               Sort Key: prt2.b
-               ->  Append
+   ->  Merge Right Join
+         Merge Cond: (prt2.b = prt1.a)
+         ->  Append
+               ->  Sort
+                     Sort Key: prt2_1.b
                      ->  Seq Scan on prt2_p2 prt2_1
                            Filter: (b > 250)
+               ->  Sort
+                     Sort Key: prt2_2.b
                      ->  Seq Scan on prt2_p3 prt2_2
                            Filter: (b > 250)
-(19 rows)
+         ->  Materialize
+               ->  Append
+                     ->  Sort
+                           Sort Key: prt1_1.a
+                           ->  Seq Scan on prt1_p1 prt1_1
+                                 Filter: ((a < 450) AND (b = 0))
+                     ->  Sort
+                           Sort Key: prt1_2.a
+                           ->  Seq Scan on prt1_p2 prt1_2
+                                 Filter: ((a < 450) AND (b = 0))
+(23 rows)
 
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
   a  |  b  
@@ -1424,25 +1431,33 @@ SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT *
 -- partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
-                                       QUERY PLAN                                        
------------------------------------------------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Merge Join
-   Merge Cond: ((t1.a = t2.b) AND (((((t1.*)::prt1))::text) = ((((t2.*)::prt2))::text)))
-   ->  Sort
-         Sort Key: t1.a, ((((t1.*)::prt1))::text)
-         ->  Result
-               ->  Append
-                     ->  Seq Scan on prt1_p1 t1_1
-                     ->  Seq Scan on prt1_p2 t1_2
-                     ->  Seq Scan on prt1_p3 t1_3
-   ->  Sort
-         Sort Key: t2.b, ((((t2.*)::prt2))::text)
-         ->  Result
-               ->  Append
+   Merge Cond: (t1.a = t2.b)
+   Join Filter: ((((t2.*)::prt2))::text = (((t1.*)::prt1))::text)
+   ->  Append
+         ->  Sort
+               Sort Key: t1_1.a
+               ->  Seq Scan on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t1_2.a
+               ->  Seq Scan on prt1_p2 t1_2
+         ->  Sort
+               Sort Key: t1_3.a
+               ->  Seq Scan on prt1_p3 t1_3
+   ->  Materialize
+         ->  Append
+               ->  Sort
+                     Sort Key: t2_1.b
                      ->  Seq Scan on prt2_p1 t2_1
+               ->  Sort
+                     Sort Key: t2_2.b
                      ->  Seq Scan on prt2_p2 t2_2
+               ->  Sort
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3
-(16 rows)
+(24 rows)
 
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
  a  | b  
@@ -4508,9 +4523,10 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
                                    QUERY PLAN                                   
 --------------------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a
          ->  Hash Right Join
                Hash Cond: ((t3_1.a = t1_1.a) AND (t3_1.c = t1_1.c))
                ->  Seq Scan on plt1_adv_p1 t3_1
@@ -4521,6 +4537,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p1 t1_1
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Right Join
                Hash Cond: ((t3_2.a = t1_2.a) AND (t3_2.c = t1_2.c))
                ->  Seq Scan on plt1_adv_p2 t3_2
@@ -4531,6 +4549,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p2 t1_2
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Right Join
                Hash Cond: ((t3_3.a = t1_3.a) AND (t3_3.c = t1_3.c))
                ->  Seq Scan on plt1_adv_p3 t3_3
@@ -4541,15 +4561,19 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p3 t1_3
                                        Filter: (b < 10)
-         ->  Nested Loop Left Join
-               Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
-               ->  Nested Loop Left Join
-                     Join Filter: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+   ->  Nested Loop Left Join
+         Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
+         ->  Merge Left Join
+               Merge Cond: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+               ->  Sort
+                     Sort Key: t1_4.a, t1_4.c
                      ->  Seq Scan on plt1_adv_extra t1_4
                            Filter: (b < 10)
+               ->  Sort
+                     Sort Key: t2_4.a, t2_4.c
                      ->  Seq Scan on plt2_adv_extra t2_4
-               ->  Seq Scan on plt1_adv_extra t3_4
-(41 rows)
+         ->  Seq Scan on plt1_adv_extra t3_4
+(50 rows)
 
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
  a  |  c   | a |  c   | a |  c   
@@ -5037,21 +5061,26 @@ EXPLAIN (COSTS OFF)
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
                              QUERY PLAN                             
 --------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a, t1.b
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a, t1_1.b
          ->  Hash Join
                Hash Cond: ((t1_1.a = t2_1.a) AND (t1_1.b = t2_1.b))
                ->  Seq Scan on alpha_neg_p1 t1_1
                      Filter: ((b >= 125) AND (b < 225))
                ->  Hash
                      ->  Seq Scan on beta_neg_p1 t2_1
+   ->  Sort
+         Sort Key: t1_2.a, t1_2.b
          ->  Hash Join
                Hash Cond: ((t2_2.a = t1_2.a) AND (t2_2.b = t1_2.b))
                ->  Seq Scan on beta_neg_p2 t2_2
                ->  Hash
                      ->  Seq Scan on alpha_neg_p2 t1_2
                            Filter: ((b >= 125) AND (b < 225))
+   ->  Sort
+         Sort Key: t1_4.a, t1_4.b
          ->  Hash Join
                Hash Cond: ((t2_4.a = t1_4.a) AND (t2_4.b = t1_4.b))
                ->  Append
@@ -5066,7 +5095,7 @@ SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2
                                  Filter: ((b >= 125) AND (b < 225))
                            ->  Seq Scan on alpha_pos_p3 t1_6
                                  Filter: ((b >= 125) AND (b < 225))
-(29 rows)
+(34 rows)
 
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
  a  |  b  |  c   | a  |  b  |  c   
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 0bf35260b46..b25aa73e946 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -4763,9 +4763,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc.a ORDER BY part_abc.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_1
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d <= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_1.a
+                           ->  Seq Scan on part_abc_2 part_abc_1
+                                 Filter: ((d <= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: part_abc_3.a
                            Subplans Removed: 1
@@ -4780,9 +4781,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc_5.a ORDER BY part_abc_5.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_6
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d >= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_6.a
+                           ->  Seq Scan on part_abc_2 part_abc_6
+                                 Filter: ((d >= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: a
                            Subplans Removed: 1
@@ -4792,7 +4794,7 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                            ->  Index Scan using part_abc_3_3_a_idx on part_abc_3_3 part_abc_9
                                  Index Cond: (a >= (stable_one() + 1))
                                  Filter: (d >= stable_one())
-(35 rows)
+(37 rows)
 
 drop view part_abc_view;
 drop table part_abc;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 96962817ed4..0ccadea910c 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1207,12 +1207,14 @@ select event_id
 ----------------------------------------------------------
  Merge Append
    Sort Key: events.event_id
-   ->  Index Scan using events_pkey on events
+   ->  Sort
+         Sort Key: events.event_id
+         ->  Seq Scan on events
    ->  Sort
          Sort Key: events_1.event_id
          ->  Seq Scan on events_child events_1
    ->  Index Scan using other_events_pkey on other_events
-(7 rows)
+(9 rows)
 
 drop table events_child, events, other_events;
 reset enable_indexonlyscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 699e8ac09c8..c58beebbd1e 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -641,12 +641,14 @@ insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
 
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
 select * from matest0 order by 1-id;
 explain (verbose, costs off) select min(1-id) from matest0;
 select min(1-id) from matest0;
 reset enable_indexscan;
+reset enable_sort;
 
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
-- 
2.39.5

test2.sql (application/sql)

#10

Alexander Pyhalov

a.pyhalov@postgrespro.ru

9 months ago

In reply to: Andrei Lepikhov (#9)

Re: MergeAppend could consider sorting cheapest child path

Andrei Lepikhov wrote on 2025-04-29 16:52:

On 4/25/25 17:13, Andrei Lepikhov wrote:

On 4/25/25 11:16, Alexander Pyhalov wrote:

Usually, a sorted cheapest_total_path will be cheaper than a sorted
fractional/startup path, at least by startup cost (since after sorting,
the startup cost includes the total_cost of the input path). But we
ignore this case when selecting the cheapest_startup and
cheapest_fractional subpaths. As a result, the selected cheapest_startup
and cheapest_fractional paths may not be the cheapest for startup or for
selecting a fraction of rows.

I don't know what you mean by that. The cheapest_total_path is
considered when we choose the optimal cheapest_total path. The same works
for the fractional path: get_cheapest_fractional_path gives us the
optimal fractional path and probes cheapest_total_path too.
As above, I'm not sure about the min-startup case for now. I can imagine
a MergeAppend over a sophisticated subquery: the non-sorted variant
includes highly parameterised JOINs and the alternative (with pathkeys)
includes a HashJoin, drastically increasing startup cost. It is only a
theory, of course. So, let's discover how min-startup works.

On second thought, I have caught your idea. I agree that for a
fractional path it makes no sense to choose any path other than the
cheapest total one.
The modified patch is in the attachment.

Also, to be more objective, I propose using examples in the
argumentation, something like the attached test2.sql script.
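
The startup-cost argument quoted above can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyCost, toy_fractional_cost() and toy_sort_over() are hypothetical names, not planner functions, and the sort overheads are caller-supplied numbers rather than the output of the real cost model. It shows that an explicit sort moves the input's total cost into its own startup cost, and that a fractional cost is interpolated linearly between startup and total cost.

```c
#include <assert.h>

/* Hypothetical stand-in for a planner path: only the two cost fields. */
typedef struct ToyCost
{
	double		startup;	/* cost before the first row is returned */
	double		total;		/* cost to return all rows */
} ToyCost;

/*
 * Cost of fetching a fraction (0 < fraction < 1) of the rows, using linear
 * interpolation between startup and total cost.
 */
static double
toy_fractional_cost(ToyCost c, double fraction)
{
	assert(fraction > 0.0 && fraction < 1.0);
	return c.startup + fraction * (c.total - c.startup);
}

/*
 * Cost of an explicit sort on top of an input.  A sort has to read all of
 * its input before it can emit the first row, so the input's total cost
 * ends up in the sort's startup cost.  sort_work and emit_work are assumed
 * overheads supplied by the caller.
 */
static ToyCost
toy_sort_over(ToyCost input, double sort_work, double emit_work)
{
	ToyCost		result;

	result.startup = input.total + sort_work;
	result.total = result.startup + emit_work;
	return result;
}
```

With made-up numbers, a sequential scan costing {0, 100} under a sort with overheads 20 and 5 becomes {120, 125}. Against a presorted index path costing {1, 500}, the sorted scan loses on startup cost but wins at fraction 0.5 (122.5 versus 250.5), which is the kind of case the discussion is about.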

Hi.
I've looked through the new patch and found minor inconsistencies in
get_cheapest_path_for_pathkeys_ext() and
get_cheapest_fractional_path_for_pathkeys_ext().

In get_cheapest_fractional_path_for_pathkeys_ext() we check that
base_path is not NULL:

    path = get_cheapest_fractional_path_for_pathkeys(rel->pathlist, pathkeys,
                                                     required_outer, fraction);

    base_path = rel->cheapest_total_path;

    /* Stop here if the path doesn't satisfy necessary conditions */
    if (!base_path || !bms_is_subset(PATH_REQ_OUTER(base_path),
                                     required_outer))
        return path;

But it seems base_path can't be NULL (as add_paths_to_append_rel() is
called after set_rel_pathlist() for childrels).
However, path can be. Can we make these two functions,
get_cheapest_path_for_pathkeys_ext() and
get_cheapest_fractional_path_for_pathkeys_ext(),
more similar?

Also, we check base_path for required_outer and require_parallel_safe,
but if the cheapest path for pathkeys is NULL, these checks are not
performed. Luckily, they seem to be no-ops anyway, because
cheapest_total->param_info == NULL and the function arguments are NULL
(required_outer) and false (require_parallel_safe). Should we do
something about this? I don't know; perhaps remove these misleading
arguments?
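
One way to make the two routines consistent can be sketched as follows. This is a hypothetical, self-contained illustration, not code from the patch: ToyPath and toy_choose_subpath() are invented names, the parameterized flag stands in for a non-empty PATH_REQ_OUTER() under the assumption that required_outer is empty, and sort_cost is assumed to be precomputed by the caller. The point is only the order of the checks: base_path is validated first, regardless of whether a presorted path exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for Path; only what the selection logic needs. */
typedef struct ToyPath
{
	double		cost;			/* cost under the chosen criterion */
	bool		parallel_safe;
	bool		parameterized;	/* stands in for non-empty PATH_REQ_OUTER */
} ToyPath;

/*
 * sorted_path: cheapest path already ordered by the pathkeys, or NULL.
 * base_path:   cheapest total path of the relation, never NULL.
 * sort_cost:   cost of base_path plus an explicit sort, same criterion.
 */
static const ToyPath *
toy_choose_subpath(const ToyPath *sorted_path, const ToyPath *base_path,
				   double sort_cost, bool require_parallel_safe)
{
	assert(base_path != NULL);

	/* Validate base_path first, whether or not a presorted path exists. */
	if ((require_parallel_safe && !base_path->parallel_safe) ||
		base_path->parameterized)
		return sorted_path;		/* may be NULL: nothing usable */

	/* No presorted alternative, or it is the very same path. */
	if (sorted_path == NULL || sorted_path == base_path)
		return base_path;

	/* Otherwise pick whichever is cheaper under the criterion. */
	return (sort_cost < sorted_path->cost) ? base_path : sorted_path;
}
```

In this shape the function returns NULL only when base_path fails the checks and no presorted path exists, so a caller can tell "nothing usable" apart from "use the cheapest total path and sort it".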

Now, if get_cheapest_fractional_path_for_pathkeys_ext() returns
cheapest_total_path when the cheapest paths for pathkeys don't exist, do
the following lines

    /*
     * If we found no path with matching pathkeys, use the
     * cheapest total path instead.
     *
     * XXX We might consider partially sorted paths too (with an
     * incremental sort on top). But we'd have to build all the
     * incremental paths, do the costing etc.
     */
    if (!cheapest_fractional)
        cheapest_fractional = cheapest_total;

become a no-op? We do return a non-NULL path from
get_cheapest_fractional_path_for_pathkeys_ext(), as it seems we return
either cheapest_total_path or the cheapest fractional path.

--
Best regards,
Alexander Pyhalov,
Postgres Professional

#11

Andrei Lepikhov

lepihov@gmail.com

8 months ago

In reply to: Alexander Pyhalov (#10)

1 attachment(s)

Re: MergeAppend could consider sorting cheapest child path

On 4/29/25 19:25, Alexander Pyhalov wrote:

Andrei Lepikhov wrote on 2025-04-29 16:52:
But it seems, base_path can't be NULL

Correct. Fixed.

Also we check base_path for required_outer and require_parallel_safe,
but if cheapest path for pathkeys is NULL, these checks are not
performed.

Thanks. Fixed.

Luckily, they seem to be no-ops anyway due to
cheapest_total->param_info == NULL and function arguments being NULL
(required_outer) and false (require_parallel_safe). Should we do
something about this? Don't know, perhaps, remove these misleading
arguments?

The main reason why I introduced these _ext routines was that this
scheme may be used somewhere else, and in that case parameterised paths
may exist. Who knows, maybe parameterised paths will be introduced for
merge append too?

become no-op? And we do return non-null path from
get_cheapest_fractional_path_for_pathkeys_ext(), as it seems we return
either cheapest_total_path or cheapest fractional path from
get_cheapest_fractional_path_for_pathkeys_ext().

Ok.

Let's check the next version of the patch in the attachment.

--
regards, Andrei Lepikhov

Attachments:

v2-0001-Consider-explicit-sort-of-the-MergeAppend-subpath.patch (text/x-patch; charset=UTF-8)

From 45c55f1d15e173d86ecdfbba8ff86b5fc26ad0f6 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Thu, 24 Apr 2025 14:03:02 +0200
Subject: [PATCH v2] Consider explicit sort of the MergeAppend subpaths.

Expand the optimiser search scope a little: fetching optimal subpath matching
pathkeys of the planning MergeAppend, consider the extra case of
overall-optimal path plus explicit Sort node at the top.

It may provide a more effective plan in both full and fractional scan cases:
1. The Sort node may be pushed down to subpaths under a parallel or
async Append.
2. The case when a minor set of subpaths doesn't have a proper index, and it
is profitable to sort them instead of switching to plain Append.

In general, analysing the regression tests changed by this code, it seems that
the cost model prefers a series of small sortings instead of a single massive
one. This feature increases a little the number of such paths.

Overhead:
It seems multiple subpaths may be encountered, as well as many pathkeys.
So, to be as careful as possible here, only cost estimation is performed.
---
 .../postgres_fdw/expected/postgres_fdw.out    |   6 +-
 src/backend/optimizer/path/allpaths.c         |  47 ++----
 src/backend/optimizer/path/pathkeys.c         |  97 ++++++++++++
 src/include/optimizer/paths.h                 |  10 ++
 .../regress/expected/collate.icu.utf8.out     |   9 +-
 src/test/regress/expected/inherit.out         |  19 ++-
 src/test/regress/expected/partition_join.out  | 139 +++++++++++-------
 src/test/regress/expected/partition_prune.out |  16 +-
 src/test/regress/expected/union.out           |   6 +-
 src/test/regress/sql/inherit.sql              |   4 +-
 10 files changed, 246 insertions(+), 107 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 24ff5f70cc..8e9c0c4e3b 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10288,13 +10288,15 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
    ->  Nested Loop
          Join Filter: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t1_1.a
+                     ->  Foreign Scan on ftprt1_p1 t1_1
                ->  Foreign Scan on ftprt1_p2 t1_2
          ->  Materialize
                ->  Append
                      ->  Foreign Scan on ftprt2_p1 t2_1
                      ->  Foreign Scan on ftprt2_p2 t2_2
-(10 rows)
+(12 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
   a  |  b  
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905250b332..a825225668 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1855,29 +1855,18 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 
 			/* Locate the right paths, if they are available. */
 			cheapest_startup =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   STARTUP_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, STARTUP_COST, false);
 			cheapest_total =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   TOTAL_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, TOTAL_COST, false);
 
 			/*
-			 * If we can't find any paths with the right order just use the
-			 * cheapest-total path; we'll have to sort it later.
+			 * In accordance to current planning logic there are no
+			 * parameterised paths under a merge append.
 			 */
-			if (cheapest_startup == NULL || cheapest_total == NULL)
-			{
-				cheapest_startup = cheapest_total =
-					childrel->cheapest_total_path;
-				/* Assert we do have an unparameterized path for this child */
-				Assert(cheapest_total->param_info == NULL);
-			}
+			Assert(cheapest_startup != NULL && cheapest_total != NULL);
+			Assert(cheapest_total->param_info == NULL);
 
 			/*
 			 * When building a fractional path, determine a cheapest
@@ -1894,21 +1883,17 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 				double		path_fraction = (1.0 / root->tuple_fraction);
 
 				cheapest_fractional =
-					get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
-															  pathkeys,
-															  NULL,
-															  path_fraction);
+					get_cheapest_fractional_path_for_pathkeys_ext(root,
+																  childrel,
+																  pathkeys,
+																  NULL,
+																  path_fraction);
 
 				/*
-				 * If we found no path with matching pathkeys, use the
-				 * cheapest total path instead.
-				 *
-				 * XXX We might consider partially sorted paths too (with an
-				 * incremental sort on top). But we'd have to build all the
-				 * incremental paths, do the costing etc.
+				 * In accordance to current planning logic there are no
+				 * parameterised fractional paths under a merge append.
 				 */
-				if (!cheapest_fractional)
-					cheapest_fractional = cheapest_total;
+				Assert(cheapest_fractional != NULL);
 			}
 
 			/*
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 8b04d40d36..bcf1c1abda 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -19,11 +19,13 @@
 
 #include "access/stratnum.h"
 #include "catalog/pg_opfamily.h"
+#include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "optimizer/planner.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -648,6 +650,58 @@ get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_path_for_pathkeys_ext
+ *	  Calls get_cheapest_path_for_pathkeys to obtain cheapest path that
+ *	  satisfies the defined criteria and considers one more option: choose
+ *	  overall-optimal path (according to the criterion) and explicitly sort its
+ *	  output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_path_for_pathkeys_ext(PlannerInfo *root, RelOptInfo *rel,
+								   List *pathkeys, Relids required_outer,
+								   CostSelector cost_criterion,
+								   bool require_parallel_safe)
+{
+	Path	sort_path;
+	Path   *base_path = rel->cheapest_total_path;
+	Path   *path;
+
+	path = get_cheapest_path_for_pathkeys(rel->pathlist, pathkeys,
+										  required_outer, cost_criterion,
+										  require_parallel_safe);
+
+
+	if (path == NULL)
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return base_path;
+
+	/* Stop here if the path doesn't satisfy necessary conditions */
+	if ((require_parallel_safe && !base_path->parallel_safe) ||
+		!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	/* Consider the most startup-optimal path with extra sort */
+	if (base_path && path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (compare_path_costs(&sort_path, path, cost_criterion) < 0)
+			return base_path;
+	}
+	return path;
+}
+
 /*
  * get_cheapest_fractional_path_for_pathkeys
  *	  Find the cheapest path (for retrieving a specified fraction of all
@@ -690,6 +744,49 @@ get_cheapest_fractional_path_for_pathkeys(List *paths,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_fractional_path_for_pathkeys_ext
+ *	  obtain cheapest fractional path that satisfies defined criterias excluding
+ *	  pathkeys and explicitly sort its output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  List *pathkeys,
+											  Relids required_outer,
+											  double fraction)
+{
+	Path			sort_path;
+	Path		   *base_path;
+	Path		   *path;
+
+	path = get_cheapest_fractional_path_for_pathkeys(rel->pathlist, pathkeys,
+													 required_outer, fraction);
+
+	base_path = rel->cheapest_total_path;
+
+	/* Stop here if the path doesn't satisfy necessary conditions */
+	if (!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (!path ||
+			compare_fractional_path_costs(&sort_path, path, fraction) <= 0)
+			return base_path;
+	}
+
+	return path;
+}
 
 /*
  * get_cheapest_parallel_safe_total_inner
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index a48c972179..4bdc85afca 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -224,10 +224,20 @@ extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
 											bool require_parallel_safe);
+extern Path *get_cheapest_path_for_pathkeys_ext(PlannerInfo *root,
+												RelOptInfo *rel, List *pathkeys,
+												Relids required_outer,
+												CostSelector cost_criterion,
+												bool require_parallel_safe);
 extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 													   List *pathkeys,
 													   Relids required_outer,
 													   double fraction);
+extern Path *get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+														   RelOptInfo *rel,
+														   List *pathkeys,
+														   Relids required_outer,
+														   double fraction);
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 								  ScanDirection scandir);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 69805d4b9e..48d47bb745 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -2412,16 +2412,19 @@ EXPLAIN (COSTS OFF)
 SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" ORDER BY 1;
                        QUERY PLAN                       
 --------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: ((pagg_tab3.c)::text) COLLATE "C"
-   ->  Append
+   ->  Sort
+         Sort Key: ((pagg_tab3.c)::text) COLLATE "C"
          ->  HashAggregate
                Group Key: (pagg_tab3.c)::text
                ->  Seq Scan on pagg_tab3_p2 pagg_tab3
+   ->  Sort
+         Sort Key: ((pagg_tab3_1.c)::text) COLLATE "C"
          ->  HashAggregate
                Group Key: (pagg_tab3_1.c)::text
                ->  Seq Scan on pagg_tab3_p1 pagg_tab3_1
-(9 rows)
+(12 rows)
 
 SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" ORDER BY 1;
  c | count 
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index f9b0c415cf..71036dc938 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1665,11 +1665,13 @@ insert into matest2 (name) values ('Test 3');
 insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
                          QUERY PLAN                         
 ------------------------------------------------------------
  Sort
+   Disabled: true
    Output: matest0.id, matest0.name, ((1 - matest0.id))
    Sort Key: ((1 - matest0.id))
    ->  Result
@@ -1683,7 +1685,7 @@ explain (verbose, costs off) select * from matest0 order by 1-id;
                      Output: matest0_3.id, matest0_3.name
                ->  Seq Scan on public.matest3 matest0_4
                      Output: matest0_4.id, matest0_4.name
-(14 rows)
+(15 rows)
 
 select * from matest0 order by 1-id;
  id |  name  
@@ -1719,6 +1721,7 @@ select min(1-id) from matest0;
 (1 row)
 
 reset enable_indexscan;
+reset enable_sort;
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
 explain (verbose, costs off) select * from matest0 order by 1-id;
@@ -1844,16 +1847,20 @@ order by t1.b limit 10;
          Merge Cond: (t1.b = t2.b)
          ->  Merge Append
                Sort Key: t1.b
-               ->  Index Scan using matest0i on matest0 t1_1
+               ->  Sort
+                     Sort Key: t1_1.b
+                     ->  Seq Scan on matest0 t1_1
                ->  Index Scan using matest1i on matest1 t1_2
          ->  Materialize
                ->  Merge Append
                      Sort Key: t2.b
-                     ->  Index Scan using matest0i on matest0 t2_1
-                           Filter: (c = d)
+                     ->  Sort
+                           Sort Key: t2_1.b
+                           ->  Seq Scan on matest0 t2_1
+                                 Filter: (c = d)
                      ->  Index Scan using matest1i on matest1 t2_2
                            Filter: (c = d)
-(14 rows)
+(18 rows)
 
 reset enable_nestloop;
 drop table matest0 cascade;
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 6101c8c7cf..d41979367c 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -65,31 +65,34 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
 -- inner join with partially-redundant join clauses
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
-                          QUERY PLAN                           
----------------------------------------------------------------
- Sort
+                       QUERY PLAN                        
+---------------------------------------------------------
+ Merge Append
    Sort Key: t1.a
-   ->  Append
-         ->  Merge Join
-               Merge Cond: (t1_1.a = t2_1.a)
-               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
-               ->  Sort
-                     Sort Key: t2_1.b
-                     ->  Seq Scan on prt2_p1 t2_1
-                           Filter: (a = b)
+   ->  Merge Join
+         Merge Cond: (t1_1.a = t2_1.a)
+         ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t2_1.b
+               ->  Seq Scan on prt2_p1 t2_1
+                     Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Join
                Hash Cond: (t1_2.a = t2_2.a)
                ->  Seq Scan on prt1_p2 t1_2
                ->  Hash
                      ->  Seq Scan on prt2_p2 t2_2
                            Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Join
                Hash Cond: (t1_3.a = t2_3.a)
                ->  Seq Scan on prt1_p3 t1_3
                ->  Hash
                      ->  Seq Scan on prt2_p3 t2_3
                            Filter: (a = b)
-(22 rows)
+(25 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
  a  |  c   | b  |  c   
@@ -1383,28 +1386,32 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
 -- This should generate a partitionwise join, but currently fails to
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-                        QUERY PLAN                         
------------------------------------------------------------
- Incremental Sort
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Sort
    Sort Key: prt1.a, prt2.b
-   Presorted Key: prt1.a
-   ->  Merge Left Join
-         Merge Cond: (prt1.a = prt2.b)
-         ->  Sort
-               Sort Key: prt1.a
-               ->  Append
-                     ->  Seq Scan on prt1_p1 prt1_1
-                           Filter: ((a < 450) AND (b = 0))
-                     ->  Seq Scan on prt1_p2 prt1_2
-                           Filter: ((a < 450) AND (b = 0))
-         ->  Sort
-               Sort Key: prt2.b
-               ->  Append
+   ->  Merge Right Join
+         Merge Cond: (prt2.b = prt1.a)
+         ->  Append
+               ->  Sort
+                     Sort Key: prt2_1.b
                      ->  Seq Scan on prt2_p2 prt2_1
                            Filter: (b > 250)
+               ->  Sort
+                     Sort Key: prt2_2.b
                      ->  Seq Scan on prt2_p3 prt2_2
                            Filter: (b > 250)
-(19 rows)
+         ->  Materialize
+               ->  Append
+                     ->  Sort
+                           Sort Key: prt1_1.a
+                           ->  Seq Scan on prt1_p1 prt1_1
+                                 Filter: ((a < 450) AND (b = 0))
+                     ->  Sort
+                           Sort Key: prt1_2.a
+                           ->  Seq Scan on prt1_p2 prt1_2
+                                 Filter: ((a < 450) AND (b = 0))
+(23 rows)
 
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
   a  |  b  
@@ -1424,25 +1431,33 @@ SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT *
 -- partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
-                                       QUERY PLAN                                        
------------------------------------------------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Merge Join
-   Merge Cond: ((t1.a = t2.b) AND (((((t1.*)::prt1))::text) = ((((t2.*)::prt2))::text)))
-   ->  Sort
-         Sort Key: t1.a, ((((t1.*)::prt1))::text)
-         ->  Result
-               ->  Append
-                     ->  Seq Scan on prt1_p1 t1_1
-                     ->  Seq Scan on prt1_p2 t1_2
-                     ->  Seq Scan on prt1_p3 t1_3
-   ->  Sort
-         Sort Key: t2.b, ((((t2.*)::prt2))::text)
-         ->  Result
-               ->  Append
+   Merge Cond: (t1.a = t2.b)
+   Join Filter: ((((t2.*)::prt2))::text = (((t1.*)::prt1))::text)
+   ->  Append
+         ->  Sort
+               Sort Key: t1_1.a
+               ->  Seq Scan on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t1_2.a
+               ->  Seq Scan on prt1_p2 t1_2
+         ->  Sort
+               Sort Key: t1_3.a
+               ->  Seq Scan on prt1_p3 t1_3
+   ->  Materialize
+         ->  Append
+               ->  Sort
+                     Sort Key: t2_1.b
                      ->  Seq Scan on prt2_p1 t2_1
+               ->  Sort
+                     Sort Key: t2_2.b
                      ->  Seq Scan on prt2_p2 t2_2
+               ->  Sort
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3
-(16 rows)
+(24 rows)
 
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
  a  | b  
@@ -4508,9 +4523,10 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
                                    QUERY PLAN                                   
 --------------------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a
          ->  Hash Right Join
                Hash Cond: ((t3_1.a = t1_1.a) AND (t3_1.c = t1_1.c))
                ->  Seq Scan on plt1_adv_p1 t3_1
@@ -4521,6 +4537,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p1 t1_1
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Right Join
                Hash Cond: ((t3_2.a = t1_2.a) AND (t3_2.c = t1_2.c))
                ->  Seq Scan on plt1_adv_p2 t3_2
@@ -4531,6 +4549,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p2 t1_2
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Right Join
                Hash Cond: ((t3_3.a = t1_3.a) AND (t3_3.c = t1_3.c))
                ->  Seq Scan on plt1_adv_p3 t3_3
@@ -4541,15 +4561,19 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p3 t1_3
                                        Filter: (b < 10)
-         ->  Nested Loop Left Join
-               Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
-               ->  Nested Loop Left Join
-                     Join Filter: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+   ->  Nested Loop Left Join
+         Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
+         ->  Merge Left Join
+               Merge Cond: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+               ->  Sort
+                     Sort Key: t1_4.a, t1_4.c
                      ->  Seq Scan on plt1_adv_extra t1_4
                            Filter: (b < 10)
+               ->  Sort
+                     Sort Key: t2_4.a, t2_4.c
                      ->  Seq Scan on plt2_adv_extra t2_4
-               ->  Seq Scan on plt1_adv_extra t3_4
-(41 rows)
+         ->  Seq Scan on plt1_adv_extra t3_4
+(50 rows)
 
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
  a  |  c   | a |  c   | a |  c   
@@ -5037,21 +5061,26 @@ EXPLAIN (COSTS OFF)
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
                              QUERY PLAN                             
 --------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a, t1.b
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a, t1_1.b
          ->  Hash Join
                Hash Cond: ((t1_1.a = t2_1.a) AND (t1_1.b = t2_1.b))
                ->  Seq Scan on alpha_neg_p1 t1_1
                      Filter: ((b >= 125) AND (b < 225))
                ->  Hash
                      ->  Seq Scan on beta_neg_p1 t2_1
+   ->  Sort
+         Sort Key: t1_2.a, t1_2.b
          ->  Hash Join
                Hash Cond: ((t2_2.a = t1_2.a) AND (t2_2.b = t1_2.b))
                ->  Seq Scan on beta_neg_p2 t2_2
                ->  Hash
                      ->  Seq Scan on alpha_neg_p2 t1_2
                            Filter: ((b >= 125) AND (b < 225))
+   ->  Sort
+         Sort Key: t1_4.a, t1_4.b
          ->  Hash Join
                Hash Cond: ((t2_4.a = t1_4.a) AND (t2_4.b = t1_4.b))
                ->  Append
@@ -5066,7 +5095,7 @@ SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2
                                  Filter: ((b >= 125) AND (b < 225))
                            ->  Seq Scan on alpha_pos_p3 t1_6
                                  Filter: ((b >= 125) AND (b < 225))
-(29 rows)
+(34 rows)
 
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
  a  |  b  |  c   | a  |  b  |  c   
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 0bf35260b4..b25aa73e94 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -4763,9 +4763,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc.a ORDER BY part_abc.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_1
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d <= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_1.a
+                           ->  Seq Scan on part_abc_2 part_abc_1
+                                 Filter: ((d <= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: part_abc_3.a
                            Subplans Removed: 1
@@ -4780,9 +4781,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc_5.a ORDER BY part_abc_5.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_6
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d >= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_6.a
+                           ->  Seq Scan on part_abc_2 part_abc_6
+                                 Filter: ((d >= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: a
                            Subplans Removed: 1
@@ -4792,7 +4794,7 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                            ->  Index Scan using part_abc_3_3_a_idx on part_abc_3_3 part_abc_9
                                  Index Cond: (a >= (stable_one() + 1))
                                  Filter: (d >= stable_one())
-(35 rows)
+(37 rows)
 
 drop view part_abc_view;
 drop table part_abc;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 96962817ed..0ccadea910 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1207,12 +1207,14 @@ select event_id
 ----------------------------------------------------------
  Merge Append
    Sort Key: events.event_id
-   ->  Index Scan using events_pkey on events
+   ->  Sort
+         Sort Key: events.event_id
+         ->  Seq Scan on events
    ->  Sort
          Sort Key: events_1.event_id
          ->  Seq Scan on events_child events_1
    ->  Index Scan using other_events_pkey on other_events
-(7 rows)
+(9 rows)
 
 drop table events_child, events, other_events;
 reset enable_indexonlyscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 699e8ac09c..c58beebbd1 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -641,12 +641,14 @@ insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
 
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
 select * from matest0 order by 1-id;
 explain (verbose, costs off) select min(1-id) from matest0;
 select min(1-id) from matest0;
 reset enable_indexscan;
+reset enable_sort;
 
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
-- 
2.39.5

#12

Andrei Lepikhov

lepihov@gmail.com

8 months ago

In reply to: Andrei Lepikhov (#11)

1 attachment(s)

Re: MergeAppend could consider sorting cheapest child path

On 5/5/25 13:38, Andrei Lepikhov wrote:

> Let's check next version of the patch in the attachment.

Oops, I forgot some tails - see this new version.

--
regards, Andrei Lepikhov

Attachments:

v3-0001-Consider-explicit-sort-of-the-MergeAppend-subpath.patch (text/x-patch; charset=UTF-8)

From 7b312c38bf9f2370c970dc0058d38610cb1ae6c6 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Thu, 24 Apr 2025 14:03:02 +0200
Subject: [PATCH v3] Consider explicit sort of the MergeAppend subpaths.

Expand the optimiser's search scope a little: when fetching the optimal
subpath that matches the pathkeys of the MergeAppend being planned, also
consider the overall-optimal path with an explicit Sort node on top.

This may provide a more efficient plan in both full and fractional scan cases:
1. The Sort node may be pushed down to subpaths under a parallel or
async Append.
2. A minority of subpaths lack a suitable index, and sorting just those
subpaths is cheaper than switching to a plain Append.

Judging by the regression tests changed by this code, the cost model generally
prefers a series of small sorts to a single massive one. This feature slightly
increases the number of such paths.

Overhead:
There may be many subpaths as well as many pathkeys. So, to keep the overhead
as low as possible, only cost estimation is performed here.
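
A rough, self-contained sketch of that cost-only comparison follows. It is an
illustration under simplifying assumptions, not the patch's code: SketchPath,
sketch_sort_cost and sketch_choose_subpath are hypothetical names, the sort
cost is a toy 2 * cpu_operator_cost * N * log2(N) estimate rather than
cost_sort(), and only total cost is compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a planner path; NOT PostgreSQL's Path struct. */
typedef struct SketchPath
{
	double		total_cost;		/* cost to fetch all rows */
	double		rows;			/* estimated row count */
} SketchPath;

/* Coarse floor(log2(n)) so the sketch needs no libm. */
double
sketch_log2(double n)
{
	double		result = 0.0;

	while (n >= 2.0)
	{
		n /= 2.0;
		result += 1.0;
	}
	return result;
}

/* Toy sort estimate (an assumption): about 2 * N * log2(N) comparisons. */
double
sketch_sort_cost(double rows, double cpu_operator_cost)
{
	return 2.0 * cpu_operator_cost * rows * sketch_log2(rows);
}

/*
 * Pick the child path to place under the merge: either the cheapest path
 * that is already ordered ("ordered", may be NULL) or the overall cheapest
 * path with an explicit sort on top. Only costs are compared; *needs_sort
 * tells the caller whether it must add the sort itself.
 */
const SketchPath *
sketch_choose_subpath(const SketchPath *ordered,
					  const SketchPath *cheapest_total,
					  double cpu_operator_cost,
					  bool *needs_sort)
{
	double		sorted_total;

	assert(cheapest_total != NULL);

	/* The cheapest path is already ordered: nothing to compare. */
	if (ordered == cheapest_total)
	{
		*needs_sort = false;
		return ordered;
	}

	/* No ordered path exists: sorting the cheapest path is the only option. */
	if (ordered == NULL)
	{
		*needs_sort = true;
		return cheapest_total;
	}

	sorted_total = cheapest_total->total_cost +
		sketch_sort_cost(cheapest_total->rows, cpu_operator_cost);

	*needs_sort = (sorted_total < ordered->total_cost);
	return *needs_sort ? cheapest_total : ordered;
}
```

For example, with cpu_operator_cost = 0.0025 and a 1024-row child whose
cheapest path costs 100, the toy sort estimate is 51.2. Against an ordered
index path costing 1000 the sketch picks the cheapest path and asks for a
sort; against an ordered path costing 120 it keeps the ordered path.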
---
 .../postgres_fdw/expected/postgres_fdw.out    |   6 +-
 src/backend/optimizer/path/allpaths.c         |  47 ++----
 src/backend/optimizer/path/pathkeys.c         |  96 ++++++++++++
 src/include/optimizer/paths.h                 |  10 ++
 .../regress/expected/collate.icu.utf8.out     |   9 +-
 src/test/regress/expected/inherit.out         |  19 ++-
 src/test/regress/expected/partition_join.out  | 139 +++++++++++-------
 src/test/regress/expected/partition_prune.out |  16 +-
 src/test/regress/expected/union.out           |   6 +-
 src/test/regress/sql/inherit.sql              |   4 +-
 10 files changed, 245 insertions(+), 107 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 24ff5f70cc..8e9c0c4e3b 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10288,13 +10288,15 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
    ->  Nested Loop
          Join Filter: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t1_1.a
+                     ->  Foreign Scan on ftprt1_p1 t1_1
                ->  Foreign Scan on ftprt1_p2 t1_2
          ->  Materialize
                ->  Append
                      ->  Foreign Scan on ftprt2_p1 t2_1
                      ->  Foreign Scan on ftprt2_p2 t2_2
-(10 rows)
+(12 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
   a  |  b  
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905250b332..a825225668 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1855,29 +1855,18 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 
 			/* Locate the right paths, if they are available. */
 			cheapest_startup =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   STARTUP_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, STARTUP_COST, false);
 			cheapest_total =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   TOTAL_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, TOTAL_COST, false);
 
 			/*
-			 * If we can't find any paths with the right order just use the
-			 * cheapest-total path; we'll have to sort it later.
+			 * In accordance with the current planning logic, there are no
+			 * parameterised paths under a merge append.
 			 */
-			if (cheapest_startup == NULL || cheapest_total == NULL)
-			{
-				cheapest_startup = cheapest_total =
-					childrel->cheapest_total_path;
-				/* Assert we do have an unparameterized path for this child */
-				Assert(cheapest_total->param_info == NULL);
-			}
+			Assert(cheapest_startup != NULL && cheapest_total != NULL);
+			Assert(cheapest_total->param_info == NULL);
 
 			/*
 			 * When building a fractional path, determine a cheapest
@@ -1894,21 +1883,17 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 				double		path_fraction = (1.0 / root->tuple_fraction);
 
 				cheapest_fractional =
-					get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
-															  pathkeys,
-															  NULL,
-															  path_fraction);
+					get_cheapest_fractional_path_for_pathkeys_ext(root,
+																  childrel,
+																  pathkeys,
+																  NULL,
+																  path_fraction);
 
 				/*
-				 * If we found no path with matching pathkeys, use the
-				 * cheapest total path instead.
-				 *
-				 * XXX We might consider partially sorted paths too (with an
-				 * incremental sort on top). But we'd have to build all the
-				 * incremental paths, do the costing etc.
+				 * In accordance with the current planning logic, there are no
+				 * parameterised fractional paths under a merge append.
 				 */
-				if (!cheapest_fractional)
-					cheapest_fractional = cheapest_total;
+				Assert(cheapest_fractional != NULL);
 			}
 
 			/*
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 8b04d40d36..d4237fd143 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -19,11 +19,13 @@
 
 #include "access/stratnum.h"
 #include "catalog/pg_opfamily.h"
+#include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "optimizer/planner.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -648,6 +650,57 @@ get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_path_for_pathkeys_ext
+ *	  Calls get_cheapest_path_for_pathkeys to obtain the cheapest path that
+ *	  satisfies the given criteria, and considers one more option: take the
+ *	  overall-optimal path (according to the criterion) and explicitly sort
+ *	  its output to satisfy the pathkeys.
+ *
+ *	  The caller is responsible for inserting the corresponding sort path on
+ *	  top of the returned path if that path is chosen.
+ *
+ *	  Returns NULL if there is no such path.
+ */
+Path *
+get_cheapest_path_for_pathkeys_ext(PlannerInfo *root, RelOptInfo *rel,
+								   List *pathkeys, Relids required_outer,
+								   CostSelector cost_criterion,
+								   bool require_parallel_safe)
+{
+	Path	sort_path;
+	Path   *base_path = rel->cheapest_total_path;
+	Path   *path;
+
+	path = get_cheapest_path_for_pathkeys(rel->pathlist, pathkeys,
+										  required_outer, cost_criterion,
+										  require_parallel_safe);
+
+	/* Stop here if the path doesn't satisfy necessary conditions */
+	if ((require_parallel_safe && !base_path->parallel_safe) ||
+		!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path == NULL)
+		/*
+		 * No path in the pathlist matches the pathkeys, so there is no
+		 * ordered path to compare an explicit sort against.
+		 */
+		return base_path;
+
+	/* Consider the cheapest-total path with an extra sort on top */
+	if (base_path && path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (compare_path_costs(&sort_path, path, cost_criterion) < 0)
+			return base_path;
+	}
+	return path;
+}
+
 /*
  * get_cheapest_fractional_path_for_pathkeys
  *	  Find the cheapest path (for retrieving a specified fraction of all
@@ -690,6 +743,49 @@ get_cheapest_fractional_path_for_pathkeys(List *paths,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_fractional_path_for_pathkeys_ext
+ *	  Calls get_cheapest_fractional_path_for_pathkeys and also considers the
+ *	  cheapest-total path with its output explicitly sorted by the pathkeys.
+ *
+ *	  The caller is responsible for inserting the corresponding sort path on
+ *	  top of the returned path if that path is chosen.
+ *
+ *	  Returns NULL if there is no such path.
+ */
+Path *
+get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  List *pathkeys,
+											  Relids required_outer,
+											  double fraction)
+{
+	Path			sort_path;
+	Path		   *base_path;
+	Path		   *path;
+
+	path = get_cheapest_fractional_path_for_pathkeys(rel->pathlist, pathkeys,
+													 required_outer, fraction);
+
+	base_path = rel->cheapest_total_path;
+
+	/* Stop here if the path doesn't satisfy necessary conditions */
+	if (!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (!path ||
+			compare_fractional_path_costs(&sort_path, path, fraction) <= 0)
+			return base_path;
+	}
+
+	return path;
+}
 
 /*
  * get_cheapest_parallel_safe_total_inner
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index a48c972179..4bdc85afca 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -224,10 +224,20 @@ extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
 											bool require_parallel_safe);
+extern Path *get_cheapest_path_for_pathkeys_ext(PlannerInfo *root,
+												RelOptInfo *rel, List *pathkeys,
+												Relids required_outer,
+												CostSelector cost_criterion,
+												bool require_parallel_safe);
 extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 													   List *pathkeys,
 													   Relids required_outer,
 													   double fraction);
+extern Path *get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+														   RelOptInfo *rel,
+														   List *pathkeys,
+														   Relids required_outer,
+														   double fraction);
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 								  ScanDirection scandir);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 69805d4b9e..48d47bb745 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -2412,16 +2412,19 @@ EXPLAIN (COSTS OFF)
 SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" ORDER BY 1;
                        QUERY PLAN                       
 --------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: ((pagg_tab3.c)::text) COLLATE "C"
-   ->  Append
+   ->  Sort
+         Sort Key: ((pagg_tab3.c)::text) COLLATE "C"
          ->  HashAggregate
                Group Key: (pagg_tab3.c)::text
                ->  Seq Scan on pagg_tab3_p2 pagg_tab3
+   ->  Sort
+         Sort Key: ((pagg_tab3_1.c)::text) COLLATE "C"
          ->  HashAggregate
                Group Key: (pagg_tab3_1.c)::text
                ->  Seq Scan on pagg_tab3_p1 pagg_tab3_1
-(9 rows)
+(12 rows)
 
 SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" ORDER BY 1;
  c | count 
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index f9b0c415cf..71036dc938 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1665,11 +1665,13 @@ insert into matest2 (name) values ('Test 3');
 insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
                          QUERY PLAN                         
 ------------------------------------------------------------
  Sort
+   Disabled: true
    Output: matest0.id, matest0.name, ((1 - matest0.id))
    Sort Key: ((1 - matest0.id))
    ->  Result
@@ -1683,7 +1685,7 @@ explain (verbose, costs off) select * from matest0 order by 1-id;
                      Output: matest0_3.id, matest0_3.name
                ->  Seq Scan on public.matest3 matest0_4
                      Output: matest0_4.id, matest0_4.name
-(14 rows)
+(15 rows)
 
 select * from matest0 order by 1-id;
  id |  name  
@@ -1719,6 +1721,7 @@ select min(1-id) from matest0;
 (1 row)
 
 reset enable_indexscan;
+reset enable_sort;
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
 explain (verbose, costs off) select * from matest0 order by 1-id;
@@ -1844,16 +1847,20 @@ order by t1.b limit 10;
          Merge Cond: (t1.b = t2.b)
          ->  Merge Append
                Sort Key: t1.b
-               ->  Index Scan using matest0i on matest0 t1_1
+               ->  Sort
+                     Sort Key: t1_1.b
+                     ->  Seq Scan on matest0 t1_1
                ->  Index Scan using matest1i on matest1 t1_2
          ->  Materialize
                ->  Merge Append
                      Sort Key: t2.b
-                     ->  Index Scan using matest0i on matest0 t2_1
-                           Filter: (c = d)
+                     ->  Sort
+                           Sort Key: t2_1.b
+                           ->  Seq Scan on matest0 t2_1
+                                 Filter: (c = d)
                      ->  Index Scan using matest1i on matest1 t2_2
                            Filter: (c = d)
-(14 rows)
+(18 rows)
 
 reset enable_nestloop;
 drop table matest0 cascade;
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 6101c8c7cf..d41979367c 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -65,31 +65,34 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
 -- inner join with partially-redundant join clauses
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
-                          QUERY PLAN                           
----------------------------------------------------------------
- Sort
+                       QUERY PLAN                        
+---------------------------------------------------------
+ Merge Append
    Sort Key: t1.a
-   ->  Append
-         ->  Merge Join
-               Merge Cond: (t1_1.a = t2_1.a)
-               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
-               ->  Sort
-                     Sort Key: t2_1.b
-                     ->  Seq Scan on prt2_p1 t2_1
-                           Filter: (a = b)
+   ->  Merge Join
+         Merge Cond: (t1_1.a = t2_1.a)
+         ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t2_1.b
+               ->  Seq Scan on prt2_p1 t2_1
+                     Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Join
                Hash Cond: (t1_2.a = t2_2.a)
                ->  Seq Scan on prt1_p2 t1_2
                ->  Hash
                      ->  Seq Scan on prt2_p2 t2_2
                            Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Join
                Hash Cond: (t1_3.a = t2_3.a)
                ->  Seq Scan on prt1_p3 t1_3
                ->  Hash
                      ->  Seq Scan on prt2_p3 t2_3
                            Filter: (a = b)
-(22 rows)
+(25 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
  a  |  c   | b  |  c   
@@ -1383,28 +1386,32 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
 -- This should generate a partitionwise join, but currently fails to
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-                        QUERY PLAN                         
------------------------------------------------------------
- Incremental Sort
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Sort
    Sort Key: prt1.a, prt2.b
-   Presorted Key: prt1.a
-   ->  Merge Left Join
-         Merge Cond: (prt1.a = prt2.b)
-         ->  Sort
-               Sort Key: prt1.a
-               ->  Append
-                     ->  Seq Scan on prt1_p1 prt1_1
-                           Filter: ((a < 450) AND (b = 0))
-                     ->  Seq Scan on prt1_p2 prt1_2
-                           Filter: ((a < 450) AND (b = 0))
-         ->  Sort
-               Sort Key: prt2.b
-               ->  Append
+   ->  Merge Right Join
+         Merge Cond: (prt2.b = prt1.a)
+         ->  Append
+               ->  Sort
+                     Sort Key: prt2_1.b
                      ->  Seq Scan on prt2_p2 prt2_1
                            Filter: (b > 250)
+               ->  Sort
+                     Sort Key: prt2_2.b
                      ->  Seq Scan on prt2_p3 prt2_2
                            Filter: (b > 250)
-(19 rows)
+         ->  Materialize
+               ->  Append
+                     ->  Sort
+                           Sort Key: prt1_1.a
+                           ->  Seq Scan on prt1_p1 prt1_1
+                                 Filter: ((a < 450) AND (b = 0))
+                     ->  Sort
+                           Sort Key: prt1_2.a
+                           ->  Seq Scan on prt1_p2 prt1_2
+                                 Filter: ((a < 450) AND (b = 0))
+(23 rows)
 
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
   a  |  b  
@@ -1424,25 +1431,33 @@ SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT *
 -- partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
-                                       QUERY PLAN                                        
------------------------------------------------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Merge Join
-   Merge Cond: ((t1.a = t2.b) AND (((((t1.*)::prt1))::text) = ((((t2.*)::prt2))::text)))
-   ->  Sort
-         Sort Key: t1.a, ((((t1.*)::prt1))::text)
-         ->  Result
-               ->  Append
-                     ->  Seq Scan on prt1_p1 t1_1
-                     ->  Seq Scan on prt1_p2 t1_2
-                     ->  Seq Scan on prt1_p3 t1_3
-   ->  Sort
-         Sort Key: t2.b, ((((t2.*)::prt2))::text)
-         ->  Result
-               ->  Append
+   Merge Cond: (t1.a = t2.b)
+   Join Filter: ((((t2.*)::prt2))::text = (((t1.*)::prt1))::text)
+   ->  Append
+         ->  Sort
+               Sort Key: t1_1.a
+               ->  Seq Scan on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t1_2.a
+               ->  Seq Scan on prt1_p2 t1_2
+         ->  Sort
+               Sort Key: t1_3.a
+               ->  Seq Scan on prt1_p3 t1_3
+   ->  Materialize
+         ->  Append
+               ->  Sort
+                     Sort Key: t2_1.b
                      ->  Seq Scan on prt2_p1 t2_1
+               ->  Sort
+                     Sort Key: t2_2.b
                      ->  Seq Scan on prt2_p2 t2_2
+               ->  Sort
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3
-(16 rows)
+(24 rows)
 
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
  a  | b  
@@ -4508,9 +4523,10 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
                                    QUERY PLAN                                   
 --------------------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a
          ->  Hash Right Join
                Hash Cond: ((t3_1.a = t1_1.a) AND (t3_1.c = t1_1.c))
                ->  Seq Scan on plt1_adv_p1 t3_1
@@ -4521,6 +4537,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p1 t1_1
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Right Join
                Hash Cond: ((t3_2.a = t1_2.a) AND (t3_2.c = t1_2.c))
                ->  Seq Scan on plt1_adv_p2 t3_2
@@ -4531,6 +4549,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p2 t1_2
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Right Join
                Hash Cond: ((t3_3.a = t1_3.a) AND (t3_3.c = t1_3.c))
                ->  Seq Scan on plt1_adv_p3 t3_3
@@ -4541,15 +4561,19 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p3 t1_3
                                        Filter: (b < 10)
-         ->  Nested Loop Left Join
-               Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
-               ->  Nested Loop Left Join
-                     Join Filter: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+   ->  Nested Loop Left Join
+         Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
+         ->  Merge Left Join
+               Merge Cond: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+               ->  Sort
+                     Sort Key: t1_4.a, t1_4.c
                      ->  Seq Scan on plt1_adv_extra t1_4
                            Filter: (b < 10)
+               ->  Sort
+                     Sort Key: t2_4.a, t2_4.c
                      ->  Seq Scan on plt2_adv_extra t2_4
-               ->  Seq Scan on plt1_adv_extra t3_4
-(41 rows)
+         ->  Seq Scan on plt1_adv_extra t3_4
+(50 rows)
 
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
  a  |  c   | a |  c   | a |  c   
@@ -5037,21 +5061,26 @@ EXPLAIN (COSTS OFF)
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
                              QUERY PLAN                             
 --------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a, t1.b
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a, t1_1.b
          ->  Hash Join
                Hash Cond: ((t1_1.a = t2_1.a) AND (t1_1.b = t2_1.b))
                ->  Seq Scan on alpha_neg_p1 t1_1
                      Filter: ((b >= 125) AND (b < 225))
                ->  Hash
                      ->  Seq Scan on beta_neg_p1 t2_1
+   ->  Sort
+         Sort Key: t1_2.a, t1_2.b
          ->  Hash Join
                Hash Cond: ((t2_2.a = t1_2.a) AND (t2_2.b = t1_2.b))
                ->  Seq Scan on beta_neg_p2 t2_2
                ->  Hash
                      ->  Seq Scan on alpha_neg_p2 t1_2
                            Filter: ((b >= 125) AND (b < 225))
+   ->  Sort
+         Sort Key: t1_4.a, t1_4.b
          ->  Hash Join
                Hash Cond: ((t2_4.a = t1_4.a) AND (t2_4.b = t1_4.b))
                ->  Append
@@ -5066,7 +5095,7 @@ SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2
                                  Filter: ((b >= 125) AND (b < 225))
                            ->  Seq Scan on alpha_pos_p3 t1_6
                                  Filter: ((b >= 125) AND (b < 225))
-(29 rows)
+(34 rows)
 
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
  a  |  b  |  c   | a  |  b  |  c   
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 0bf35260b4..b25aa73e94 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -4763,9 +4763,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc.a ORDER BY part_abc.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_1
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d <= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_1.a
+                           ->  Seq Scan on part_abc_2 part_abc_1
+                                 Filter: ((d <= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: part_abc_3.a
                            Subplans Removed: 1
@@ -4780,9 +4781,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc_5.a ORDER BY part_abc_5.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_6
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d >= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_6.a
+                           ->  Seq Scan on part_abc_2 part_abc_6
+                                 Filter: ((d >= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: a
                            Subplans Removed: 1
@@ -4792,7 +4794,7 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                            ->  Index Scan using part_abc_3_3_a_idx on part_abc_3_3 part_abc_9
                                  Index Cond: (a >= (stable_one() + 1))
                                  Filter: (d >= stable_one())
-(35 rows)
+(37 rows)
 
 drop view part_abc_view;
 drop table part_abc;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 96962817ed..0ccadea910 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1207,12 +1207,14 @@ select event_id
 ----------------------------------------------------------
  Merge Append
    Sort Key: events.event_id
-   ->  Index Scan using events_pkey on events
+   ->  Sort
+         Sort Key: events.event_id
+         ->  Seq Scan on events
    ->  Sort
          Sort Key: events_1.event_id
          ->  Seq Scan on events_child events_1
    ->  Index Scan using other_events_pkey on other_events
-(7 rows)
+(9 rows)
 
 drop table events_child, events, other_events;
 reset enable_indexonlyscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 699e8ac09c..c58beebbd1 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -641,12 +641,14 @@ insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
 
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
 select * from matest0 order by 1-id;
 explain (verbose, costs off) select min(1-id) from matest0;
 select min(1-id) from matest0;
 reset enable_indexscan;
+reset enable_sort;
 
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
-- 
2.39.5

#13

Alexander Pyhalov

a.pyhalov@postgrespro.ru

8 months ago

In reply to: Andrei Lepikhov (#11)

1 attachment(s)

Re: MergeAppend could consider sorting cheapest child path

Andrei Lepikhov wrote on 2025-05-05 14:38:

On 4/29/25 19:25, Alexander Pyhalov wrote:

Andrei Lepikhov wrote on 2025-04-29 16:52:
But it seems, base_path can't be NULL

Correct. Fixed.

Also we check base_path for required_outer and require_parallel_safe,
but if cheapest path for pathkeys is NULL, these checks are not
performed.

Thanks. Fixed.

Luckily, they seem to be no-ops anyway due to
cheapest_total->param_info == NULL and function arguments being NULL
(required_outer) and false (require_parallel_safe). Should we do
something about this?
Don't know, perhaps, remove these misleading arguments?

The main reason I introduced these _ext routines is that this scheme
may be used somewhere else, and in that case parameterised paths may
exist. Who knows, maybe parameterised paths will be introduced for
merge append too?

become no-op? And we do return non-null path from
get_cheapest_fractional_path_for_pathkeys_ext(), as it seems we return
either cheapest_total_path or cheapest fractional path from
get_cheapest_fractional_path_for_pathkeys_ext().

Ok.

Let's check next version of the patch in the attachment.

Hi.

Both functions are a bit different:

    Path *base_path = rel->cheapest_total_path;
    Path *path;

    path = get_cheapest_path_for_pathkeys(rel->pathlist, pathkeys,
                                          required_outer, cost_criterion,
                                          require_parallel_safe);

    /* Stop here if the path doesn't satisfy necessary conditions */
    if ((require_parallel_safe && !base_path->parallel_safe) ||
        !bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
        return path;

Here the comment speaks about "the path", but the check is performed on
base_path. Could that be misleading?

In get_cheapest_fractional_path_for_pathkeys_ext():

    path = get_cheapest_fractional_path_for_pathkeys(rel->pathlist, pathkeys,
                                                     required_outer, fraction);

    base_path = rel->cheapest_total_path;

    /* Stop here if the path doesn't satisfy necessary conditions */
    if (!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
        return path;

Here "the path" also refers to base_path, but at least it's the last
path we've just looked at. Should we make the base_path initialization
consistent and adjust the comment a bit, so that there's no ambiguity
and it's evident that it refers to base_path?

Also, the logic differs slightly when path is NULL. In
get_cheapest_path_for_pathkeys_ext() we explicitly check for path being
NULL; in get_cheapest_fractional_path_for_pathkeys_ext() we do so only
after calculating the sort cost.
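
To make the check order under discussion concrete, here is a minimal,
self-contained sketch. It is not the patch's code: SketchPath,
choose_subpath(), the parameterized flag (a stand-in for a non-empty
PATH_REQ_OUTER() when required_outer is NULL) and the sort_cost argument
(a stand-in for the estimate that cost_sort() would produce) are all
hypothetical names introduced for illustration. It only compares total
costs and assumes base_path is never NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PostgreSQL's Path; only the fields used here. */
typedef struct SketchPath
{
	double		total_cost;		/* cost to fetch all rows */
	bool		parallel_safe;
	bool		parameterized;	/* assumption: models a non-empty required-outer set */
} SketchPath;

/*
 * sorted_path: cheapest path already matching the pathkeys, or NULL.
 * base_path:   cheapest total path of the relation; must not be NULL.
 * sort_cost:   assumed extra cost of explicitly sorting base_path's output.
 */
static const SketchPath *
choose_subpath(const SketchPath *sorted_path, const SketchPath *base_path,
			   double sort_cost, bool require_parallel_safe)
{
	assert(base_path != NULL);

	/* 1. base_path fails the necessary conditions: keep sorted_path as is. */
	if ((require_parallel_safe && !base_path->parallel_safe) ||
		base_path->parameterized)
		return sorted_path;

	/* 2. Nothing matches the pathkeys: the caller has to sort base_path. */
	if (sorted_path == NULL)
		return base_path;

	/* 3. Both exist: prefer base_path only if sorting it is strictly cheaper. */
	if (sorted_path != base_path &&
		base_path->total_cost + sort_cost < sorted_path->total_cost)
		return base_path;

	return sorted_path;
}
```

With this ordering the NULL check happens before any sort costing in
both variants, which is the unification suggested above. As in the
patch, the caller still has to put a Sort on top of the returned path
when it does not already provide the requested order.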

I've tried to fix the comments a bit and unified the function definitions.
--
Best regards,
Alexander Pyhalov,
Postgres Professional

Attachments:

v4-0001-Consider-explicit-sort-of-the-MergeAppend-subpath.patch (text/x-diff)

From 36e23eab381c383680c9a803418cb98bc3ae912f Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Mon, 5 May 2025 16:23:51 +0300
Subject: [PATCH] Consider explicit sort of the MergeAppend subpaths.

Expand the optimiser's search scope a little: when fetching the optimal
subpath matching the pathkeys of the MergeAppend being planned, also consider
the overall-optimal path with an explicit Sort node on top.

This may provide a more effective plan in both full and fractional scan cases:
1. The Sort node may be pushed down to subpaths under a parallel or
async Append.
2. When a minority of subpaths lack a suitable index, it can be
profitable to sort them instead of switching to a plain Append.

Judging by the regression tests changed by this code, the cost model
generally prefers a series of small sorts to a single massive one. This
feature slightly increases the number of such paths.

Overhead:
Many subpaths may be encountered, as well as many pathkeys. So, to be
as careful as possible here, only cost estimation is performed.
---
 .../postgres_fdw/expected/postgres_fdw.out    |   6 +-
 src/backend/optimizer/path/allpaths.c         |  47 ++----
 src/backend/optimizer/path/pathkeys.c         | 101 +++++++++++++
 src/include/optimizer/paths.h                 |  10 ++
 .../regress/expected/collate.icu.utf8.out     |   9 +-
 src/test/regress/expected/inherit.out         |  19 ++-
 src/test/regress/expected/partition_join.out  | 139 +++++++++++-------
 src/test/regress/expected/partition_prune.out |  16 +-
 src/test/regress/expected/union.out           |   6 +-
 src/test/regress/sql/inherit.sql              |   4 +-
 10 files changed, 250 insertions(+), 107 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 24ff5f70cce..8e9c0c4e3ba 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10288,13 +10288,15 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
    ->  Nested Loop
          Join Filter: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t1_1.a
+                     ->  Foreign Scan on ftprt1_p1 t1_1
                ->  Foreign Scan on ftprt1_p2 t1_2
          ->  Materialize
                ->  Append
                      ->  Foreign Scan on ftprt2_p1 t2_1
                      ->  Foreign Scan on ftprt2_p2 t2_2
-(10 rows)
+(12 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
   a  |  b  
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905250b3325..a825225668c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1855,29 +1855,18 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 
 			/* Locate the right paths, if they are available. */
 			cheapest_startup =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   STARTUP_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, STARTUP_COST, false);
 			cheapest_total =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   TOTAL_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, TOTAL_COST, false);
 
 			/*
-			 * If we can't find any paths with the right order just use the
-			 * cheapest-total path; we'll have to sort it later.
+			 * In accordance with current planning logic there are no
+			 * parameterised paths under a merge append.
 			 */
-			if (cheapest_startup == NULL || cheapest_total == NULL)
-			{
-				cheapest_startup = cheapest_total =
-					childrel->cheapest_total_path;
-				/* Assert we do have an unparameterized path for this child */
-				Assert(cheapest_total->param_info == NULL);
-			}
+			Assert(cheapest_startup != NULL && cheapest_total != NULL);
+			Assert(cheapest_total->param_info == NULL);
 
 			/*
 			 * When building a fractional path, determine a cheapest
@@ -1894,21 +1883,17 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 				double		path_fraction = (1.0 / root->tuple_fraction);
 
 				cheapest_fractional =
-					get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
-															  pathkeys,
-															  NULL,
-															  path_fraction);
+					get_cheapest_fractional_path_for_pathkeys_ext(root,
+																  childrel,
+																  pathkeys,
+																  NULL,
+																  path_fraction);
 
 				/*
-				 * If we found no path with matching pathkeys, use the
-				 * cheapest total path instead.
-				 *
-				 * XXX We might consider partially sorted paths too (with an
-				 * incremental sort on top). But we'd have to build all the
-				 * incremental paths, do the costing etc.
+				 * In accordance with current planning logic there are no
+				 * parameterised fractional paths under a merge append.
 				 */
-				if (!cheapest_fractional)
-					cheapest_fractional = cheapest_total;
+				Assert(cheapest_fractional != NULL);
 			}
 
 			/*
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 8b04d40d36d..b67af13a71a 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -19,11 +19,13 @@
 
 #include "access/stratnum.h"
 #include "catalog/pg_opfamily.h"
+#include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "optimizer/planner.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -648,6 +650,57 @@ get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_path_for_pathkeys_ext
+ *	  Calls get_cheapest_path_for_pathkeys to obtain cheapest path that
+ *	  satisfies defined criteria and considers one more option: choose
+ *	  overall-optimal path (according to the criterion) and explicitly sort its
+ *	  output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_path_for_pathkeys_ext(PlannerInfo *root, RelOptInfo *rel,
+								   List *pathkeys, Relids required_outer,
+								   CostSelector cost_criterion,
+								   bool require_parallel_safe)
+{
+	Path	sort_path;
+	Path   *base_path = rel->cheapest_total_path;
+	Path   *path;
+
+	path = get_cheapest_path_for_pathkeys(rel->pathlist, pathkeys,
+										  required_outer, cost_criterion,
+										  require_parallel_safe);
+
+	/* Stop here if the cheapest total path doesn't satisfy necessary conditions */
+	if ((require_parallel_safe && !base_path->parallel_safe) ||
+		!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path == NULL)
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return base_path;
+
+	/* Consider the cheapest total path with extra sort */
+	if (base_path && path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (compare_path_costs(&sort_path, path, cost_criterion) < 0)
+			return base_path;
+	}
+	return path;
+}
+
 /*
  * get_cheapest_fractional_path_for_pathkeys
  *	  Find the cheapest path (for retrieving a specified fraction of all
@@ -690,6 +743,54 @@ get_cheapest_fractional_path_for_pathkeys(List *paths,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_fractional_path_for_pathkeys_ext
+ *	  obtain cheapest fractional path that satisfies defined criteria excluding
+ *	  pathkeys and explicitly sort its output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  List *pathkeys,
+											  Relids required_outer,
+											  double fraction)
+{
+	Path			sort_path;
+	Path		   *base_path = rel->cheapest_total_path;
+	Path		   *path;
+
+	path = get_cheapest_fractional_path_for_pathkeys(rel->pathlist, pathkeys,
+													 required_outer, fraction);
+
+	/* Stop here if the cheapest total path doesn't satisfy necessary conditions */
+	if (!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path == NULL)
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return base_path;
+
+	/* Consider the cheapest total path with extra sort */
+	if (base_path && path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (compare_fractional_path_costs(&sort_path, path, fraction) <= 0)
+			return base_path;
+	}
+
+	return path;
+}
 
 /*
  * get_cheapest_parallel_safe_total_inner
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index a48c9721797..4bdc85afca9 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -224,10 +224,20 @@ extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
 											bool require_parallel_safe);
+extern Path *get_cheapest_path_for_pathkeys_ext(PlannerInfo *root,
+												RelOptInfo *rel, List *pathkeys,
+												Relids required_outer,
+												CostSelector cost_criterion,
+												bool require_parallel_safe);
 extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 													   List *pathkeys,
 													   Relids required_outer,
 													   double fraction);
+extern Path *get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+														   RelOptInfo *rel,
+														   List *pathkeys,
+														   Relids required_outer,
+														   double fraction);
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 								  ScanDirection scandir);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 69805d4b9ec..48d47bb7455 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -2412,16 +2412,19 @@ EXPLAIN (COSTS OFF)
 SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" ORDER BY 1;
                        QUERY PLAN                       
 --------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: ((pagg_tab3.c)::text) COLLATE "C"
-   ->  Append
+   ->  Sort
+         Sort Key: ((pagg_tab3.c)::text) COLLATE "C"
          ->  HashAggregate
                Group Key: (pagg_tab3.c)::text
                ->  Seq Scan on pagg_tab3_p2 pagg_tab3
+   ->  Sort
+         Sort Key: ((pagg_tab3_1.c)::text) COLLATE "C"
          ->  HashAggregate
                Group Key: (pagg_tab3_1.c)::text
                ->  Seq Scan on pagg_tab3_p1 pagg_tab3_1
-(9 rows)
+(12 rows)
 
 SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" ORDER BY 1;
  c | count 
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index f9b0c415cfd..71036dc938f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1665,11 +1665,13 @@ insert into matest2 (name) values ('Test 3');
 insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
                          QUERY PLAN                         
 ------------------------------------------------------------
  Sort
+   Disabled: true
    Output: matest0.id, matest0.name, ((1 - matest0.id))
    Sort Key: ((1 - matest0.id))
    ->  Result
@@ -1683,7 +1685,7 @@ explain (verbose, costs off) select * from matest0 order by 1-id;
                      Output: matest0_3.id, matest0_3.name
                ->  Seq Scan on public.matest3 matest0_4
                      Output: matest0_4.id, matest0_4.name
-(14 rows)
+(15 rows)
 
 select * from matest0 order by 1-id;
  id |  name  
@@ -1719,6 +1721,7 @@ select min(1-id) from matest0;
 (1 row)
 
 reset enable_indexscan;
+reset enable_sort;
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
 explain (verbose, costs off) select * from matest0 order by 1-id;
@@ -1844,16 +1847,20 @@ order by t1.b limit 10;
          Merge Cond: (t1.b = t2.b)
          ->  Merge Append
                Sort Key: t1.b
-               ->  Index Scan using matest0i on matest0 t1_1
+               ->  Sort
+                     Sort Key: t1_1.b
+                     ->  Seq Scan on matest0 t1_1
                ->  Index Scan using matest1i on matest1 t1_2
          ->  Materialize
                ->  Merge Append
                      Sort Key: t2.b
-                     ->  Index Scan using matest0i on matest0 t2_1
-                           Filter: (c = d)
+                     ->  Sort
+                           Sort Key: t2_1.b
+                           ->  Seq Scan on matest0 t2_1
+                                 Filter: (c = d)
                      ->  Index Scan using matest1i on matest1 t2_2
                            Filter: (c = d)
-(14 rows)
+(18 rows)
 
 reset enable_nestloop;
 drop table matest0 cascade;
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 6101c8c7cf1..d41979367c9 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -65,31 +65,34 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
 -- inner join with partially-redundant join clauses
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
-                          QUERY PLAN                           
----------------------------------------------------------------
- Sort
+                       QUERY PLAN                        
+---------------------------------------------------------
+ Merge Append
    Sort Key: t1.a
-   ->  Append
-         ->  Merge Join
-               Merge Cond: (t1_1.a = t2_1.a)
-               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
-               ->  Sort
-                     Sort Key: t2_1.b
-                     ->  Seq Scan on prt2_p1 t2_1
-                           Filter: (a = b)
+   ->  Merge Join
+         Merge Cond: (t1_1.a = t2_1.a)
+         ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t2_1.b
+               ->  Seq Scan on prt2_p1 t2_1
+                     Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Join
                Hash Cond: (t1_2.a = t2_2.a)
                ->  Seq Scan on prt1_p2 t1_2
                ->  Hash
                      ->  Seq Scan on prt2_p2 t2_2
                            Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Join
                Hash Cond: (t1_3.a = t2_3.a)
                ->  Seq Scan on prt1_p3 t1_3
                ->  Hash
                      ->  Seq Scan on prt2_p3 t2_3
                            Filter: (a = b)
-(22 rows)
+(25 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
  a  |  c   | b  |  c   
@@ -1383,28 +1386,32 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
 -- This should generate a partitionwise join, but currently fails to
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-                        QUERY PLAN                         
------------------------------------------------------------
- Incremental Sort
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Sort
    Sort Key: prt1.a, prt2.b
-   Presorted Key: prt1.a
-   ->  Merge Left Join
-         Merge Cond: (prt1.a = prt2.b)
-         ->  Sort
-               Sort Key: prt1.a
-               ->  Append
-                     ->  Seq Scan on prt1_p1 prt1_1
-                           Filter: ((a < 450) AND (b = 0))
-                     ->  Seq Scan on prt1_p2 prt1_2
-                           Filter: ((a < 450) AND (b = 0))
-         ->  Sort
-               Sort Key: prt2.b
-               ->  Append
+   ->  Merge Right Join
+         Merge Cond: (prt2.b = prt1.a)
+         ->  Append
+               ->  Sort
+                     Sort Key: prt2_1.b
                      ->  Seq Scan on prt2_p2 prt2_1
                            Filter: (b > 250)
+               ->  Sort
+                     Sort Key: prt2_2.b
                      ->  Seq Scan on prt2_p3 prt2_2
                            Filter: (b > 250)
-(19 rows)
+         ->  Materialize
+               ->  Append
+                     ->  Sort
+                           Sort Key: prt1_1.a
+                           ->  Seq Scan on prt1_p1 prt1_1
+                                 Filter: ((a < 450) AND (b = 0))
+                     ->  Sort
+                           Sort Key: prt1_2.a
+                           ->  Seq Scan on prt1_p2 prt1_2
+                                 Filter: ((a < 450) AND (b = 0))
+(23 rows)
 
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
   a  |  b  
@@ -1424,25 +1431,33 @@ SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT *
 -- partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
-                                       QUERY PLAN                                        
------------------------------------------------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Merge Join
-   Merge Cond: ((t1.a = t2.b) AND (((((t1.*)::prt1))::text) = ((((t2.*)::prt2))::text)))
-   ->  Sort
-         Sort Key: t1.a, ((((t1.*)::prt1))::text)
-         ->  Result
-               ->  Append
-                     ->  Seq Scan on prt1_p1 t1_1
-                     ->  Seq Scan on prt1_p2 t1_2
-                     ->  Seq Scan on prt1_p3 t1_3
-   ->  Sort
-         Sort Key: t2.b, ((((t2.*)::prt2))::text)
-         ->  Result
-               ->  Append
+   Merge Cond: (t1.a = t2.b)
+   Join Filter: ((((t2.*)::prt2))::text = (((t1.*)::prt1))::text)
+   ->  Append
+         ->  Sort
+               Sort Key: t1_1.a
+               ->  Seq Scan on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t1_2.a
+               ->  Seq Scan on prt1_p2 t1_2
+         ->  Sort
+               Sort Key: t1_3.a
+               ->  Seq Scan on prt1_p3 t1_3
+   ->  Materialize
+         ->  Append
+               ->  Sort
+                     Sort Key: t2_1.b
                      ->  Seq Scan on prt2_p1 t2_1
+               ->  Sort
+                     Sort Key: t2_2.b
                      ->  Seq Scan on prt2_p2 t2_2
+               ->  Sort
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3
-(16 rows)
+(24 rows)
 
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
  a  | b  
@@ -4508,9 +4523,10 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
                                    QUERY PLAN                                   
 --------------------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a
          ->  Hash Right Join
                Hash Cond: ((t3_1.a = t1_1.a) AND (t3_1.c = t1_1.c))
                ->  Seq Scan on plt1_adv_p1 t3_1
@@ -4521,6 +4537,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p1 t1_1
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Right Join
                Hash Cond: ((t3_2.a = t1_2.a) AND (t3_2.c = t1_2.c))
                ->  Seq Scan on plt1_adv_p2 t3_2
@@ -4531,6 +4549,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p2 t1_2
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Right Join
                Hash Cond: ((t3_3.a = t1_3.a) AND (t3_3.c = t1_3.c))
                ->  Seq Scan on plt1_adv_p3 t3_3
@@ -4541,15 +4561,19 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p3 t1_3
                                        Filter: (b < 10)
-         ->  Nested Loop Left Join
-               Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
-               ->  Nested Loop Left Join
-                     Join Filter: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+   ->  Nested Loop Left Join
+         Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
+         ->  Merge Left Join
+               Merge Cond: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+               ->  Sort
+                     Sort Key: t1_4.a, t1_4.c
                      ->  Seq Scan on plt1_adv_extra t1_4
                            Filter: (b < 10)
+               ->  Sort
+                     Sort Key: t2_4.a, t2_4.c
                      ->  Seq Scan on plt2_adv_extra t2_4
-               ->  Seq Scan on plt1_adv_extra t3_4
-(41 rows)
+         ->  Seq Scan on plt1_adv_extra t3_4
+(50 rows)
 
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
  a  |  c   | a |  c   | a |  c   
@@ -5037,21 +5061,26 @@ EXPLAIN (COSTS OFF)
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
                              QUERY PLAN                             
 --------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a, t1.b
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a, t1_1.b
          ->  Hash Join
                Hash Cond: ((t1_1.a = t2_1.a) AND (t1_1.b = t2_1.b))
                ->  Seq Scan on alpha_neg_p1 t1_1
                      Filter: ((b >= 125) AND (b < 225))
                ->  Hash
                      ->  Seq Scan on beta_neg_p1 t2_1
+   ->  Sort
+         Sort Key: t1_2.a, t1_2.b
          ->  Hash Join
                Hash Cond: ((t2_2.a = t1_2.a) AND (t2_2.b = t1_2.b))
                ->  Seq Scan on beta_neg_p2 t2_2
                ->  Hash
                      ->  Seq Scan on alpha_neg_p2 t1_2
                            Filter: ((b >= 125) AND (b < 225))
+   ->  Sort
+         Sort Key: t1_4.a, t1_4.b
          ->  Hash Join
                Hash Cond: ((t2_4.a = t1_4.a) AND (t2_4.b = t1_4.b))
                ->  Append
@@ -5066,7 +5095,7 @@ SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2
                                  Filter: ((b >= 125) AND (b < 225))
                            ->  Seq Scan on alpha_pos_p3 t1_6
                                  Filter: ((b >= 125) AND (b < 225))
-(29 rows)
+(34 rows)
 
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
  a  |  b  |  c   | a  |  b  |  c   
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 0bf35260b46..b25aa73e946 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -4763,9 +4763,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc.a ORDER BY part_abc.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_1
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d <= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_1.a
+                           ->  Seq Scan on part_abc_2 part_abc_1
+                                 Filter: ((d <= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: part_abc_3.a
                            Subplans Removed: 1
@@ -4780,9 +4781,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc_5.a ORDER BY part_abc_5.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_6
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d >= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_6.a
+                           ->  Seq Scan on part_abc_2 part_abc_6
+                                 Filter: ((d >= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: a
                            Subplans Removed: 1
@@ -4792,7 +4794,7 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                            ->  Index Scan using part_abc_3_3_a_idx on part_abc_3_3 part_abc_9
                                  Index Cond: (a >= (stable_one() + 1))
                                  Filter: (d >= stable_one())
-(35 rows)
+(37 rows)
 
 drop view part_abc_view;
 drop table part_abc;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 96962817ed4..0ccadea910c 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1207,12 +1207,14 @@ select event_id
 ----------------------------------------------------------
  Merge Append
    Sort Key: events.event_id
-   ->  Index Scan using events_pkey on events
+   ->  Sort
+         Sort Key: events.event_id
+         ->  Seq Scan on events
    ->  Sort
          Sort Key: events_1.event_id
          ->  Seq Scan on events_child events_1
    ->  Index Scan using other_events_pkey on other_events
-(7 rows)
+(9 rows)
 
 drop table events_child, events, other_events;
 reset enable_indexonlyscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 699e8ac09c8..c58beebbd1e 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -641,12 +641,14 @@ insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
 
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
 select * from matest0 order by 1-id;
 explain (verbose, costs off) select min(1-id) from matest0;
 select min(1-id) from matest0;
 reset enable_indexscan;
+reset enable_sort;
 
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
-- 
2.43.0

#14

Andrei Lepikhov

lepihov@gmail.com

8 months ago

In reply to: Alexander Pyhalov (#13)

Re: MergeAppend could consider sorting cheapest child path

On 5/5/2025 15:56, Alexander Pyhalov wrote:

Andrei Lepikhov wrote on 2025-05-05 14:38:
Also, the logic differs a bit if path is NULL. In
get_cheapest_path_for_pathkeys_ext() we explicitly check for path being
NULL, while in get_cheapest_fractional_path_for_pathkeys_ext() we check
only after calculating the sort cost.

I've tried to fix comments a bit and unified functions definitions.

Generally seems ok, I'm not a native speaker to judge the comments. But:
if (base_path && path != base_path)

What is the case in your mind where the base_path pointer still may be
null at that point?

--
regards, Andrei Lepikhov

#15

Alexander Pyhalov

a.pyhalov@postgrespro.ru

8 months ago

In reply to: Andrei Lepikhov (#14)

1 attachment(s)

Re: MergeAppend could consider sorting cheapest child path

Andrei Lepikhov wrote on 2025-05-07 08:02:

On 5/5/2025 15:56, Alexander Pyhalov wrote:

Andrei Lepikhov wrote on 2025-05-05 14:38:
Also, the logic differs a bit if path is NULL. In
get_cheapest_path_for_pathkeys_ext() we explicitly check for path
being NULL, while in get_cheapest_fractional_path_for_pathkeys_ext() we
check only after calculating the sort cost.

I've tried to fix comments a bit and unified functions definitions.

Generally seems ok, I'm not a native speaker to judge the comments.
But:
if (base_path && path != base_path)

What is the case in your mind where the base_path pointer still may be
null at that point?

Hi.

It seems that if some childrel doesn't have a valid pathlist,
subpaths_valid will be false in add_paths_to_append_rel() and
generate_orderedappend_paths() will not be called. So when we iterate
over live_childrels, every one of them has a cheapest_total_path. This
is why we can assert that base_path is not NULL.
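
For illustration only, the choice made by get_cheapest_path_for_pathkeys_ext() in the attached patch can be sketched outside the planner roughly as below. SketchPath, sketch_sorted_total_cost() and sketch_choose_subpath() are hypothetical names that do not exist in PostgreSQL, and the linear per-row sort charge is a toy stand-in for cost_sort(). The sketch compares total cost only and leaves out the parallel-safety and parameterization checks that the real function performs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the planner's Path node. */
typedef struct SketchPath
{
	double		total_cost;		/* cost to fetch all rows */
	double		rows;			/* estimated row count */
} SketchPath;

/*
 * Toy estimate of "input path plus explicit Sort on top".
 * Assumption: a flat per-row charge; the real cost_sort() is more involved.
 */
double
sketch_sorted_total_cost(const SketchPath *input, double sort_cost_per_row)
{
	return input->total_cost + sort_cost_per_row * input->rows;
}

/*
 * Pick the subpath to put under a MergeAppend.
 *
 * presorted      - cheapest path already ordered by the pathkeys, or NULL
 * cheapest_total - overall cheapest path; must not be NULL, mirroring the
 *                  Assert(base_path) discussed above
 *
 * Returning cheapest_total means the caller has to add a Sort on top of it.
 */
const SketchPath *
sketch_choose_subpath(const SketchPath *presorted,
					  const SketchPath *cheapest_total,
					  double sort_cost_per_row)
{
	assert(cheapest_total != NULL);

	/* Nothing is presorted: the only option is to sort the cheapest path. */
	if (presorted == NULL)
		return cheapest_total;

	/* Prefer "cheapest + Sort" only when it is strictly cheaper. */
	if (presorted != cheapest_total &&
		sketch_sorted_total_cost(cheapest_total, sort_cost_per_row) <
		presorted->total_cost)
		return cheapest_total;

	return presorted;
}
```

With an expensive ordered index scan and a cheap sequential scan, the sketch picks the sequential scan, which corresponds to the Sort-over-Seq-Scan children that appear in the changed regression outputs.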
--
Best regards,
Alexander Pyhalov,
Postgres Professional

Attachments:

v5-0001-Consider-explicit-sort-of-the-MergeAppend-subpath.patch (text/x-diff)

From f731a93c1e182428a4f2d3277e4f7f14df3b5dcb Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Mon, 5 May 2025 16:23:51 +0300
Subject: [PATCH] Consider explicit sort of the MergeAppend subpaths.

Expand the optimiser search scope a little: fetching optimal subpath matching
pathkeys of the planning MergeAppend, consider the extra case of
overall-optimal path plus explicit Sort node at the top.

It may provide a more effective plan in both full and fractional scan cases:
1. The Sort node may be pushed down to subpaths under a parallel or
async Append.
2. The case when a minor set of subpaths doesn't have a proper index, and it
is profitable to sort them instead of switching to plain Append.

In general, analysing the regression tests changed by this code, it seems that
the cost model prefers a series of small sortings instead of a single massive
one. This feature increases a little the number of such paths.

Overhead:
It seems multiple subpaths may be encountered, as well as many pathkeys.
So, to be as careful as possible here, only cost estimation is performed.
---
 .../postgres_fdw/expected/postgres_fdw.out    |   6 +-
 src/backend/optimizer/path/allpaths.c         |  47 ++----
 src/backend/optimizer/path/pathkeys.c         | 115 +++++++++++++++
 src/include/optimizer/paths.h                 |  10 ++
 .../regress/expected/collate.icu.utf8.out     |   9 +-
 src/test/regress/expected/inherit.out         |  19 ++-
 src/test/regress/expected/partition_join.out  | 139 +++++++++++-------
 src/test/regress/expected/partition_prune.out |  16 +-
 src/test/regress/expected/union.out           |   6 +-
 src/test/regress/sql/inherit.sql              |   4 +-
 10 files changed, 264 insertions(+), 107 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 24ff5f70cce..8e9c0c4e3ba 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10288,13 +10288,15 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
    ->  Nested Loop
          Join Filter: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t1_1.a
+                     ->  Foreign Scan on ftprt1_p1 t1_1
                ->  Foreign Scan on ftprt1_p2 t1_2
          ->  Materialize
                ->  Append
                      ->  Foreign Scan on ftprt2_p1 t2_1
                      ->  Foreign Scan on ftprt2_p2 t2_2
-(10 rows)
+(12 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
   a  |  b  
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905250b3325..a825225668c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1855,29 +1855,18 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 
 			/* Locate the right paths, if they are available. */
 			cheapest_startup =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   STARTUP_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, STARTUP_COST, false);
 			cheapest_total =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   TOTAL_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, TOTAL_COST, false);
 
 			/*
-			 * If we can't find any paths with the right order just use the
-			 * cheapest-total path; we'll have to sort it later.
+			 * In accordance to current planning logic there are no
+			 * parameterised paths under a merge append.
 			 */
-			if (cheapest_startup == NULL || cheapest_total == NULL)
-			{
-				cheapest_startup = cheapest_total =
-					childrel->cheapest_total_path;
-				/* Assert we do have an unparameterized path for this child */
-				Assert(cheapest_total->param_info == NULL);
-			}
+			Assert(cheapest_startup != NULL && cheapest_total != NULL);
+			Assert(cheapest_total->param_info == NULL);
 
 			/*
 			 * When building a fractional path, determine a cheapest
@@ -1894,21 +1883,17 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 				double		path_fraction = (1.0 / root->tuple_fraction);
 
 				cheapest_fractional =
-					get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
-															  pathkeys,
-															  NULL,
-															  path_fraction);
+					get_cheapest_fractional_path_for_pathkeys_ext(root,
+																  childrel,
+																  pathkeys,
+																  NULL,
+																  path_fraction);
 
 				/*
-				 * If we found no path with matching pathkeys, use the
-				 * cheapest total path instead.
-				 *
-				 * XXX We might consider partially sorted paths too (with an
-				 * incremental sort on top). But we'd have to build all the
-				 * incremental paths, do the costing etc.
+				 * In accordance to current planning logic there are no
+				 * parameterised fractional paths under a merge append.
 				 */
-				if (!cheapest_fractional)
-					cheapest_fractional = cheapest_total;
+				Assert(cheapest_fractional != NULL);
 			}
 
 			/*
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 8b04d40d36d..3a13b3d02ee 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -19,11 +19,13 @@
 
 #include "access/stratnum.h"
 #include "catalog/pg_opfamily.h"
+#include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "optimizer/planner.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -648,6 +650,64 @@ get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_path_for_pathkeys_ext
+ *	  Calls get_cheapest_path_for_pathkeys to obtain cheapest path that
+ *	  satisfies defined criterias and considers one more option: choose
+ *	  overall-optimal path (according the criterion) and explicitly sort its
+ *	  output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_path_for_pathkeys_ext(PlannerInfo *root, RelOptInfo *rel,
+								   List *pathkeys, Relids required_outer,
+								   CostSelector cost_criterion,
+								   bool require_parallel_safe)
+{
+	Path		sort_path;
+	Path	   *base_path = rel->cheapest_total_path;
+	Path	   *path;
+
+	/* In generate_orderedappend_paths() all childrels do have some paths */
+	Assert(base_path);
+
+	path = get_cheapest_path_for_pathkeys(rel->pathlist, pathkeys,
+										  required_outer, cost_criterion,
+										  require_parallel_safe);
+
+	/*
+	 * Stop here if the cheapest total path doesn't satisfy necessary
+	 * conditions
+	 */
+	if ((require_parallel_safe && !base_path->parallel_safe) ||
+		!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path == NULL)
+
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return base_path;
+
+	/* Consider the cheapest total path with extra sort */
+	if (path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (compare_path_costs(&sort_path, path, cost_criterion) < 0)
+			return base_path;
+	}
+	return path;
+}
+
 /*
  * get_cheapest_fractional_path_for_pathkeys
  *	  Find the cheapest path (for retrieving a specified fraction of all
@@ -690,6 +750,61 @@ get_cheapest_fractional_path_for_pathkeys(List *paths,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_fractional_path_for_pathkeys_ext
+ *	  obtain cheapest fractional path that satisfies defined criterias excluding
+ *	  pathkeys and explicitly sort its output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  List *pathkeys,
+											  Relids required_outer,
+											  double fraction)
+{
+	Path		sort_path;
+	Path	   *base_path = rel->cheapest_total_path;
+	Path	   *path;
+
+	/* In generate_orderedappend_paths() all childrels do have some paths */
+	Assert(base_path);
+
+	path = get_cheapest_fractional_path_for_pathkeys(rel->pathlist, pathkeys,
+													 required_outer, fraction);
+
+	/*
+	 * Stop here if the cheapest total path doesn't satisfy necessary
+	 * conditions
+	 */
+	if (!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path == NULL)
+
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return base_path;
+
+	/* Consider the cheapest total path with extra sort */
+	if (path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (compare_fractional_path_costs(&sort_path, path, fraction) <= 0)
+			return base_path;
+	}
+
+	return path;
+}
 
 /*
  * get_cheapest_parallel_safe_total_inner
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index a48c9721797..4bdc85afca9 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -224,10 +224,20 @@ extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
 											bool require_parallel_safe);
+extern Path *get_cheapest_path_for_pathkeys_ext(PlannerInfo *root,
+												RelOptInfo *rel, List *pathkeys,
+												Relids required_outer,
+												CostSelector cost_criterion,
+												bool require_parallel_safe);
 extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 													   List *pathkeys,
 													   Relids required_outer,
 													   double fraction);
+extern Path *get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+														   RelOptInfo *rel,
+														   List *pathkeys,
+														   Relids required_outer,
+														   double fraction);
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 								  ScanDirection scandir);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 69805d4b9ec..48d47bb7455 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -2412,16 +2412,19 @@ EXPLAIN (COSTS OFF)
 SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" ORDER BY 1;
                        QUERY PLAN                       
 --------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: ((pagg_tab3.c)::text) COLLATE "C"
-   ->  Append
+   ->  Sort
+         Sort Key: ((pagg_tab3.c)::text) COLLATE "C"
          ->  HashAggregate
                Group Key: (pagg_tab3.c)::text
                ->  Seq Scan on pagg_tab3_p2 pagg_tab3
+   ->  Sort
+         Sort Key: ((pagg_tab3_1.c)::text) COLLATE "C"
          ->  HashAggregate
                Group Key: (pagg_tab3_1.c)::text
                ->  Seq Scan on pagg_tab3_p1 pagg_tab3_1
-(9 rows)
+(12 rows)
 
 SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" ORDER BY 1;
  c | count 
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index f9b0c415cfd..71036dc938f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1665,11 +1665,13 @@ insert into matest2 (name) values ('Test 3');
 insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
                          QUERY PLAN                         
 ------------------------------------------------------------
  Sort
+   Disabled: true
    Output: matest0.id, matest0.name, ((1 - matest0.id))
    Sort Key: ((1 - matest0.id))
    ->  Result
@@ -1683,7 +1685,7 @@ explain (verbose, costs off) select * from matest0 order by 1-id;
                      Output: matest0_3.id, matest0_3.name
                ->  Seq Scan on public.matest3 matest0_4
                      Output: matest0_4.id, matest0_4.name
-(14 rows)
+(15 rows)
 
 select * from matest0 order by 1-id;
  id |  name  
@@ -1719,6 +1721,7 @@ select min(1-id) from matest0;
 (1 row)
 
 reset enable_indexscan;
+reset enable_sort;
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
 explain (verbose, costs off) select * from matest0 order by 1-id;
@@ -1844,16 +1847,20 @@ order by t1.b limit 10;
          Merge Cond: (t1.b = t2.b)
          ->  Merge Append
                Sort Key: t1.b
-               ->  Index Scan using matest0i on matest0 t1_1
+               ->  Sort
+                     Sort Key: t1_1.b
+                     ->  Seq Scan on matest0 t1_1
                ->  Index Scan using matest1i on matest1 t1_2
          ->  Materialize
                ->  Merge Append
                      Sort Key: t2.b
-                     ->  Index Scan using matest0i on matest0 t2_1
-                           Filter: (c = d)
+                     ->  Sort
+                           Sort Key: t2_1.b
+                           ->  Seq Scan on matest0 t2_1
+                                 Filter: (c = d)
                      ->  Index Scan using matest1i on matest1 t2_2
                            Filter: (c = d)
-(14 rows)
+(18 rows)
 
 reset enable_nestloop;
 drop table matest0 cascade;
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 6101c8c7cf1..d41979367c9 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -65,31 +65,34 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
 -- inner join with partially-redundant join clauses
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
-                          QUERY PLAN                           
----------------------------------------------------------------
- Sort
+                       QUERY PLAN                        
+---------------------------------------------------------
+ Merge Append
    Sort Key: t1.a
-   ->  Append
-         ->  Merge Join
-               Merge Cond: (t1_1.a = t2_1.a)
-               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
-               ->  Sort
-                     Sort Key: t2_1.b
-                     ->  Seq Scan on prt2_p1 t2_1
-                           Filter: (a = b)
+   ->  Merge Join
+         Merge Cond: (t1_1.a = t2_1.a)
+         ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t2_1.b
+               ->  Seq Scan on prt2_p1 t2_1
+                     Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Join
                Hash Cond: (t1_2.a = t2_2.a)
                ->  Seq Scan on prt1_p2 t1_2
                ->  Hash
                      ->  Seq Scan on prt2_p2 t2_2
                            Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Join
                Hash Cond: (t1_3.a = t2_3.a)
                ->  Seq Scan on prt1_p3 t1_3
                ->  Hash
                      ->  Seq Scan on prt2_p3 t2_3
                            Filter: (a = b)
-(22 rows)
+(25 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
  a  |  c   | b  |  c   
@@ -1383,28 +1386,32 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
 -- This should generate a partitionwise join, but currently fails to
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-                        QUERY PLAN                         
------------------------------------------------------------
- Incremental Sort
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Sort
    Sort Key: prt1.a, prt2.b
-   Presorted Key: prt1.a
-   ->  Merge Left Join
-         Merge Cond: (prt1.a = prt2.b)
-         ->  Sort
-               Sort Key: prt1.a
-               ->  Append
-                     ->  Seq Scan on prt1_p1 prt1_1
-                           Filter: ((a < 450) AND (b = 0))
-                     ->  Seq Scan on prt1_p2 prt1_2
-                           Filter: ((a < 450) AND (b = 0))
-         ->  Sort
-               Sort Key: prt2.b
-               ->  Append
+   ->  Merge Right Join
+         Merge Cond: (prt2.b = prt1.a)
+         ->  Append
+               ->  Sort
+                     Sort Key: prt2_1.b
                      ->  Seq Scan on prt2_p2 prt2_1
                            Filter: (b > 250)
+               ->  Sort
+                     Sort Key: prt2_2.b
                      ->  Seq Scan on prt2_p3 prt2_2
                            Filter: (b > 250)
-(19 rows)
+         ->  Materialize
+               ->  Append
+                     ->  Sort
+                           Sort Key: prt1_1.a
+                           ->  Seq Scan on prt1_p1 prt1_1
+                                 Filter: ((a < 450) AND (b = 0))
+                     ->  Sort
+                           Sort Key: prt1_2.a
+                           ->  Seq Scan on prt1_p2 prt1_2
+                                 Filter: ((a < 450) AND (b = 0))
+(23 rows)
 
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
   a  |  b  
@@ -1424,25 +1431,33 @@ SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT *
 -- partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
-                                       QUERY PLAN                                        
------------------------------------------------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Merge Join
-   Merge Cond: ((t1.a = t2.b) AND (((((t1.*)::prt1))::text) = ((((t2.*)::prt2))::text)))
-   ->  Sort
-         Sort Key: t1.a, ((((t1.*)::prt1))::text)
-         ->  Result
-               ->  Append
-                     ->  Seq Scan on prt1_p1 t1_1
-                     ->  Seq Scan on prt1_p2 t1_2
-                     ->  Seq Scan on prt1_p3 t1_3
-   ->  Sort
-         Sort Key: t2.b, ((((t2.*)::prt2))::text)
-         ->  Result
-               ->  Append
+   Merge Cond: (t1.a = t2.b)
+   Join Filter: ((((t2.*)::prt2))::text = (((t1.*)::prt1))::text)
+   ->  Append
+         ->  Sort
+               Sort Key: t1_1.a
+               ->  Seq Scan on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t1_2.a
+               ->  Seq Scan on prt1_p2 t1_2
+         ->  Sort
+               Sort Key: t1_3.a
+               ->  Seq Scan on prt1_p3 t1_3
+   ->  Materialize
+         ->  Append
+               ->  Sort
+                     Sort Key: t2_1.b
                      ->  Seq Scan on prt2_p1 t2_1
+               ->  Sort
+                     Sort Key: t2_2.b
                      ->  Seq Scan on prt2_p2 t2_2
+               ->  Sort
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3
-(16 rows)
+(24 rows)
 
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
  a  | b  
@@ -4508,9 +4523,10 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
                                    QUERY PLAN                                   
 --------------------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a
          ->  Hash Right Join
                Hash Cond: ((t3_1.a = t1_1.a) AND (t3_1.c = t1_1.c))
                ->  Seq Scan on plt1_adv_p1 t3_1
@@ -4521,6 +4537,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p1 t1_1
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Right Join
                Hash Cond: ((t3_2.a = t1_2.a) AND (t3_2.c = t1_2.c))
                ->  Seq Scan on plt1_adv_p2 t3_2
@@ -4531,6 +4549,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p2 t1_2
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Right Join
                Hash Cond: ((t3_3.a = t1_3.a) AND (t3_3.c = t1_3.c))
                ->  Seq Scan on plt1_adv_p3 t3_3
@@ -4541,15 +4561,19 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p3 t1_3
                                        Filter: (b < 10)
-         ->  Nested Loop Left Join
-               Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
-               ->  Nested Loop Left Join
-                     Join Filter: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+   ->  Nested Loop Left Join
+         Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
+         ->  Merge Left Join
+               Merge Cond: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+               ->  Sort
+                     Sort Key: t1_4.a, t1_4.c
                      ->  Seq Scan on plt1_adv_extra t1_4
                            Filter: (b < 10)
+               ->  Sort
+                     Sort Key: t2_4.a, t2_4.c
                      ->  Seq Scan on plt2_adv_extra t2_4
-               ->  Seq Scan on plt1_adv_extra t3_4
-(41 rows)
+         ->  Seq Scan on plt1_adv_extra t3_4
+(50 rows)
 
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
  a  |  c   | a |  c   | a |  c   
@@ -5037,21 +5061,26 @@ EXPLAIN (COSTS OFF)
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
                              QUERY PLAN                             
 --------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a, t1.b
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a, t1_1.b
          ->  Hash Join
                Hash Cond: ((t1_1.a = t2_1.a) AND (t1_1.b = t2_1.b))
                ->  Seq Scan on alpha_neg_p1 t1_1
                      Filter: ((b >= 125) AND (b < 225))
                ->  Hash
                      ->  Seq Scan on beta_neg_p1 t2_1
+   ->  Sort
+         Sort Key: t1_2.a, t1_2.b
          ->  Hash Join
                Hash Cond: ((t2_2.a = t1_2.a) AND (t2_2.b = t1_2.b))
                ->  Seq Scan on beta_neg_p2 t2_2
                ->  Hash
                      ->  Seq Scan on alpha_neg_p2 t1_2
                            Filter: ((b >= 125) AND (b < 225))
+   ->  Sort
+         Sort Key: t1_4.a, t1_4.b
          ->  Hash Join
                Hash Cond: ((t2_4.a = t1_4.a) AND (t2_4.b = t1_4.b))
                ->  Append
@@ -5066,7 +5095,7 @@ SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2
                                  Filter: ((b >= 125) AND (b < 225))
                            ->  Seq Scan on alpha_pos_p3 t1_6
                                  Filter: ((b >= 125) AND (b < 225))
-(29 rows)
+(34 rows)
 
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
  a  |  b  |  c   | a  |  b  |  c   
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 0bf35260b46..b25aa73e946 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -4763,9 +4763,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc.a ORDER BY part_abc.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_1
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d <= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_1.a
+                           ->  Seq Scan on part_abc_2 part_abc_1
+                                 Filter: ((d <= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: part_abc_3.a
                            Subplans Removed: 1
@@ -4780,9 +4781,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc_5.a ORDER BY part_abc_5.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_6
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d >= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_6.a
+                           ->  Seq Scan on part_abc_2 part_abc_6
+                                 Filter: ((d >= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: a
                            Subplans Removed: 1
@@ -4792,7 +4794,7 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                            ->  Index Scan using part_abc_3_3_a_idx on part_abc_3_3 part_abc_9
                                  Index Cond: (a >= (stable_one() + 1))
                                  Filter: (d >= stable_one())
-(35 rows)
+(37 rows)
 
 drop view part_abc_view;
 drop table part_abc;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 96962817ed4..0ccadea910c 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1207,12 +1207,14 @@ select event_id
 ----------------------------------------------------------
  Merge Append
    Sort Key: events.event_id
-   ->  Index Scan using events_pkey on events
+   ->  Sort
+         Sort Key: events.event_id
+         ->  Seq Scan on events
    ->  Sort
          Sort Key: events_1.event_id
          ->  Seq Scan on events_child events_1
    ->  Index Scan using other_events_pkey on other_events
-(7 rows)
+(9 rows)
 
 drop table events_child, events, other_events;
 reset enable_indexonlyscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 699e8ac09c8..c58beebbd1e 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -641,12 +641,14 @@ insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
 
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
 select * from matest0 order by 1-id;
 explain (verbose, costs off) select min(1-id) from matest0;
 select min(1-id) from matest0;
 reset enable_indexscan;
+reset enable_sort;
 
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
-- 
2.43.0

#16

Andrei Lepikhov

lepihov@gmail.com

8 months ago

In reply to: Alexander Pyhalov (#15)

Re: MergeAppend could consider sorting cheapest child path

On 7/5/2025 08:57, Alexander Pyhalov wrote:

Andrei Lepikhov wrote on 2025-05-07 08:02:

On 5/5/2025 15:56, Alexander Pyhalov wrote:

Andrei Lepikhov wrote on 2025-05-05 14:38:
Also logic a bit differs if path is NULL. In
get_cheapest_path_for_pathkeys_ext() we explicitly check for path
being NULL, in get_cheapest_fractional_path_for_pathkeys_ext() only
after calculating sort cost.

I've tried to fix comments a bit and unified functions definitions.

Generally seems ok, I'm not a native speaker to judge the comments. But:
if (base_path && path != base_path)

What is the case in your mind where the base_path pointer still may be
null at that point?

Hi.

It seems if some childrel doesn't have valid pathlist, subpaths_valid
would be false in add_paths_to_append_rel()
and generate_orderedappend_paths() will not be called. So when we
iterate over live_childrels,
all of them will have cheapest_total path. This is why we can assert
that base_path is not NULL.

I'm not sure I understand you correctly. Under which conditions will
rel->cheapest_total_path be set to NULL for a childrel? Could you
provide an example?

--
regards, Andrei Lepikhov

#17

Alexander Pyhalov

a.pyhalov@postgrespro.ru

8 months ago

In reply to: Andrei Lepikhov (#16)

Re: MergeAppend could consider sorting cheapest child path

Andrei Lepikhov wrote on 2025-05-07 12:03:

On 7/5/2025 08:57, Alexander Pyhalov wrote:

Andrei Lepikhov wrote on 2025-05-07 08:02:

On 5/5/2025 15:56, Alexander Pyhalov wrote:

Andrei Lepikhov wrote on 2025-05-05 14:38:
Also logic a bit differs if path is NULL. In
get_cheapest_path_for_pathkeys_ext() we explicitly check for path
being NULL, in get_cheapest_fractional_path_for_pathkeys_ext() only
after calculating sort cost.

I've tried to fix comments a bit and unified functions definitions.

Generally seems ok, I'm not a native speaker to judge the comments.
But:
if (base_path && path != base_path)

What is the case in your mind where the base_path pointer still may
be null at that point?

Hi.

It seems if some childrel doesn't have valid pathlist, subpaths_valid
would be false in add_paths_to_append_rel()
and generate_orderedappend_paths() will not be called. So when we
iterate over live_childrels,
all of them will have cheapest_total path. This is why we can assert
that base_path is not NULL.

I'm not sure I understand you correctly. Under which conditions will
rel->cheapest_total_path be set to NULL for a childrel? Could you
provide an example?

Sorry, perhaps I was not clear enough. I've stated the opposite - it
seems we can be sure that it's not NULL.
--
Best regards,
Alexander Pyhalov,
Postgres Professional

#18

Alexander Korotkov

aekorotkov@gmail.com

7 months ago

In reply to: Alexander Pyhalov (#17)

Re: MergeAppend could consider sorting cheapest child path

Hi, Alexander!

On Wed, May 7, 2025 at 12:06 PM Alexander Pyhalov
<a.pyhalov@postgrespro.ru> wrote:

Andrei Lepikhov wrote on 2025-05-07 12:03:

On 7/5/2025 08:57, Alexander Pyhalov wrote:

Andrei Lepikhov wrote on 2025-05-07 08:02:

On 5/5/2025 15:56, Alexander Pyhalov wrote:

Andrei Lepikhov wrote on 2025-05-05 14:38:
Also logic a bit differs if path is NULL. In
get_cheapest_path_for_pathkeys_ext() we explicitly check for path
being NULL, in get_cheapest_fractional_path_for_pathkeys_ext() only
after calculating sort cost.

I've tried to fix comments a bit and unified functions definitions.

Generally seems ok, I'm not a native speaker to judge the comments.
But:
if (base_path && path != base_path)

What is the case in your mind where the base_path pointer still may
be null at that point?

Hi.

It seems if some childrel doesn't have valid pathlist, subpaths_valid
would be false in add_paths_to_append_rel()
and generate_orderedappend_paths() will not be called. So when we
iterate over live_childrels,
all of them will have cheapest_total path. This is why we can assert
that base_path is not NULL.

I'm not sure I understand you correctly. Under which conditions will
rel->cheapest_total_path be set to NULL for a childrel? Could you
provide an example?

Sorry, perhaps I was not clear enough. I've stated the opposite - it
seems we can be sure that it's not NULL.

Thank you for your work on this subject!

I have the following question. I see patch changes some existing
plans from Sort(Append(...)) to MergeAppend(Sort(), ..., Sort(...)) or
even Materialize(MergeAppend(Sort(), ..., Sort(...))). This should be
some problem in cost_sort(). Otherwise, that would mean that Sort
node doesn't know how to do its job: explicit splitting dataset into
pieces then merging sorting result appears to be cheaper, but Sort
node contains merge-sort algorithm inside and it's supposed to be more
efficient. Could you, please, revise the patch to avoid these
unwanted changes?
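
To make the comparison between Sort(Append(...)) and MergeAppend(Sort(), ..., Sort()) concrete, here is a minimal sketch of an idealized comparison count for both shapes. It is not the planner's cost_sort() arithmetic: all names are hypothetical, it counts only key comparisons, and it assumes the row count and the number of children are powers of two so the logarithms are exact.

```c
#include <assert.h>

/* Exact log2 of a power of two (hypothetical helper for this sketch). */
unsigned
sketch_log2_pow2(unsigned long n)
{
	unsigned	bits = 0;

	/* Only powers of two are accepted, to keep the arithmetic exact. */
	assert(n > 0 && (n & (n - 1)) == 0);
	while (n > 1)
	{
		n >>= 1;
		bits++;
	}
	return bits;
}

/* One sort over all n rows: about n * log2(n) comparisons. */
unsigned long
sketch_single_sort_cmps(unsigned long n)
{
	return n * sketch_log2_pow2(n);
}

/*
 * k independent sorts of n/k rows each, followed by a k-way merge that
 * spends about log2(k) comparisons per output row.
 */
unsigned long
sketch_split_sort_cmps(unsigned long n, unsigned long k)
{
	unsigned long per_child;

	assert(k > 0 && n >= k);
	per_child = n / k;

	return k * per_child * sketch_log2_pow2(per_child) +
		n * sketch_log2_pow2(k);
}
```

Under this model both shapes need the same number of comparisons, since k * (n/k) * log2(n/k) + n * log2(k) = n * log2(n). Any preference the planner shows for one shape therefore has to come from other cost terms, such as per-tuple overheads or the startup cost.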

------
Regards,
Alexander Korotkov
Supabase

#19

Andrei Lepikhov

lepihov@gmail.com

7 months ago

In reply to: Alexander Korotkov (#18)

2 attachment(s)

Re: MergeAppend could consider sorting cheapest child path

On 2/6/2025 20:21, Alexander Korotkov wrote:

I have the following question. I see patch changes some existing
plans from Sort(Append(...)) to MergeAppend(Sort(), ..., Sort(...)) or
even Materialize(MergeAppend(Sort(), ..., Sort(...))). This should be
some problem in cost_sort(). Otherwise, that would mean that Sort
node doesn't know how to do its job: explicit splitting dataset into
pieces then merging sorting result appears to be cheaper, but Sort
node contains merge-sort algorithm inside and it's supposed to be more
efficient. Could you, please, revise the patch to avoid these
unwanted changes?

I think, this issue is related to corner-cases of the
compare_path_costs_fuzzily.

Let's glance into one of the problematic queries:
EXPLAIN (COSTS ON)
SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C"
ORDER BY 1;

if you play with the plan, you can find that total_cost of the
Sort->Append path is cheaper:

Sort (cost=2.40..2.41 rows=4 width=40)
-> Append (cost=1.15..2.36 rows=4 width=40)
Merge Append (cost=2.37..2.42 rows=4 width=40)

But the difference is less than fuzz_factor. In this case, Postgres
probes startup_cost, which is obviously less for the MergeAppend strategy.
This is a good decision, and I think it should stay as is.
What can we do here? We might change the test to increase the cost gap.
However, while designing this patch, I skimmed through each broken query
and didn't find a reason to specifically shift to the Sort->Append
strategy, as it tested things that were not dependent on Append or Sort.
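
The tie-break described above can be sketched as follows. This is a simplified illustration, not the actual compare_path_costs_fuzzily() code: the type and function names are hypothetical, only startup and total cost are modelled (disabled nodes, pathkeys and parallel safety are ignored), and the fuzz factor of 1.01 used in the usage note is an assumed value.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the cost fields of a Path. */
typedef struct SketchCost
{
	double		startup_cost;
	double		total_cost;
} SketchCost;

typedef enum SketchCmp
{
	SKETCH_EQUAL,				/* fuzzily the same on both costs */
	SKETCH_BETTER1,				/* first path wins */
	SKETCH_BETTER2,				/* second path wins */
	SKETCH_DIFFERENT			/* each path wins on one cost */
} SketchCmp;

SketchCmp
sketch_compare_fuzzily(const SketchCost *p1, const SketchCost *p2,
					   double fuzz_factor)
{
	assert(fuzz_factor >= 1.0);

	if (p1->total_cost > p2->total_cost * fuzz_factor)
	{
		/* p1 is fuzzily worse on total cost */
		if (p2->startup_cost > p1->startup_cost * fuzz_factor)
			return SKETCH_DIFFERENT;
		return SKETCH_BETTER2;
	}
	if (p2->total_cost > p1->total_cost * fuzz_factor)
	{
		/* p2 is fuzzily worse on total cost */
		if (p1->startup_cost > p2->startup_cost * fuzz_factor)
			return SKETCH_DIFFERENT;
		return SKETCH_BETTER1;
	}

	/* Total costs are within the fuzz factor: fall back to startup cost. */
	if (p1->startup_cost > p2->startup_cost * fuzz_factor)
		return SKETCH_BETTER2;
	if (p2->startup_cost > p1->startup_cost * fuzz_factor)
		return SKETCH_BETTER1;
	return SKETCH_EQUAL;
}
```

With the costs quoted above, Sort at (2.40, 2.41) and Merge Append at (2.37, 2.42), the total costs differ by less than 1%, so the sketch falls through to the startup comparison. There 2.40 exceeds 2.37 * 1.01, and Merge Append is reported as the better path.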

To establish a stable foundation for discussion, I conducted simple
tests - see, for example, a couple of queries in the attachment. As I
see it, Sort->Append works faster: in my test bench, it takes 1250ms on
average versus 1430ms, and it also has lower costs - the same for data
with and without massive numbers of duplicates. Playing with sizes of
inputs, I see the same behaviour.

--
regards, Andrei Lepikhov

Attachments:

merge_sort.sql (application/sql)

v6-0001-Consider-explicit-sort-of-the-MergeAppend-subpath.patch (text/plain; charset=UTF-8)

From 37aa55b8d216ca015e7748985192fe8196f4f3fe Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Tue, 3 Jun 2025 11:37:23 +0200
Subject: [PATCH v6] Consider explicit sort of the MergeAppend subpaths.

Expand the optimiser search scope a little: when fetching optimal subpath
matching pathkeys of the planning MergeAppend, consider the extra case of
an overall-optimal path plus an explicit Sort node at the top.

It may provide a more effective plan in both full and fractional scan cases:
1. The Sort node may be pushed down to subpaths under a parallel or
async Append.
2. The case when a minor set of subpaths doesn't have a proper index, and it is
profitable to sort them instead of switching to plain Append.

In general, analysing the regression tests changed by this code, it seems that
the cost model prefers a series of small sortings instead of a single massive
one. This feature slightly increases the number of such paths.

Overhead:
It appears that multiple subpaths may be encountered, as well as
numerous pathkeys.
Therefore, to be as cautious as possible, only cost estimation is performed.
---
 .../postgres_fdw/expected/postgres_fdw.out    |   6 +-
 src/backend/optimizer/path/allpaths.c         |  47 ++----
 src/backend/optimizer/path/pathkeys.c         | 115 +++++++++++++++
 src/include/optimizer/paths.h                 |  10 ++
 .../regress/expected/collate.icu.utf8.out     |   9 +-
 src/test/regress/expected/inherit.out         |  19 ++-
 src/test/regress/expected/partition_join.out  | 139 +++++++++++-------
 src/test/regress/expected/partition_prune.out |  16 +-
 src/test/regress/expected/union.out           |   6 +-
 src/test/regress/sql/inherit.sql              |   4 +-
 10 files changed, 264 insertions(+), 107 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index eb4716bed81..1b22ab6f522 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10352,13 +10352,15 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
    ->  Nested Loop
          Join Filter: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t1_1.a
+                     ->  Foreign Scan on ftprt1_p1 t1_1
                ->  Foreign Scan on ftprt1_p2 t1_2
          ->  Materialize
                ->  Append
                      ->  Foreign Scan on ftprt2_p1 t2_1
                      ->  Foreign Scan on ftprt2_p2 t2_2
-(10 rows)
+(12 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
   a  |  b  
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 6cc6966b060..65ea9d477b0 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1855,29 +1855,18 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 
 			/* Locate the right paths, if they are available. */
 			cheapest_startup =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   STARTUP_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, STARTUP_COST, false);
 			cheapest_total =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   TOTAL_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, TOTAL_COST, false);
 
 			/*
-			 * If we can't find any paths with the right order just use the
-			 * cheapest-total path; we'll have to sort it later.
+			 * In accordance to current planning logic there are no
+			 * parameterised paths under a merge append.
 			 */
-			if (cheapest_startup == NULL || cheapest_total == NULL)
-			{
-				cheapest_startup = cheapest_total =
-					childrel->cheapest_total_path;
-				/* Assert we do have an unparameterized path for this child */
-				Assert(cheapest_total->param_info == NULL);
-			}
+			Assert(cheapest_startup != NULL && cheapest_total != NULL);
+			Assert(cheapest_total->param_info == NULL);
 
 			/*
 			 * When building a fractional path, determine a cheapest
@@ -1904,21 +1893,17 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 					path_fraction /= childrel->rows;
 
 				cheapest_fractional =
-					get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
-															  pathkeys,
-															  NULL,
-															  path_fraction);
+					get_cheapest_fractional_path_for_pathkeys_ext(root,
+																  childrel,
+																  pathkeys,
+																  NULL,
+																  path_fraction);
 
 				/*
-				 * If we found no path with matching pathkeys, use the
-				 * cheapest total path instead.
-				 *
-				 * XXX We might consider partially sorted paths too (with an
-				 * incremental sort on top). But we'd have to build all the
-				 * incremental paths, do the costing etc.
+				 * In accordance to current planning logic there are no
+				 * parameterised fractional paths under a merge append.
 				 */
-				if (!cheapest_fractional)
-					cheapest_fractional = cheapest_total;
+				Assert(cheapest_fractional != NULL);
 			}
 
 			/*
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 8b04d40d36d..3a13b3d02ee 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -19,11 +19,13 @@
 
 #include "access/stratnum.h"
 #include "catalog/pg_opfamily.h"
+#include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "optimizer/planner.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -648,6 +650,64 @@ get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_path_for_pathkeys_ext
+ *	  Calls get_cheapest_path_for_pathkeys to obtain the cheapest path that
+ *	  satisfies the given criteria and considers one more option: choose the
+ *	  overall-optimal path (according to the criterion) and explicitly sort
+ *	  its output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_path_for_pathkeys_ext(PlannerInfo *root, RelOptInfo *rel,
+								   List *pathkeys, Relids required_outer,
+								   CostSelector cost_criterion,
+								   bool require_parallel_safe)
+{
+	Path		sort_path;
+	Path	   *base_path = rel->cheapest_total_path;
+	Path	   *path;
+
+	/* In generate_orderedappend_paths() all childrels do have some paths */
+	Assert(base_path);
+
+	path = get_cheapest_path_for_pathkeys(rel->pathlist, pathkeys,
+										  required_outer, cost_criterion,
+										  require_parallel_safe);
+
+	/*
+	 * Stop here if the cheapest total path doesn't satisfy necessary
+	 * conditions
+	 */
+	if ((require_parallel_safe && !base_path->parallel_safe) ||
+		!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path == NULL)
+
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return base_path;
+
+	/* Consider the cheapest total path with extra sort */
+	if (path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (compare_path_costs(&sort_path, path, cost_criterion) < 0)
+			return base_path;
+	}
+	return path;
+}
+
 /*
  * get_cheapest_fractional_path_for_pathkeys
  *	  Find the cheapest path (for retrieving a specified fraction of all
@@ -690,6 +750,61 @@ get_cheapest_fractional_path_for_pathkeys(List *paths,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_fractional_path_for_pathkeys_ext
+ *	  Obtain the cheapest fractional path that satisfies the given criteria,
+ *	  excluding pathkeys, and explicitly sort its output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  List *pathkeys,
+											  Relids required_outer,
+											  double fraction)
+{
+	Path		sort_path;
+	Path	   *base_path = rel->cheapest_total_path;
+	Path	   *path;
+
+	/* In generate_orderedappend_paths() all childrels do have some paths */
+	Assert(base_path);
+
+	path = get_cheapest_fractional_path_for_pathkeys(rel->pathlist, pathkeys,
+													 required_outer, fraction);
+
+	/*
+	 * Stop here if the cheapest total path doesn't satisfy necessary
+	 * conditions
+	 */
+	if (!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path == NULL)
+
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return base_path;
+
+	/* Consider the cheapest total path with extra sort */
+	if (path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (compare_fractional_path_costs(&sort_path, path, fraction) <= 0)
+			return base_path;
+	}
+
+	return path;
+}
 
 /*
  * get_cheapest_parallel_safe_total_inner
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index a48c9721797..4bdc85afca9 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -224,10 +224,20 @@ extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
 											bool require_parallel_safe);
+extern Path *get_cheapest_path_for_pathkeys_ext(PlannerInfo *root,
+												RelOptInfo *rel, List *pathkeys,
+												Relids required_outer,
+												CostSelector cost_criterion,
+												bool require_parallel_safe);
 extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 													   List *pathkeys,
 													   Relids required_outer,
 													   double fraction);
+extern Path *get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+														   RelOptInfo *rel,
+														   List *pathkeys,
+														   Relids required_outer,
+														   double fraction);
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 								  ScanDirection scandir);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 69805d4b9ec..48d47bb7455 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -2412,16 +2412,19 @@ EXPLAIN (COSTS OFF)
 SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" ORDER BY 1;
                        QUERY PLAN                       
 --------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: ((pagg_tab3.c)::text) COLLATE "C"
-   ->  Append
+   ->  Sort
+         Sort Key: ((pagg_tab3.c)::text) COLLATE "C"
          ->  HashAggregate
                Group Key: (pagg_tab3.c)::text
                ->  Seq Scan on pagg_tab3_p2 pagg_tab3
+   ->  Sort
+         Sort Key: ((pagg_tab3_1.c)::text) COLLATE "C"
          ->  HashAggregate
                Group Key: (pagg_tab3_1.c)::text
                ->  Seq Scan on pagg_tab3_p1 pagg_tab3_1
-(9 rows)
+(12 rows)
 
 SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C" ORDER BY 1;
  c | count 
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index f9b0c415cfd..71036dc938f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1665,11 +1665,13 @@ insert into matest2 (name) values ('Test 3');
 insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
                          QUERY PLAN                         
 ------------------------------------------------------------
  Sort
+   Disabled: true
    Output: matest0.id, matest0.name, ((1 - matest0.id))
    Sort Key: ((1 - matest0.id))
    ->  Result
@@ -1683,7 +1685,7 @@ explain (verbose, costs off) select * from matest0 order by 1-id;
                      Output: matest0_3.id, matest0_3.name
                ->  Seq Scan on public.matest3 matest0_4
                      Output: matest0_4.id, matest0_4.name
-(14 rows)
+(15 rows)
 
 select * from matest0 order by 1-id;
  id |  name  
@@ -1719,6 +1721,7 @@ select min(1-id) from matest0;
 (1 row)
 
 reset enable_indexscan;
+reset enable_sort;
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
 explain (verbose, costs off) select * from matest0 order by 1-id;
@@ -1844,16 +1847,20 @@ order by t1.b limit 10;
          Merge Cond: (t1.b = t2.b)
          ->  Merge Append
                Sort Key: t1.b
-               ->  Index Scan using matest0i on matest0 t1_1
+               ->  Sort
+                     Sort Key: t1_1.b
+                     ->  Seq Scan on matest0 t1_1
                ->  Index Scan using matest1i on matest1 t1_2
          ->  Materialize
                ->  Merge Append
                      Sort Key: t2.b
-                     ->  Index Scan using matest0i on matest0 t2_1
-                           Filter: (c = d)
+                     ->  Sort
+                           Sort Key: t2_1.b
+                           ->  Seq Scan on matest0 t2_1
+                                 Filter: (c = d)
                      ->  Index Scan using matest1i on matest1 t2_2
                            Filter: (c = d)
-(14 rows)
+(18 rows)
 
 reset enable_nestloop;
 drop table matest0 cascade;
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index d5368186caa..55623d5219f 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -65,31 +65,34 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
 -- inner join with partially-redundant join clauses
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
-                          QUERY PLAN                           
----------------------------------------------------------------
- Sort
+                       QUERY PLAN                        
+---------------------------------------------------------
+ Merge Append
    Sort Key: t1.a
-   ->  Append
-         ->  Merge Join
-               Merge Cond: (t1_1.a = t2_1.a)
-               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
-               ->  Sort
-                     Sort Key: t2_1.b
-                     ->  Seq Scan on prt2_p1 t2_1
-                           Filter: (a = b)
+   ->  Merge Join
+         Merge Cond: (t1_1.a = t2_1.a)
+         ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t2_1.b
+               ->  Seq Scan on prt2_p1 t2_1
+                     Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Join
                Hash Cond: (t1_2.a = t2_2.a)
                ->  Seq Scan on prt1_p2 t1_2
                ->  Hash
                      ->  Seq Scan on prt2_p2 t2_2
                            Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Join
                Hash Cond: (t1_3.a = t2_3.a)
                ->  Seq Scan on prt1_p3 t1_3
                ->  Hash
                      ->  Seq Scan on prt2_p3 t2_3
                            Filter: (a = b)
-(22 rows)
+(25 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
  a  |  c   | b  |  c   
@@ -1383,28 +1386,32 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
 -- This should generate a partitionwise join, but currently fails to
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-                        QUERY PLAN                         
------------------------------------------------------------
- Incremental Sort
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Sort
    Sort Key: prt1.a, prt2.b
-   Presorted Key: prt1.a
-   ->  Merge Left Join
-         Merge Cond: (prt1.a = prt2.b)
-         ->  Sort
-               Sort Key: prt1.a
-               ->  Append
-                     ->  Seq Scan on prt1_p1 prt1_1
-                           Filter: ((a < 450) AND (b = 0))
-                     ->  Seq Scan on prt1_p2 prt1_2
-                           Filter: ((a < 450) AND (b = 0))
-         ->  Sort
-               Sort Key: prt2.b
-               ->  Append
+   ->  Merge Right Join
+         Merge Cond: (prt2.b = prt1.a)
+         ->  Append
+               ->  Sort
+                     Sort Key: prt2_1.b
                      ->  Seq Scan on prt2_p2 prt2_1
                            Filter: (b > 250)
+               ->  Sort
+                     Sort Key: prt2_2.b
                      ->  Seq Scan on prt2_p3 prt2_2
                            Filter: (b > 250)
-(19 rows)
+         ->  Materialize
+               ->  Append
+                     ->  Sort
+                           Sort Key: prt1_1.a
+                           ->  Seq Scan on prt1_p1 prt1_1
+                                 Filter: ((a < 450) AND (b = 0))
+                     ->  Sort
+                           Sort Key: prt1_2.a
+                           ->  Seq Scan on prt1_p2 prt1_2
+                                 Filter: ((a < 450) AND (b = 0))
+(23 rows)
 
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
   a  |  b  
@@ -1424,25 +1431,33 @@ SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT *
 -- partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
-                                       QUERY PLAN                                        
------------------------------------------------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Merge Join
-   Merge Cond: ((t1.a = t2.b) AND (((((t1.*)::prt1))::text) = ((((t2.*)::prt2))::text)))
-   ->  Sort
-         Sort Key: t1.a, ((((t1.*)::prt1))::text)
-         ->  Result
-               ->  Append
-                     ->  Seq Scan on prt1_p1 t1_1
-                     ->  Seq Scan on prt1_p2 t1_2
-                     ->  Seq Scan on prt1_p3 t1_3
-   ->  Sort
-         Sort Key: t2.b, ((((t2.*)::prt2))::text)
-         ->  Result
-               ->  Append
+   Merge Cond: (t1.a = t2.b)
+   Join Filter: ((((t2.*)::prt2))::text = (((t1.*)::prt1))::text)
+   ->  Append
+         ->  Sort
+               Sort Key: t1_1.a
+               ->  Seq Scan on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t1_2.a
+               ->  Seq Scan on prt1_p2 t1_2
+         ->  Sort
+               Sort Key: t1_3.a
+               ->  Seq Scan on prt1_p3 t1_3
+   ->  Materialize
+         ->  Append
+               ->  Sort
+                     Sort Key: t2_1.b
                      ->  Seq Scan on prt2_p1 t2_1
+               ->  Sort
+                     Sort Key: t2_2.b
                      ->  Seq Scan on prt2_p2 t2_2
+               ->  Sort
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3
-(16 rows)
+(24 rows)
 
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
  a  | b  
@@ -4508,9 +4523,10 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
                                    QUERY PLAN                                   
 --------------------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a
          ->  Hash Right Join
                Hash Cond: ((t3_1.a = t1_1.a) AND (t3_1.c = t1_1.c))
                ->  Seq Scan on plt1_adv_p1 t3_1
@@ -4521,6 +4537,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p1 t1_1
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Right Join
                Hash Cond: ((t3_2.a = t1_2.a) AND (t3_2.c = t1_2.c))
                ->  Seq Scan on plt1_adv_p2 t3_2
@@ -4531,6 +4549,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p2 t1_2
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Right Join
                Hash Cond: ((t3_3.a = t1_3.a) AND (t3_3.c = t1_3.c))
                ->  Seq Scan on plt1_adv_p3 t3_3
@@ -4541,15 +4561,19 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p3 t1_3
                                        Filter: (b < 10)
-         ->  Nested Loop Left Join
-               Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
-               ->  Nested Loop Left Join
-                     Join Filter: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+   ->  Nested Loop Left Join
+         Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
+         ->  Merge Left Join
+               Merge Cond: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+               ->  Sort
+                     Sort Key: t1_4.a, t1_4.c
                      ->  Seq Scan on plt1_adv_extra t1_4
                            Filter: (b < 10)
+               ->  Sort
+                     Sort Key: t2_4.a, t2_4.c
                      ->  Seq Scan on plt2_adv_extra t2_4
-               ->  Seq Scan on plt1_adv_extra t3_4
-(41 rows)
+         ->  Seq Scan on plt1_adv_extra t3_4
+(50 rows)
 
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
  a  |  c   | a |  c   | a |  c   
@@ -5037,21 +5061,26 @@ EXPLAIN (COSTS OFF)
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
                              QUERY PLAN                             
 --------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a, t1.b
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a, t1_1.b
          ->  Hash Join
                Hash Cond: ((t1_1.a = t2_1.a) AND (t1_1.b = t2_1.b))
                ->  Seq Scan on alpha_neg_p1 t1_1
                      Filter: ((b >= 125) AND (b < 225))
                ->  Hash
                      ->  Seq Scan on beta_neg_p1 t2_1
+   ->  Sort
+         Sort Key: t1_2.a, t1_2.b
          ->  Hash Join
                Hash Cond: ((t2_2.a = t1_2.a) AND (t2_2.b = t1_2.b))
                ->  Seq Scan on beta_neg_p2 t2_2
                ->  Hash
                      ->  Seq Scan on alpha_neg_p2 t1_2
                            Filter: ((b >= 125) AND (b < 225))
+   ->  Sort
+         Sort Key: t1_4.a, t1_4.b
          ->  Hash Join
                Hash Cond: ((t2_4.a = t1_4.a) AND (t2_4.b = t1_4.b))
                ->  Append
@@ -5066,7 +5095,7 @@ SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2
                                  Filter: ((b >= 125) AND (b < 225))
                            ->  Seq Scan on alpha_pos_p3 t1_6
                                  Filter: ((b >= 125) AND (b < 225))
-(29 rows)
+(34 rows)
 
 SELECT t1.*, t2.* FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a AND t1.b = t2.b) WHERE t1.b >= 125 AND t1.b < 225 ORDER BY t1.a, t1.b;
  a  |  b  |  c   | a  |  b  |  c   
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index d1966cd7d82..43e2962009e 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -4768,9 +4768,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc.a ORDER BY part_abc.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_1
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d <= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_1.a
+                           ->  Seq Scan on part_abc_2 part_abc_1
+                                 Filter: ((d <= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: part_abc_3.a
                            Subplans Removed: 1
@@ -4785,9 +4786,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc_5.a ORDER BY part_abc_5.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_6
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d >= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_6.a
+                           ->  Seq Scan on part_abc_2 part_abc_6
+                                 Filter: ((d >= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: a
                            Subplans Removed: 1
@@ -4797,7 +4799,7 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                            ->  Index Scan using part_abc_3_3_a_idx on part_abc_3_3 part_abc_9
                                  Index Cond: (a >= (stable_one() + 1))
                                  Filter: (d >= stable_one())
-(35 rows)
+(37 rows)
 
 drop view part_abc_view;
 drop table part_abc;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 96962817ed4..0ccadea910c 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1207,12 +1207,14 @@ select event_id
 ----------------------------------------------------------
  Merge Append
    Sort Key: events.event_id
-   ->  Index Scan using events_pkey on events
+   ->  Sort
+         Sort Key: events.event_id
+         ->  Seq Scan on events
    ->  Sort
          Sort Key: events_1.event_id
          ->  Seq Scan on events_child events_1
    ->  Index Scan using other_events_pkey on other_events
-(7 rows)
+(9 rows)
 
 drop table events_child, events, other_events;
 reset enable_indexonlyscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 699e8ac09c8..c58beebbd1e 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -641,12 +641,14 @@ insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
 
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
 select * from matest0 order by 1-id;
 explain (verbose, costs off) select min(1-id) from matest0;
 select min(1-id) from matest0;
 reset enable_indexscan;
+reset enable_sort;
 
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
-- 
2.49.0

#20

Alexander Korotkov

aekorotkov@gmail.com

7 months ago

In reply to: Andrei Lepikhov (#19)

Re: MergeAppend could consider sorting cheapest child path

On Tue, Jun 3, 2025 at 4:23 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

On 2/6/2025 20:21, Alexander Korotkov wrote:

I have the following question. I see the patch changes some existing
plans from Sort(Append(...)) to MergeAppend(Sort(), ..., Sort(...)) or
even Materialize(MergeAppend(Sort(), ..., Sort(...))). This points to
some problem in cost_sort(). Otherwise, it would mean that the Sort
node doesn't know how to do its job: explicitly splitting the dataset into
pieces and then merging the sorted results appears to be cheaper, but the Sort
node contains a merge-sort algorithm inside and is supposed to be more
efficient. Could you please revise the patch to avoid these
unwanted changes?

I think this issue is related to corner cases of
compare_path_costs_fuzzily().

Let's glance into one of the problematic queries:
EXPLAIN (COSTS ON)
SELECT c collate "C", count(c) FROM pagg_tab3 GROUP BY c collate "C"
ORDER BY 1;

If you play with the plan, you can find that the total_cost of the
Sort->Append path is lower:

Sort (cost=2.40..2.41 rows=4 width=40)
  -> Append (cost=1.15..2.36 rows=4 width=40)

Merge Append (cost=2.37..2.42 rows=4 width=40)

But the difference is less than fuzz_factor. In this case, Postgres
probes startup_cost, which is obviously less for the MergeAppend strategy.
This is a good decision, and I think it should stay as is.
What can we do here? We might change the test to increase the cost gap.
However, while designing this patch, I skimmed through each broken query
and didn't find a reason to specifically shift to the Sort->Append
strategy, as it tested things that were not dependent on Append or Sort.
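
A minimal sketch of the tie-break described above, assuming a fuzz factor of 1.01 (1%). The type and function names (SketchPath, sketch_compare_fuzzily) are made up for illustration; the real compare_path_costs_fuzzily() handles more cases, so treat this only as a simplified model:

```c
#include <assert.h>

/* Simplified stand-in for a planner path: only the two costs matter here. */
typedef struct SketchPath
{
    double startup_cost;
    double total_cost;
} SketchPath;

typedef enum SketchVerdict
{
    SKETCH_EQUAL,
    SKETCH_FIRST_BETTER,
    SKETCH_SECOND_BETTER
} SketchVerdict;

/*
 * Hypothetical, simplified fuzzy comparison: total cost decides only when
 * the gap exceeds fuzz_factor; otherwise startup cost is the tie-breaker.
 */
static SketchVerdict
sketch_compare_fuzzily(const SketchPath *p1, const SketchPath *p2,
                       double fuzz_factor)
{
    assert(fuzz_factor >= 1.0);

    /* Clear winner on total cost? */
    if (p1->total_cost > p2->total_cost * fuzz_factor)
        return SKETCH_SECOND_BETTER;
    if (p2->total_cost > p1->total_cost * fuzz_factor)
        return SKETCH_FIRST_BETTER;

    /* Totals are fuzzily equal: fall back to startup cost. */
    if (p1->startup_cost > p2->startup_cost * fuzz_factor)
        return SKETCH_SECOND_BETTER;
    if (p2->startup_cost > p1->startup_cost * fuzz_factor)
        return SKETCH_FIRST_BETTER;

    return SKETCH_EQUAL;
}
```

With the costs from the plans above, Sort->Append as {2.40, 2.41} and Merge Append as {2.37, 2.42}, the sketch picks the second path: the totals differ by less than 1%, while the startup costs differ by more than 1%.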

To establish a stable foundation for discussion, I conducted simple
tests - see, for example, a couple of queries in the attachment. As I
see it, Sort->Append works faster: in my test bench, it takes 1250ms on
average versus 1430ms, and it also has lower costs - the same for data
with and without massive numbers of duplicates. Playing with sizes of
inputs, I see the same behaviour.

I ran your tests. For the Sort(Append()) case I got actual
time=811.047..842.473. For the MergeAppend case I got actual
time=723.678..967.004. That looks interesting. At some point
we probably should teach our Sort node to start returning tuples before
finishing the last merge stage.

However, I think the costs don't match the timing. Our cost model
predicts that the startup cost of MergeAppend is less than the startup cost of
Sort(Append()). And that's correct. However, in fact the total time of
MergeAppend is bigger than the total time of Sort(Append()). The
differences in these two cases are comparable. I think we need to
adjust our cost_sort() to reflect that.

------
Regards,
Alexander Korotkov
Supabase

#21

Andrei Lepikhov

lepihov@gmail.com

7 months ago

In reply to: Alexander Korotkov (#20)

Re: MergeAppend could consider sorting cheapest child path

On 3/6/2025 15:38, Alexander Korotkov wrote:

On Tue, Jun 3, 2025 at 4:23 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

To establish a stable foundation for discussion, I conducted simple
tests - see, for example, a couple of queries in the attachment. As I
see it, Sort->Append works faster: in my test bench, it takes 1250ms on
average versus 1430ms, and it also has lower costs - the same for data
with and without massive numbers of duplicates. Playing with sizes of
inputs, I see the same behaviour.

I ran your tests. For the Sort(Append()) case I got actual
time=811.047..842.473. For the MergeAppend case I got actual
time=723.678..967.004. That looks interesting. At some point
we probably should teach our Sort node to start returning tuples before
finishing the last merge stage.

However, I think the costs don't match the timing. Our cost model
predicts that the startup cost of MergeAppend is less than the startup cost of
Sort(Append()). And that's correct. However, in fact the total time of
MergeAppend is bigger than the total time of Sort(Append()). The
differences in these two cases are comparable. I think we need to
adjust our cost_sort() to reflect that.

Could you explain your idea? As I see it (and have shown in the previous
message), the total cost of Sort->Append is lower than that of
MergeAppend->Sort.
Additionally, as I mentioned earlier, the primary reason for choosing
MergeAppend in the regression test was a slight total cost difference
that triggered the startup cost comparison.
Could you show the query and its EXPLAIN output that concern
you?

--
regards, Andrei Lepikhov

#22

Alexander Korotkov

aekorotkov@gmail.com

7 months ago

In reply to: Andrei Lepikhov (#21)

Re: MergeAppend could consider sorting cheapest child path

On Tue, Jun 3, 2025 at 4:53 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

On 3/6/2025 15:38, Alexander Korotkov wrote:

On Tue, Jun 3, 2025 at 4:23 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

To establish a stable foundation for discussion, I conducted simple
tests - see, for example, a couple of queries in the attachment. As I
see it, Sort->Append works faster: in my test bench, it takes 1250ms on
average versus 1430ms, and it also has lower costs - the same for data
with and without massive numbers of duplicates. Playing with sizes of
inputs, I see the same behaviour.

I ran your tests. For the Sort(Append()) case I got actual
time=811.047..842.473. For the MergeAppend case I got actual
time=723.678..967.004. That looks interesting. At some point
we probably should teach our Sort node to start returning tuples before
finishing the last merge stage.

However, I think the costs don't match the timing. Our cost model
predicts that the startup cost of MergeAppend is less than the startup cost of
Sort(Append()). And that's correct. However, in fact the total time of
MergeAppend is bigger than the total time of Sort(Append()). The
differences in these two cases are comparable. I think we need to
adjust our cost_sort() to reflect that.

Could you explain your idea? As I see it (and have shown in the previous
message), the total cost of Sort->Append is lower than that of
MergeAppend->Sort.
Additionally, as I mentioned earlier, the primary reason for choosing
MergeAppend in the regression test was a slight total cost difference
that triggered the startup cost comparison.
Could you show the query and its EXPLAIN output that concern
you?

My point is that the difference in total cost is very small. For small
datasets it could even be within the fuzzy limit. However, in
practice the difference in total time is as big as the difference in startup
time. So, it would be good to make the total cost difference bigger.

------
Regards,
Alexander Korotkov
Supabase

#23

Andrei Lepikhov

lepihov@gmail.com

7 months ago

In reply to: Alexander Korotkov (#22)

Re: MergeAppend could consider sorting cheapest child path

On 3/6/2025 16:05, Alexander Korotkov wrote:

On Tue, Jun 3, 2025 at 4:53 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

Additionally, as I mentioned earlier, the primary reason for choosing
MergeAppend in the regression test was a slight total cost difference
that triggered the startup cost comparison.
Could you show the query and its EXPLAIN output that concern
you?

My point is that the difference in total cost is very small. For small
datasets it could even be within the fuzzy limit. However, in
practice the difference in total time is as big as the difference in startup
time. So, it would be good to make the total cost difference bigger.

For me, it seems like a continuation of the 7d8ac98 discussion. We may
charge a small fee for MergeAppend to adjust the balance, of course.
However, I think this small change requires a series of benchmarks to
determine how it affects the overall cost balance. Without examples it
is hard to say how important this issue is and whether it is worth
commencing such work.

--
regards, Andrei Lepikhov

#24

Alexander Korotkov

aekorotkov@gmail.com

7 months ago

In reply to: Andrei Lepikhov (#23)

Re: MergeAppend could consider sorting cheapest child path

On Tue, Jun 3, 2025 at 5:35 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

On 3/6/2025 16:05, Alexander Korotkov wrote:

On Tue, Jun 3, 2025 at 4:53 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

Additionally, as I mentioned earlier, the primary reason for choosing
MergeAppend in the regression test was a slight total cost difference
that triggered the startup cost comparison.
Could you show the query and its EXPLAIN output that concern
you?

My point is that the difference in total cost is very small. For small
datasets it could even be within the fuzzy limit. However, in
practice the difference in total time is as big as the difference in startup
time. So, it would be good to make the total cost difference bigger.

For me, it seems like a continuation of the 7d8ac98 discussion. We may
charge a small fee for MergeAppend to adjust the balance, of course.
However, I think this small change requires a series of benchmarks to
determine how it affects the overall cost balance. Without examples it
is hard to say how important this issue is and whether it is worth
commencing such work.

Yes, I think it's fair to charge the MergeAppend node. We currently
cost it similarly to the Sort merge stage, but it's clearly more
expensive. It works at the executor level, dealing with slots etc.,
while the Sort node has a set of lower-level optimizations.

------
Regards,
Alexander Korotkov
Supabase

#25

Andrei Lepikhov

lepihov@gmail.com

7 months ago

In reply to: Alexander Korotkov (#24)

Re: MergeAppend could consider sorting cheapest child path

On 4/6/2025 00:41, Alexander Korotkov wrote:

On Tue, Jun 3, 2025 at 5:35 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

On 3/6/2025 16:05, Alexander Korotkov wrote:

On Tue, Jun 3, 2025 at 4:53 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

Additionally, as I mentioned earlier, the primary reason for choosing
MergeAppend in the regression test was a slight total cost difference
that triggered the startup cost comparison.
Could you show the query and its EXPLAIN output that concern
you?

My point is that the difference in total cost is very small. For small
datasets it could even be within the fuzzy limit. However, in
practice the difference in total time is as big as the difference in startup
time. So, it would be good to make the total cost difference bigger.

For me, it seems like a continuation of the 7d8ac98 discussion. We may
charge a small fee for MergeAppend to adjust the balance, of course.
However, I think this small change requires a series of benchmarks to
determine how it affects the overall cost balance. Without examples it
is hard to say how important this issue is and whether it is worth
commencing such work.

Yes, I think it's fair to charge the MergeAppend node. We currently
cost it similarly to the Sort merge stage, but it's clearly more
expensive. It works at the executor level, dealing with slots etc.,
while the Sort node has a set of lower-level optimizations.

As I see it, it makes sense to charge MergeAppend for the heap operation
or, what is more logical, reduce the charge on Sort due to its internal
optimisations.
Playing with both approaches, I found that they break many more tests
than the current patch does. Hence, additional analysis of the
results is needed to understand how correct these changes are.

--
regards, Andrei Lepikhov

#26

Andrei Lepikhov

lepihov@gmail.com

6 months ago

In reply to: Alexander Korotkov (#24)

2 attachment(s)

Re: MergeAppend could consider sorting cheapest child path

On 4/6/2025 00:41, Alexander Korotkov wrote:

On Tue, Jun 3, 2025 at 5:35 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

For me, it seems like a continuation of the 7d8ac98 discussion. We may
charge a small fee for MergeAppend to adjust the balance, of course.
However, I think this small change requires a series of benchmarks to
determine how it affects the overall cost balance. Without examples it
is hard to say how important this issue is and whether it is worth
commencing such work.

Yes, I think it's fair to charge the MergeAppend node. We currently
cost it similarly to the Sort merge stage, but it's clearly more
expensive. It works at the executor level, dealing with slots etc.,
while the Sort node has a set of lower-level optimizations.

After conducting additional research, I concluded that you are correct,
and the current cost model doesn't allow the optimiser to detect the
best option. A simple test with a full scan and sort of a partitioned
table (see attachment) shows that the planner prefers small sorts
merged by the MergeAppend node. I got the following results for
different numbers of tuples to be sorted (in the range from 10 tuples to
1E8):

EXPLAIN SELECT * FROM test ORDER BY y;

1E1: Sort (cost=9.53..9.57 rows=17 width=8)
1E2: Sort (cost=20.82..21.07 rows=100)
1E3: Merge Append (cost=56.19..83.69 rows=1000)
1E4: Merge Append (cost=612.74..887.74 rows=10000)
1E5: Merge Append (cost=7754.19..10504.19 rows=100000)
1E6: Merge Append (cost=94092.25..121592.25 rows=1000000)
1E7: Merge Append (cost=1106931.22..1381931.22 rows=10000000)
1E8: Merge Append (cost=14097413.40..16847413.40 rows=100000000)

That happens because both total costs lie within the fuzzy factor gap,
and the optimiser chooses the path based on the startup cost, which is
obviously better for the MergeAppend case.

At the same time, in actual execution the plan with a single Sort node
outperforms the MergeAppend plan:

1E3: MergeAppend: 1.927 ms, Sort: 0.720 ms
1E4: MergeAppend: 10.090 ms, Sort: 7.583 ms
1E5: MergeAppend: 118.885 ms, Sort: 88.492 ms
1E6: MergeAppend: 1372.717 ms, Sort: 1106.184 ms
1E7: MergeAppend: 15103.893 ms, Sort: 13415.806 ms
1E8: MergeAppend: 176553.133 ms, Sort: 149458.787 ms
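
The gap in the timings above can be expressed as a slowdown ratio per input size. A small sketch, using only the numbers reported above; the struct and function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* One row of the timing table above (times in milliseconds). */
typedef struct TimingRow
{
    const char *tuples;          /* input size label, e.g. "1E6" */
    double      merge_append_ms; /* MergeAppend over per-partition sorts */
    double      sort_ms;         /* single Sort over a plain Append */
} TimingRow;

static const TimingRow timings[] = {
    {"1E3", 1.927, 0.720},
    {"1E4", 10.090, 7.583},
    {"1E5", 118.885, 88.492},
    {"1E6", 1372.717, 1106.184},
    {"1E7", 15103.893, 13415.806},
    {"1E8", 176553.133, 149458.787},
};

#define N_TIMINGS (sizeof(timings) / sizeof(timings[0]))

/* How many times slower the MergeAppend plan ran than the single Sort. */
static double
slowdown(const TimingRow *row)
{
    assert(row->sort_ms > 0.0);
    return row->merge_append_ms / row->sort_ms;
}
```

On these numbers the MergeAppend plan is roughly 1.1x to 2.7x slower than the single Sort, with the largest relative gap at the smallest timed input.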

Looking inside the code, I found the only difference we can employ to
rationalise the cost model change: re-structuring of a heap. The
siftDown routine employs two tuple comparisons to find the proper
position for a tuple. So, we have grounds for changing the constant in
the cost model of the merge operation:

@@ -2448,7 +2448,7 @@ cost_merge_append(Path *path, PlannerInfo *root,
        logN = LOG2(N);

        /* Assumed cost per tuple comparison */
-       comparison_cost = 2.0 * cpu_operator_cost;
+       comparison_cost = 4.0 * cpu_operator_cost;

        /* Heap creation cost */
        startup_cost += comparison_cost * N * logN;

The same change also needs to be made in the cost_gather_merge
function, of course.
At this moment, I'm not sure that it should be multiplied by 2 - it is a
subject for further discussion. However, it fixes the current issue and
adds a little additional cost to the merge operation, which, as I see
it, is quite limited.
Please see the new version of the patch attached.
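
To make the "two comparisons" argument concrete, here is a textbook binary min-heap sift-down with a comparison counter. This is an illustrative sketch, not PostgreSQL's heap code; the names (less_than, sift_down, comparisons) are made up for the example:

```c
#include <assert.h>
#include <stddef.h>

/* Number of key comparisons performed so far (for illustration only). */
static unsigned long comparisons = 0;

/* Counted comparison: returns nonzero if a sorts before b. */
static int
less_than(int a, int b)
{
    comparisons++;
    return a < b;
}

/* Restore the min-heap property for heap[0..n-1], starting at index i. */
static void
sift_down(int *heap, size_t n, size_t i)
{
    assert(heap != NULL);

    for (;;)
    {
        size_t left = 2 * i + 1;
        size_t right = left + 1;
        size_t smallest = left;
        int    tmp;

        if (left >= n)
            break;              /* no children: the key has settled */

        /* Comparison 1: pick the smaller of the two children. */
        if (right < n && less_than(heap[right], heap[left]))
            smallest = right;

        /* Comparison 2: stop if the moving key already beats that child. */
        if (!less_than(heap[smallest], heap[i]))
            break;

        tmp = heap[i];
        heap[i] = heap[smallest];
        heap[smallest] = tmp;
        i = smallest;
    }
}
```

Sifting a large key down two levels of a seven-element heap, e.g. {100, 2, 3, 4, 5, 6, 7}, costs four comparisons: at each level, one to choose the smaller child and one to compare it with the moving key.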

--
regards, Andrei Lepikhov

Attachments:

sort-vs-mergeappend.sql (application/sql)

v7-0001-Consider-an-explicit-sort-of-the-MergeAppend-subp.patch (text/plain; charset=UTF-8)

From 588607223f21622f1b0225c95b8fa200165c62eb Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Tue, 3 Jun 2025 11:37:23 +0200
Subject: [PATCH v7] Consider an explicit sort of the MergeAppend subpaths.

Broaden the optimiser's search scope slightly: when retrieving optimal subpaths
that match pathkeys for the planning MergeAppend, also consider the case of
an overall optimal path that includes an explicit Sort node at the top.

It may provide a more effective plan in both full and fractional scan cases:
1. The Sort node may be pushed down to subpaths under a parallel or async Append.
2. The case when a minor set of subpaths doesn't have a proper index, and it
is profitable to sort them instead of switching to plain Append.

Having implemented that strategy, it became clear that the cost of multiple
small sortings merged by a single MergeAppend node exceeds that of a single Sort
operation over a plain Append. The code and benchmarks demonstrate that such
an assumption is incorrect because the Sort operator has optimisations that work
faster than a MergeAppend.
To arrange the cost model, change the merge cost multiplier, considering that
heap rebuilding needs two comparison operations.
---
 src/backend/optimizer/path/allpaths.c         |  47 ++--
 src/backend/optimizer/path/costsize.c         |   6 +-
 src/backend/optimizer/path/pathkeys.c         | 115 ++++++++++
 src/include/optimizer/paths.h                 |  10 +
 src/test/regress/expected/inherit.out         |  34 +--
 .../regress/expected/partition_aggregate.out  |  32 ++-
 src/test/regress/expected/partition_join.out  | 203 ++++++++++--------
 src/test/regress/expected/partition_prune.out |  16 +-
 src/test/regress/expected/union.out           |  69 +++---
 src/test/regress/sql/inherit.sql              |   4 +-
 10 files changed, 334 insertions(+), 202 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 6cc6966b060..65ea9d477b0 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1855,29 +1855,18 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 
 			/* Locate the right paths, if they are available. */
 			cheapest_startup =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   STARTUP_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, STARTUP_COST, false);
 			cheapest_total =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   TOTAL_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, TOTAL_COST, false);
 
 			/*
-			 * If we can't find any paths with the right order just use the
-			 * cheapest-total path; we'll have to sort it later.
+			 * In accordance to current planning logic there are no
+			 * parameterised paths under a merge append.
 			 */
-			if (cheapest_startup == NULL || cheapest_total == NULL)
-			{
-				cheapest_startup = cheapest_total =
-					childrel->cheapest_total_path;
-				/* Assert we do have an unparameterized path for this child */
-				Assert(cheapest_total->param_info == NULL);
-			}
+			Assert(cheapest_startup != NULL && cheapest_total != NULL);
+			Assert(cheapest_total->param_info == NULL);
 
 			/*
 			 * When building a fractional path, determine a cheapest
@@ -1904,21 +1893,17 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 					path_fraction /= childrel->rows;
 
 				cheapest_fractional =
-					get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
-															  pathkeys,
-															  NULL,
-															  path_fraction);
+					get_cheapest_fractional_path_for_pathkeys_ext(root,
+																  childrel,
+																  pathkeys,
+																  NULL,
+																  path_fraction);
 
 				/*
-				 * If we found no path with matching pathkeys, use the
-				 * cheapest total path instead.
-				 *
-				 * XXX We might consider partially sorted paths too (with an
-				 * incremental sort on top). But we'd have to build all the
-				 * incremental paths, do the costing etc.
+				 * In accordance to current planning logic there are no
+				 * parameterised fractional paths under a merge append.
 				 */
-				if (!cheapest_fractional)
-					cheapest_fractional = cheapest_total;
+				Assert(cheapest_fractional != NULL);
 			}
 
 			/*
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1f04a2c182c..d22eb8468a4 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -512,7 +512,7 @@ cost_gather_merge(GatherMergePath *path, PlannerInfo *root,
 	logN = LOG2(N);
 
 	/* Assumed cost per tuple comparison */
-	comparison_cost = 2.0 * cpu_operator_cost;
+	comparison_cost = 4.0 * cpu_operator_cost;
 
 	/* Heap creation cost */
 	startup_cost += comparison_cost * N * logN;
@@ -1965,7 +1965,7 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		*startup_cost = 2.0 * comparison_cost * tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
@@ -2474,7 +2474,7 @@ cost_merge_append(Path *path, PlannerInfo *root,
 	logN = LOG2(N);
 
 	/* Assumed cost per tuple comparison */
-	comparison_cost = 2.0 * cpu_operator_cost;
+	comparison_cost = 4.0 * cpu_operator_cost;
 
 	/* Heap creation cost */
 	startup_cost += comparison_cost * N * logN;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 8b04d40d36d..3a13b3d02ee 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -19,11 +19,13 @@
 
 #include "access/stratnum.h"
 #include "catalog/pg_opfamily.h"
+#include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "optimizer/planner.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -648,6 +650,64 @@ get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_path_for_pathkeys_ext
+ *	  Calls get_cheapest_path_for_pathkeys to obtain cheapest path that
+ *	  satisfies defined criterias and сonsiders one more option: choose
+ *	  overall-optimal path (according the criterion) and explicitly sort its
+ *	  output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_path_for_pathkeys_ext(PlannerInfo *root, RelOptInfo *rel,
+								   List *pathkeys, Relids required_outer,
+								   CostSelector cost_criterion,
+								   bool require_parallel_safe)
+{
+	Path		sort_path;
+	Path	   *base_path = rel->cheapest_total_path;
+	Path	   *path;
+
+	/* In generate_orderedappend_paths() all childrels do have some paths */
+	Assert(base_path);
+
+	path = get_cheapest_path_for_pathkeys(rel->pathlist, pathkeys,
+										  required_outer, cost_criterion,
+										  require_parallel_safe);
+
+	/*
+	 * Stop here if the cheapest total path doesn't satisfy necessary
+	 * conditions
+	 */
+	if ((require_parallel_safe && !base_path->parallel_safe) ||
+		!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path == NULL)
+
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return base_path;
+
+	/* Consider the cheapest total path with extra sort */
+	if (path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (compare_path_costs(&sort_path, path, cost_criterion) < 0)
+			return base_path;
+	}
+	return path;
+}
+
 /*
  * get_cheapest_fractional_path_for_pathkeys
  *	  Find the cheapest path (for retrieving a specified fraction of all
@@ -690,6 +750,61 @@ get_cheapest_fractional_path_for_pathkeys(List *paths,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_fractional_path_for_pathkeys_ext
+ *	  Obtain the cheapest fractional path that satisfies the given criteria
+ *	  excluding pathkeys, and explicitly sort its output to satisfy the pathkeys.
+ *
+ *	  The caller is responsible for adding the corresponding sort path on top
+ *	  of the returned path if it is chosen.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  List *pathkeys,
+											  Relids required_outer,
+											  double fraction)
+{
+	Path		sort_path;
+	Path	   *base_path = rel->cheapest_total_path;
+	Path	   *path;
+
+	/* In generate_orderedappend_paths() all childrels do have some paths */
+	Assert(base_path);
+
+	path = get_cheapest_fractional_path_for_pathkeys(rel->pathlist, pathkeys,
+													 required_outer, fraction);
+
+	/*
+	 * Stop here if the cheapest total path doesn't satisfy necessary
+	 * conditions
+	 */
+	if (!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path == NULL)
+
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return base_path;
+
+	/* Consider the cheapest total path with extra sort */
+	if (path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (compare_fractional_path_costs(&sort_path, path, fraction) <= 0)
+			return base_path;
+	}
+
+	return path;
+}
 
 /*
  * get_cheapest_parallel_safe_total_inner
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 8410531f2d6..26cd4585c97 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -222,10 +222,20 @@ extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
 											bool require_parallel_safe);
+extern Path *get_cheapest_path_for_pathkeys_ext(PlannerInfo *root,
+												RelOptInfo *rel, List *pathkeys,
+												Relids required_outer,
+												CostSelector cost_criterion,
+												bool require_parallel_safe);
 extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 													   List *pathkeys,
 													   Relids required_outer,
 													   double fraction);
+extern Path *get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+														   RelOptInfo *rel,
+														   List *pathkeys,
+														   Relids required_outer,
+														   double fraction);
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 								  ScanDirection scandir);
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 5b5055babdc..bb6125090bd 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1665,11 +1665,13 @@ insert into matest2 (name) values ('Test 3');
 insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
                          QUERY PLAN                         
 ------------------------------------------------------------
  Sort
+   Disabled: true
    Output: matest0.id, matest0.name, ((1 - matest0.id))
    Sort Key: ((1 - matest0.id))
    ->  Result
@@ -1683,7 +1685,7 @@ explain (verbose, costs off) select * from matest0 order by 1-id;
                      Output: matest0_3.id, matest0_3.name
                ->  Seq Scan on public.matest3 matest0_4
                      Output: matest0_4.id, matest0_4.name
-(14 rows)
+(15 rows)
 
 select * from matest0 order by 1-id;
  id |  name  
@@ -1719,6 +1721,7 @@ select min(1-id) from matest0;
 (1 row)
 
 reset enable_indexscan;
+reset enable_sort;
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
 explain (verbose, costs off) select * from matest0 order by 1-id;
@@ -1844,16 +1847,20 @@ order by t1.b limit 10;
          Merge Cond: (t1.b = t2.b)
          ->  Merge Append
                Sort Key: t1.b
-               ->  Index Scan using matest0i on matest0 t1_1
+               ->  Sort
+                     Sort Key: t1_1.b
+                     ->  Seq Scan on matest0 t1_1
                ->  Index Scan using matest1i on matest1 t1_2
          ->  Materialize
                ->  Merge Append
                      Sort Key: t2.b
-                     ->  Index Scan using matest0i on matest0 t2_1
-                           Filter: (c = d)
+                     ->  Sort
+                           Sort Key: t2_1.b
+                           ->  Seq Scan on matest0 t2_1
+                                 Filter: (c = d)
                      ->  Index Scan using matest1i on matest1 t2_2
                            Filter: (c = d)
-(14 rows)
+(18 rows)
 
 reset enable_nestloop;
 drop table matest0 cascade;
@@ -1867,17 +1874,16 @@ analyze matest0;
 analyze matest1;
 explain (costs off)
 select * from matest0 where a < 100 order by a;
-                          QUERY PLAN                           
----------------------------------------------------------------
- Merge Append
+                QUERY PLAN                 
+-------------------------------------------
+ Sort
    Sort Key: matest0.a
-   ->  Index Only Scan using matest0_pkey on matest0 matest0_1
-         Index Cond: (a < 100)
-   ->  Sort
-         Sort Key: matest0_2.a
+   ->  Append
+         ->  Seq Scan on matest0 matest0_1
+               Filter: (a < 100)
          ->  Seq Scan on matest1 matest0_2
                Filter: (a < 100)
-(8 rows)
+(7 rows)
 
 drop table matest0 cascade;
 NOTICE:  drop cascades to table matest1
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 5f2c0cf5786..9245506279b 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -1380,28 +1380,26 @@ SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) <
 -- When GROUP BY clause does not match; partial aggregation is performed for each partition.
 EXPLAIN (COSTS OFF)
 SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y HAVING avg(x) < 12 ORDER BY 1, 2, 3;
-                                        QUERY PLAN                                         
--------------------------------------------------------------------------------------------
+                                     QUERY PLAN                                      
+-------------------------------------------------------------------------------------
  Sort
    Sort Key: pagg_tab_para.y, (sum(pagg_tab_para.x)), (avg(pagg_tab_para.x))
-   ->  Finalize GroupAggregate
+   ->  Finalize HashAggregate
          Group Key: pagg_tab_para.y
          Filter: (avg(pagg_tab_para.x) < '12'::numeric)
-         ->  Gather Merge
+         ->  Gather
                Workers Planned: 2
-               ->  Sort
-                     Sort Key: pagg_tab_para.y
-                     ->  Parallel Append
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_para.y
-                                 ->  Parallel Seq Scan on pagg_tab_para_p1 pagg_tab_para
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_para_1.y
-                                 ->  Parallel Seq Scan on pagg_tab_para_p2 pagg_tab_para_1
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_para_2.y
-                                 ->  Parallel Seq Scan on pagg_tab_para_p3 pagg_tab_para_2
-(19 rows)
+               ->  Parallel Append
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_para.y
+                           ->  Parallel Seq Scan on pagg_tab_para_p1 pagg_tab_para
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_para_1.y
+                           ->  Parallel Seq Scan on pagg_tab_para_p2 pagg_tab_para_1
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_para_2.y
+                           ->  Parallel Seq Scan on pagg_tab_para_p3 pagg_tab_para_2
+(17 rows)
 
 SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y HAVING avg(x) < 12 ORDER BY 1, 2, 3;
  y  |  sum  |         avg         | count 
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index d5368186caa..601b951d1e7 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -65,31 +65,34 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
 -- inner join with partially-redundant join clauses
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
-                          QUERY PLAN                           
----------------------------------------------------------------
- Sort
+                       QUERY PLAN                        
+---------------------------------------------------------
+ Merge Append
    Sort Key: t1.a
-   ->  Append
-         ->  Merge Join
-               Merge Cond: (t1_1.a = t2_1.a)
-               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
-               ->  Sort
-                     Sort Key: t2_1.b
-                     ->  Seq Scan on prt2_p1 t2_1
-                           Filter: (a = b)
+   ->  Merge Join
+         Merge Cond: (t1_1.a = t2_1.a)
+         ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t2_1.b
+               ->  Seq Scan on prt2_p1 t2_1
+                     Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Join
                Hash Cond: (t1_2.a = t2_2.a)
                ->  Seq Scan on prt1_p2 t1_2
                ->  Hash
                      ->  Seq Scan on prt2_p2 t2_2
                            Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Join
                Hash Cond: (t1_3.a = t2_3.a)
                ->  Seq Scan on prt1_p3 t1_3
                ->  Hash
                      ->  Seq Scan on prt2_p3 t2_3
                            Filter: (a = b)
-(22 rows)
+(25 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
  a  |  c   | b  |  c   
@@ -643,52 +646,41 @@ EXPLAIN (COSTS OFF)
 SELECT a, b FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b)
   WHERE a BETWEEN 490 AND 510
   GROUP BY 1, 2 ORDER BY 1, 2;
-                                                   QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
+                                                QUERY PLAN                                                 
+-----------------------------------------------------------------------------------------------------------
  Group
    Group Key: (COALESCE(prt1.a, p2.a)), (COALESCE(prt1.b, p2.b))
-   ->  Merge Append
+   ->  Sort
          Sort Key: (COALESCE(prt1.a, p2.a)), (COALESCE(prt1.b, p2.b))
-         ->  Group
-               Group Key: (COALESCE(prt1.a, p2.a)), (COALESCE(prt1.b, p2.b))
-               ->  Sort
-                     Sort Key: (COALESCE(prt1.a, p2.a)), (COALESCE(prt1.b, p2.b))
-                     ->  Merge Full Join
-                           Merge Cond: ((prt1.a = p2.a) AND (prt1.b = p2.b))
-                           Filter: ((COALESCE(prt1.a, p2.a) >= 490) AND (COALESCE(prt1.a, p2.a) <= 510))
-                           ->  Sort
-                                 Sort Key: prt1.a, prt1.b
-                                 ->  Seq Scan on prt1_p1 prt1
-                           ->  Sort
-                                 Sort Key: p2.a, p2.b
-                                 ->  Seq Scan on prt2_p1 p2
-         ->  Group
-               Group Key: (COALESCE(prt1_1.a, p2_1.a)), (COALESCE(prt1_1.b, p2_1.b))
-               ->  Sort
-                     Sort Key: (COALESCE(prt1_1.a, p2_1.a)), (COALESCE(prt1_1.b, p2_1.b))
-                     ->  Merge Full Join
-                           Merge Cond: ((prt1_1.a = p2_1.a) AND (prt1_1.b = p2_1.b))
-                           Filter: ((COALESCE(prt1_1.a, p2_1.a) >= 490) AND (COALESCE(prt1_1.a, p2_1.a) <= 510))
-                           ->  Sort
-                                 Sort Key: prt1_1.a, prt1_1.b
-                                 ->  Seq Scan on prt1_p2 prt1_1
-                           ->  Sort
-                                 Sort Key: p2_1.a, p2_1.b
-                                 ->  Seq Scan on prt2_p2 p2_1
-         ->  Group
-               Group Key: (COALESCE(prt1_2.a, p2_2.a)), (COALESCE(prt1_2.b, p2_2.b))
-               ->  Sort
-                     Sort Key: (COALESCE(prt1_2.a, p2_2.a)), (COALESCE(prt1_2.b, p2_2.b))
-                     ->  Merge Full Join
-                           Merge Cond: ((prt1_2.a = p2_2.a) AND (prt1_2.b = p2_2.b))
-                           Filter: ((COALESCE(prt1_2.a, p2_2.a) >= 490) AND (COALESCE(prt1_2.a, p2_2.a) <= 510))
-                           ->  Sort
-                                 Sort Key: prt1_2.a, prt1_2.b
-                                 ->  Seq Scan on prt1_p3 prt1_2
-                           ->  Sort
-                                 Sort Key: p2_2.a, p2_2.b
-                                 ->  Seq Scan on prt2_p3 p2_2
-(43 rows)
+         ->  Append
+               ->  Merge Full Join
+                     Merge Cond: ((prt1_1.a = p2_1.a) AND (prt1_1.b = p2_1.b))
+                     Filter: ((COALESCE(prt1_1.a, p2_1.a) >= 490) AND (COALESCE(prt1_1.a, p2_1.a) <= 510))
+                     ->  Sort
+                           Sort Key: prt1_1.a, prt1_1.b
+                           ->  Seq Scan on prt1_p1 prt1_1
+                     ->  Sort
+                           Sort Key: p2_1.a, p2_1.b
+                           ->  Seq Scan on prt2_p1 p2_1
+               ->  Merge Full Join
+                     Merge Cond: ((prt1_2.a = p2_2.a) AND (prt1_2.b = p2_2.b))
+                     Filter: ((COALESCE(prt1_2.a, p2_2.a) >= 490) AND (COALESCE(prt1_2.a, p2_2.a) <= 510))
+                     ->  Sort
+                           Sort Key: prt1_2.a, prt1_2.b
+                           ->  Seq Scan on prt1_p2 prt1_2
+                     ->  Sort
+                           Sort Key: p2_2.a, p2_2.b
+                           ->  Seq Scan on prt2_p2 p2_2
+               ->  Merge Full Join
+                     Merge Cond: ((prt1_3.a = p2_3.a) AND (prt1_3.b = p2_3.b))
+                     Filter: ((COALESCE(prt1_3.a, p2_3.a) >= 490) AND (COALESCE(prt1_3.a, p2_3.a) <= 510))
+                     ->  Sort
+                           Sort Key: prt1_3.a, prt1_3.b
+                           ->  Seq Scan on prt1_p3 prt1_3
+                     ->  Sort
+                           Sort Key: p2_3.a, p2_3.b
+                           ->  Seq Scan on prt2_p3 p2_3
+(32 rows)
 
 SELECT a, b FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b)
   WHERE a BETWEEN 490 AND 510
@@ -1383,28 +1375,32 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
 -- This should generate a partitionwise join, but currently fails to
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-                        QUERY PLAN                         
------------------------------------------------------------
- Incremental Sort
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Sort
    Sort Key: prt1.a, prt2.b
-   Presorted Key: prt1.a
-   ->  Merge Left Join
-         Merge Cond: (prt1.a = prt2.b)
-         ->  Sort
-               Sort Key: prt1.a
-               ->  Append
-                     ->  Seq Scan on prt1_p1 prt1_1
-                           Filter: ((a < 450) AND (b = 0))
-                     ->  Seq Scan on prt1_p2 prt1_2
-                           Filter: ((a < 450) AND (b = 0))
-         ->  Sort
-               Sort Key: prt2.b
-               ->  Append
+   ->  Merge Right Join
+         Merge Cond: (prt2.b = prt1.a)
+         ->  Append
+               ->  Sort
+                     Sort Key: prt2_1.b
                      ->  Seq Scan on prt2_p2 prt2_1
                            Filter: (b > 250)
+               ->  Sort
+                     Sort Key: prt2_2.b
                      ->  Seq Scan on prt2_p3 prt2_2
                            Filter: (b > 250)
-(19 rows)
+         ->  Materialize
+               ->  Append
+                     ->  Sort
+                           Sort Key: prt1_1.a
+                           ->  Seq Scan on prt1_p1 prt1_1
+                                 Filter: ((a < 450) AND (b = 0))
+                     ->  Sort
+                           Sort Key: prt1_2.a
+                           ->  Seq Scan on prt1_p2 prt1_2
+                                 Filter: ((a < 450) AND (b = 0))
+(23 rows)
 
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
   a  |  b  
@@ -1424,25 +1420,33 @@ SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT *
 -- partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
-                                       QUERY PLAN                                        
------------------------------------------------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Merge Join
-   Merge Cond: ((t1.a = t2.b) AND (((((t1.*)::prt1))::text) = ((((t2.*)::prt2))::text)))
-   ->  Sort
-         Sort Key: t1.a, ((((t1.*)::prt1))::text)
-         ->  Result
-               ->  Append
-                     ->  Seq Scan on prt1_p1 t1_1
-                     ->  Seq Scan on prt1_p2 t1_2
-                     ->  Seq Scan on prt1_p3 t1_3
-   ->  Sort
-         Sort Key: t2.b, ((((t2.*)::prt2))::text)
-         ->  Result
-               ->  Append
+   Merge Cond: (t1.a = t2.b)
+   Join Filter: ((((t2.*)::prt2))::text = (((t1.*)::prt1))::text)
+   ->  Append
+         ->  Sort
+               Sort Key: t1_1.a
+               ->  Seq Scan on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t1_2.a
+               ->  Seq Scan on prt1_p2 t1_2
+         ->  Sort
+               Sort Key: t1_3.a
+               ->  Seq Scan on prt1_p3 t1_3
+   ->  Materialize
+         ->  Append
+               ->  Sort
+                     Sort Key: t2_1.b
                      ->  Seq Scan on prt2_p1 t2_1
+               ->  Sort
+                     Sort Key: t2_2.b
                      ->  Seq Scan on prt2_p2 t2_2
+               ->  Sort
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3
-(16 rows)
+(24 rows)
 
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
  a  | b  
@@ -4508,9 +4512,10 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
                                    QUERY PLAN                                   
 --------------------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a
          ->  Hash Right Join
                Hash Cond: ((t3_1.a = t1_1.a) AND (t3_1.c = t1_1.c))
                ->  Seq Scan on plt1_adv_p1 t3_1
@@ -4521,6 +4526,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p1 t1_1
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Right Join
                Hash Cond: ((t3_2.a = t1_2.a) AND (t3_2.c = t1_2.c))
                ->  Seq Scan on plt1_adv_p2 t3_2
@@ -4531,6 +4538,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p2 t1_2
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Right Join
                Hash Cond: ((t3_3.a = t1_3.a) AND (t3_3.c = t1_3.c))
                ->  Seq Scan on plt1_adv_p3 t3_3
@@ -4541,15 +4550,19 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p3 t1_3
                                        Filter: (b < 10)
-         ->  Nested Loop Left Join
-               Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
-               ->  Nested Loop Left Join
-                     Join Filter: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+   ->  Nested Loop Left Join
+         Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
+         ->  Merge Left Join
+               Merge Cond: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+               ->  Sort
+                     Sort Key: t1_4.a, t1_4.c
                      ->  Seq Scan on plt1_adv_extra t1_4
                            Filter: (b < 10)
+               ->  Sort
+                     Sort Key: t2_4.a, t2_4.c
                      ->  Seq Scan on plt2_adv_extra t2_4
-               ->  Seq Scan on plt1_adv_extra t3_4
-(41 rows)
+         ->  Seq Scan on plt1_adv_extra t3_4
+(50 rows)
 
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
  a  |  c   | a |  c   | a |  c   
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index d1966cd7d82..43e2962009e 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -4768,9 +4768,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc.a ORDER BY part_abc.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_1
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d <= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_1.a
+                           ->  Seq Scan on part_abc_2 part_abc_1
+                                 Filter: ((d <= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: part_abc_3.a
                            Subplans Removed: 1
@@ -4785,9 +4786,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc_5.a ORDER BY part_abc_5.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_6
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d >= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_6.a
+                           ->  Seq Scan on part_abc_2 part_abc_6
+                                 Filter: ((d >= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: a
                            Subplans Removed: 1
@@ -4797,7 +4799,7 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                            ->  Index Scan using part_abc_3_3_a_idx on part_abc_3_3 part_abc_9
                                  Index Cond: (a >= (stable_one() + 1))
                                  Filter: (d >= stable_one())
-(35 rows)
+(37 rows)
 
 drop view part_abc_view;
 drop table part_abc;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 96962817ed4..e6ba4880b9c 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1207,12 +1207,14 @@ select event_id
 ----------------------------------------------------------
  Merge Append
    Sort Key: events.event_id
-   ->  Index Scan using events_pkey on events
+   ->  Sort
+         Sort Key: events.event_id
+         ->  Seq Scan on events
    ->  Sort
          Sort Key: events_1.event_id
          ->  Seq Scan on events_child events_1
    ->  Index Scan using other_events_pkey on other_events
-(7 rows)
+(9 rows)
 
 drop table events_child, events, other_events;
 reset enable_indexonlyscan;
@@ -1334,24 +1336,22 @@ select distinct q1 from
    union all
    select distinct * from int8_tbl i82) ss
 where q2 = q2;
-                        QUERY PLAN                        
-----------------------------------------------------------
- Unique
-   ->  Merge Append
-         Sort Key: "*SELECT* 1".q1
+                     QUERY PLAN                     
+----------------------------------------------------
+ HashAggregate
+   Group Key: "*SELECT* 1".q1
+   ->  Append
          ->  Subquery Scan on "*SELECT* 1"
-               ->  Unique
-                     ->  Sort
-                           Sort Key: i81.q1, i81.q2
-                           ->  Seq Scan on int8_tbl i81
-                                 Filter: (q2 IS NOT NULL)
+               ->  HashAggregate
+                     Group Key: i81.q1, i81.q2
+                     ->  Seq Scan on int8_tbl i81
+                           Filter: (q2 IS NOT NULL)
          ->  Subquery Scan on "*SELECT* 2"
-               ->  Unique
-                     ->  Sort
-                           Sort Key: i82.q1, i82.q2
-                           ->  Seq Scan on int8_tbl i82
-                                 Filter: (q2 IS NOT NULL)
-(15 rows)
+               ->  HashAggregate
+                     Group Key: i82.q1, i82.q2
+                     ->  Seq Scan on int8_tbl i82
+                           Filter: (q2 IS NOT NULL)
+(13 rows)
 
 select distinct q1 from
   (select distinct * from int8_tbl i81
@@ -1370,24 +1370,25 @@ select distinct q1 from
    union all
    select distinct * from int8_tbl i82) ss
 where -q1 = q2;
-                       QUERY PLAN                       
---------------------------------------------------------
+                          QUERY PLAN                          
+--------------------------------------------------------------
  Unique
-   ->  Merge Append
+   ->  Sort
          Sort Key: "*SELECT* 1".q1
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Unique
-                     ->  Sort
-                           Sort Key: i81.q1, i81.q2
-                           ->  Seq Scan on int8_tbl i81
-                                 Filter: ((- q1) = q2)
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Unique
-                     ->  Sort
-                           Sort Key: i82.q1, i82.q2
-                           ->  Seq Scan on int8_tbl i82
-                                 Filter: ((- q1) = q2)
-(15 rows)
+         ->  Append
+               ->  Subquery Scan on "*SELECT* 1"
+                     ->  Unique
+                           ->  Sort
+                                 Sort Key: i81.q1, i81.q2
+                                 ->  Seq Scan on int8_tbl i81
+                                       Filter: ((- q1) = q2)
+               ->  Subquery Scan on "*SELECT* 2"
+                     ->  Unique
+                           ->  Sort
+                                 Sort Key: i82.q1, i82.q2
+                                 ->  Seq Scan on int8_tbl i82
+                                       Filter: ((- q1) = q2)
+(16 rows)
 
 select distinct q1 from
   (select distinct * from int8_tbl i81
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 699e8ac09c8..c58beebbd1e 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -641,12 +641,14 @@ insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
 
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
 select * from matest0 order by 1-id;
 explain (verbose, costs off) select min(1-id) from matest0;
 select min(1-id) from matest0;
 reset enable_indexscan;
+reset enable_sort;
 
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
-- 
2.50.1
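
A minimal standalone sketch of the choice that get_cheapest_path_for_pathkeys_ext() makes in the patch above, assuming a toy path struct and a crude sort cost model. ToyPath, toy_log2, toy_sorted_total_cost and toy_choose_ordered_child are hypothetical names used only for illustration. The real function relies on cost_sort() and compare_path_costs() and also checks parameterization and parallel safety, all of which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a planner path; not PostgreSQL's Path struct. */
typedef struct ToyPath
{
	double		total_cost;		/* estimated cost to fetch all rows */
	double		rows;			/* estimated number of rows */
} ToyPath;

/* Crude log2: number of halvings until the value drops to 1 or below. */
static double
toy_log2(double n)
{
	double		result = 0.0;

	while (n > 1.0)
	{
		n /= 2.0;
		result += 1.0;
	}
	return result;
}

/* Simplified sort cost: input cost plus roughly N * log2(N) comparisons. */
static double
toy_sorted_total_cost(const ToyPath *input, double comparison_cost)
{
	return input->total_cost +
		comparison_cost * input->rows * toy_log2(input->rows);
}

/*
 * Pick the child path for an ordered append.  "presorted" is the cheapest
 * path that already satisfies the pathkeys (may be NULL); "cheapest" is the
 * cheapest path overall.  If the result is not "presorted", the caller is
 * expected to add an explicit sort on top of it.
 */
static const ToyPath *
toy_choose_ordered_child(const ToyPath *presorted, const ToyPath *cheapest,
						 double comparison_cost)
{
	assert(cheapest != NULL);

	/* Nothing presorted, or the presorted path is already the cheapest. */
	if (presorted == NULL || presorted == cheapest)
		return cheapest;

	/* Prefer cheapest-plus-sort only when it is strictly cheaper. */
	if (toy_sorted_total_cost(cheapest, comparison_cost) < presorted->total_cost)
		return cheapest;

	return presorted;
}
```

For example, with an ordered index path that filters a lot of data (made-up total cost 1000) and a cheapest path of cost 50 over 100 rows, the sketch returns the cheapest path, which is the case the patch is meant to catch.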

#27

Alexander Korotkov

aekorotkov@gmail.com

6 months ago

In reply to: Andrei Lepikhov (#26)

2 attachment(s)

Re: MergeAppend could consider sorting cheapest child path

On Tue, Jul 22, 2025 at 2:13 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

On 4/6/2025 00:41, Alexander Korotkov wrote:

On Tue, Jun 3, 2025 at 5:35 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

To me, it seems like a continuation of the 7d8ac98 discussion. We may
charge a small fee for MergeAppend to adjust the balance, of course.
However, I think this small change requires a series of benchmarks to
determine how it affects the overall cost balance. Without examples, it
is hard to say how important this issue is and whether it is worth
starting such work.

Yes, I think it's fair to charge the MergeAppend node. We currently
cost it similarly to the Sort merge stage, but it's clearly more
expensive. It works at the executor level, dealing with slots etc.,
while the Sort node has a set of lower-level optimizations.

After conducting additional research, I concluded that you are correct,
and the current cost model doesn't allow the optimiser to detect the
best option. A simple test with a full scan and sort of a partitioned
table (see attachment) shows that the query plan prefers small sortings
merged by the MergeAppend node. I have got the following results for
different numbers of tuples to be sorted (in the range from 10 tuples to
1E8):

EXPLAIN SELECT * FROM test ORDER BY y;

1E1: Sort (cost=9.53..9.57 rows=17 width=8)
1E2: Sort (cost=20.82..21.07 rows=100)
1E3: Merge Append (cost=56.19..83.69 rows=1000)
1E4: Merge Append (cost=612.74..887.74 rows=10000)
1E5: Merge Append (cost=7754.19..10504.19 rows=100000)
1E6: Merge Append (cost=94092.25..121592.25 rows=1000000)
1E7: Merge Append (cost=1106931.22..1381931.22 rows=10000000)
1E8: Merge Append (cost=14097413.40..16847413.40 rows=100000000)

That happens because both total costs lie within the fuzzy factor gap,
and the optimiser chooses the path based on the startup cost, which is
obviously better for the MergeAppend case.
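
The tie-break described above can be illustrated with a minimal sketch. It is not the planner's actual code: the struct, the function name and the 1% fuzz factor are assumptions made for illustration, and only the "compare totals fuzzily, then fall back to startup cost" rule is modelled.

```c
/* Simplified stand-in for a planner path: only the two costs matter here. */
typedef struct
{
	const char *label;
	double		startup_cost;
	double		total_cost;
} SketchPath;

/* Assumed fuzz factor: totals within 1% of each other count as equal. */
#define SKETCH_FUZZ_FACTOR 1.01

/*
 * Return the path that wins a fuzzy comparison.  Total cost decides unless
 * both totals fall within the fuzz factor; in that case startup cost does.
 */
static const SketchPath *
sketch_pick_path(const SketchPath *a, const SketchPath *b)
{
	if (a->total_cost > b->total_cost * SKETCH_FUZZ_FACTOR)
		return b;
	if (b->total_cost > a->total_cost * SKETCH_FUZZ_FACTOR)
		return a;

	/* Totals are fuzzily equal: fall back to startup cost. */
	return (a->startup_cost <= b->startup_cost) ? a : b;
}
```

For instance, taking the 1E3 Merge Append costs from the list above (56.19..83.69) and a hypothetical Sort path costed at 80.00..83.00, the totals differ by less than 1%, so the sketch returns the Merge Append path because of its lower startup cost.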

At the same time, execution, involving a single Sort node, dominates the
MergeAppend case:

1E3: MergeAppend: 1.927 ms, Sort: 0.720 ms
1E4: MergeAppend: 10.090 ms, Sort: 7.583 ms
1E5: MergeAppend: 118.885 ms, Sort: 88.492 ms
1E6: MergeAppend: 1372.717 ms, Sort: 1106.184 ms
1E7: MergeAppend: 15103.893 ms, Sort: 13415.806 ms
1E8: MergeAppend: 176553.133 ms, Sort: 149458.787 ms

Looking inside the code, I found the only difference we can employ to
rationalise the cost model change: re-structuring of a heap. The
siftDown routine employs two tuple comparisons to find the proper
position for a tuple. So, we have grounds for changing the constant in
the cost model of the merge operation:

@@ -2448,7 +2448,7 @@ cost_merge_append(Path *path, PlannerInfo *root,
logN = LOG2(N);
/* Assumed cost per tuple comparison */
-       comparison_cost = 2.0 * cpu_operator_cost;
+       comparison_cost = 4.0 * cpu_operator_cost;
/* Heap creation cost */
startup_cost += comparison_cost * N * logN;

The same change also needs to be made in the cost_gather_merge
function, of course.
At this moment, I'm not sure that it should be multiplied by 2 - it is a
subject for further discussion. However, it fixes the current issue and
adds a little additional cost to the merge operation, which, as I see
it, is quite limited.
Please see the new version of the patch attached.

I've checked the cost adjustment you've made. If you change the cost of
top-N sort, you must also change the preceding if condition that decides
when it gets selected. We currently switch at tuples = 2 * limit_tuples. If
we apply the proposed change, we should switch at tuples = limit_tuples ^ 2.
Also, for this cost to reflect reality, we must change
tuplesort_puttuple_common() in the same way. These inconsistencies lead to
failures in the contrib/postgres_fdw checks. But these changes don't appear
to be a win. See the example below. Top-N sort appears to be a win for
LIMITs up to 500000.

# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 10000;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=112157.85..112182.85 rows=10000 width=8) (actual time=638.998..639.841 rows=10000.00 loops=1)
   ->  Sort  (cost=112157.85..114657.85 rows=1000000 width=8) (actual time=638.996..639.340 rows=10000.00 loops=1)
         Sort Key: (random())
         Sort Method: quicksort  Memory: 24577kB
         ->  Function Scan on generate_series i  (cost=0.00..12500.00 rows=1000000 width=8) (actual time=126.582..205.610 rows=1000000.00 loops=1)
 Planning Time: 0.118 ms
 Execution Time: 653.283 ms
(7 rows)

# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 500000;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=112157.85..113407.85 rows=500000 width=8) (actual time=646.522..688.573 rows=500000.00 loops=1)
   ->  Sort  (cost=112157.85..114657.85 rows=1000000 width=8) (actual time=646.520..663.562 rows=500000.00 loops=1)
         Sort Key: (random())
         Sort Method: quicksort  Memory: 24577kB
         ->  Function Scan on generate_series i  (cost=0.00..12500.00 rows=1000000 width=8) (actual time=129.028..208.936 rows=1000000.00 loops=1)
 Planning Time: 0.188 ms
 Execution Time: 713.738 ms
(7 rows)

# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 10000;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=78938.56..78963.56 rows=10000 width=8) (actual time=412.633..413.459 rows=10000.00 loops=1)
   Buffers: shared hit=3, temp read=1709 written=1709
   ->  Sort  (cost=78938.56..81438.56 rows=1000000 width=8) (actual time=412.631..412.969 rows=10000.00 loops=1)
         Sort Key: (random())
         Sort Method: top-N heapsort  Memory: 769kB
         Buffers: shared hit=3, temp read=1709 written=1709
         ->  Function Scan on generate_series i  (cost=0.00..12500.00 rows=1000000 width=8) (actual time=185.892..333.233 rows=1000000.00 loops=1)
               Buffers: temp read=1709 written=1709
 Planning:
   Buffers: shared hit=7 read=8
 Planning Time: 2.058 ms
 Execution Time: 416.040 ms
(12 rows)

# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 500000;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=112157.85..113407.85 rows=500000 width=8) (actual time=631.185..673.768 rows=500000.00 loops=1)
   ->  Sort  (cost=112157.85..114657.85 rows=1000000 width=8) (actual time=631.183..649.072 rows=500000.00 loops=1)
         Sort Key: (random())
         Sort Method: quicksort  Memory: 24577kB
         ->  Function Scan on generate_series i  (cost=0.00..12500.00 rows=1000000 width=8) (actual time=121.274..200.453 rows=1000000.00 loops=1)
 Planning Time: 0.243 ms
 Execution Time: 698.841 ms
(7 rows)
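
The current switch rule mentioned before these plans (tuples = 2 * limit_tuples) can be sketched as a small predicate. This is only an illustration: the function name is made up, and memory limits, which also influence the real decision, are left out on purpose.

```c
#include <stdbool.h>

/*
 * Simplified form of the current rule: a bounded (top-N) heapsort is assumed
 * once the input holds more than twice as many tuples as the LIMIT.
 * Memory constraints are deliberately ignored in this sketch.
 */
static bool
sketch_uses_topn_heapsort(double tuples, double limit_tuples)
{
	return tuples > 2.0 * limit_tuples;
}
```

With 1000000 input rows, the predicate is true for LIMIT 10000 and false for LIMIT 500000, which agrees with the "top-N heapsort" and "quicksort" sort methods reported in the last two plans above.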

I have another idea. cost_tuplesort() puts 2.0 under the logarithm to
prefer quicksort over heapsort. I think we can adjust cost_gather_merge()
and cost_merge_append() to do the same. The 0001 patch implements that. I
think the plan changes in 0001 might be reasonable, since most cases deal
with small rowsets. One thing concerns me: 0002 still affects one of the
postgres_fdw checks. Could you please take a look?
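
To get a feel for how much putting 2.0 under the logarithm adds, here is a rough sketch of the heap creation term (comparison_cost * N * logN) computed both ways. It is not the costsize.c code: the function and macro names are made up, cpu_operator_cost is fixed at its default of 0.0025, and N is restricted to powers of two (N = 2^k) so that both logarithms are exact integers.

```c
/* Default value of cpu_operator_cost, hard-coded for this sketch. */
#define SKETCH_CPU_OPERATOR_COST 0.0025

/*
 * Heap creation cost for N = 2^k input streams (k >= 1).
 * With N a power of two, LOG2(N) = k and LOG2(2.0 * N) = k + 1.
 * Pass doubled_arg = 1 to use LOG2(2.0 * N), 0 to use LOG2(N).
 */
static double
sketch_heap_creation_cost(int k, int doubled_arg)
{
	double		N = (double) (1 << k);
	double		logN = (double) k + (doubled_arg ? 1.0 : 0.0);
	double		comparison_cost = 2.0 * SKETCH_CPU_OPERATOR_COST;

	return comparison_cost * N * logN;
}
```

Under these assumptions the extra factor is (k + 1) / k: the estimate doubles for two streams, which is the same effect as raising the multiplier from 2.0 to 4.0, but grows by only about 10% for 1024 streams.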

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v8-0001-Add-some-penatly-to-gather-merge-and-merge-append.patch (application/octet-stream)

From 414dc72019f71e589ae5d16a4692419750a04392 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sun, 27 Jul 2025 01:15:57 +0300
Subject: [PATCH v8 1/2] Add some penatly to gather merge and merge append

Reported-by:
Bug:
Discussion:
Author:
Co-authored-by:
Reviewed-by:
Tested-by:
Backpatch-through:
---
 src/backend/optimizer/path/costsize.c         |  4 +-
 src/test/regress/expected/inherit.out         | 15 ++--
 .../regress/expected/partition_aggregate.out  | 32 ++++----
 src/test/regress/expected/partition_join.out  | 75 ++++++++-----------
 src/test/regress/expected/union.out           | 63 ++++++++--------
 5 files changed, 87 insertions(+), 102 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1f04a2c182c..7ff861247af 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -509,7 +509,7 @@ cost_gather_merge(GatherMergePath *path, PlannerInfo *root,
 	 */
 	Assert(path->num_workers > 0);
 	N = (double) path->num_workers + 1;
-	logN = LOG2(N);
+	logN = LOG2(2.0 * N);
 
 	/* Assumed cost per tuple comparison */
 	comparison_cost = 2.0 * cpu_operator_cost;
@@ -2471,7 +2471,7 @@ cost_merge_append(Path *path, PlannerInfo *root,
 	 * Avoid log(0)...
 	 */
 	N = (n_streams < 2) ? 2.0 : (double) n_streams;
-	logN = LOG2(N);
+	logN = LOG2(2.0 * N);
 
 	/* Assumed cost per tuple comparison */
 	comparison_cost = 2.0 * cpu_operator_cost;
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 5b5055babdc..b82da190534 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1867,17 +1867,16 @@ analyze matest0;
 analyze matest1;
 explain (costs off)
 select * from matest0 where a < 100 order by a;
-                          QUERY PLAN                           
----------------------------------------------------------------
- Merge Append
+                QUERY PLAN                 
+-------------------------------------------
+ Sort
    Sort Key: matest0.a
-   ->  Index Only Scan using matest0_pkey on matest0 matest0_1
-         Index Cond: (a < 100)
-   ->  Sort
-         Sort Key: matest0_2.a
+   ->  Append
+         ->  Seq Scan on matest0 matest0_1
+               Filter: (a < 100)
          ->  Seq Scan on matest1 matest0_2
                Filter: (a < 100)
-(8 rows)
+(7 rows)
 
 drop table matest0 cascade;
 NOTICE:  drop cascades to table matest1
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 5f2c0cf5786..9245506279b 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -1380,28 +1380,26 @@ SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) <
 -- When GROUP BY clause does not match; partial aggregation is performed for each partition.
 EXPLAIN (COSTS OFF)
 SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y HAVING avg(x) < 12 ORDER BY 1, 2, 3;
-                                        QUERY PLAN                                         
--------------------------------------------------------------------------------------------
+                                     QUERY PLAN                                      
+-------------------------------------------------------------------------------------
  Sort
    Sort Key: pagg_tab_para.y, (sum(pagg_tab_para.x)), (avg(pagg_tab_para.x))
-   ->  Finalize GroupAggregate
+   ->  Finalize HashAggregate
          Group Key: pagg_tab_para.y
          Filter: (avg(pagg_tab_para.x) < '12'::numeric)
-         ->  Gather Merge
+         ->  Gather
                Workers Planned: 2
-               ->  Sort
-                     Sort Key: pagg_tab_para.y
-                     ->  Parallel Append
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_para.y
-                                 ->  Parallel Seq Scan on pagg_tab_para_p1 pagg_tab_para
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_para_1.y
-                                 ->  Parallel Seq Scan on pagg_tab_para_p2 pagg_tab_para_1
-                           ->  Partial HashAggregate
-                                 Group Key: pagg_tab_para_2.y
-                                 ->  Parallel Seq Scan on pagg_tab_para_p3 pagg_tab_para_2
-(19 rows)
+               ->  Parallel Append
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_para.y
+                           ->  Parallel Seq Scan on pagg_tab_para_p1 pagg_tab_para
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_para_1.y
+                           ->  Parallel Seq Scan on pagg_tab_para_p2 pagg_tab_para_1
+                     ->  Partial HashAggregate
+                           Group Key: pagg_tab_para_2.y
+                           ->  Parallel Seq Scan on pagg_tab_para_p3 pagg_tab_para_2
+(17 rows)
 
 SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y HAVING avg(x) < 12 ORDER BY 1, 2, 3;
  y  |  sum  |         avg         | count 
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index d5368186caa..0431e183783 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -643,52 +643,41 @@ EXPLAIN (COSTS OFF)
 SELECT a, b FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b)
   WHERE a BETWEEN 490 AND 510
   GROUP BY 1, 2 ORDER BY 1, 2;
-                                                   QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
+                                                QUERY PLAN                                                 
+-----------------------------------------------------------------------------------------------------------
  Group
    Group Key: (COALESCE(prt1.a, p2.a)), (COALESCE(prt1.b, p2.b))
-   ->  Merge Append
+   ->  Sort
          Sort Key: (COALESCE(prt1.a, p2.a)), (COALESCE(prt1.b, p2.b))
-         ->  Group
-               Group Key: (COALESCE(prt1.a, p2.a)), (COALESCE(prt1.b, p2.b))
-               ->  Sort
-                     Sort Key: (COALESCE(prt1.a, p2.a)), (COALESCE(prt1.b, p2.b))
-                     ->  Merge Full Join
-                           Merge Cond: ((prt1.a = p2.a) AND (prt1.b = p2.b))
-                           Filter: ((COALESCE(prt1.a, p2.a) >= 490) AND (COALESCE(prt1.a, p2.a) <= 510))
-                           ->  Sort
-                                 Sort Key: prt1.a, prt1.b
-                                 ->  Seq Scan on prt1_p1 prt1
-                           ->  Sort
-                                 Sort Key: p2.a, p2.b
-                                 ->  Seq Scan on prt2_p1 p2
-         ->  Group
-               Group Key: (COALESCE(prt1_1.a, p2_1.a)), (COALESCE(prt1_1.b, p2_1.b))
-               ->  Sort
-                     Sort Key: (COALESCE(prt1_1.a, p2_1.a)), (COALESCE(prt1_1.b, p2_1.b))
-                     ->  Merge Full Join
-                           Merge Cond: ((prt1_1.a = p2_1.a) AND (prt1_1.b = p2_1.b))
-                           Filter: ((COALESCE(prt1_1.a, p2_1.a) >= 490) AND (COALESCE(prt1_1.a, p2_1.a) <= 510))
-                           ->  Sort
-                                 Sort Key: prt1_1.a, prt1_1.b
-                                 ->  Seq Scan on prt1_p2 prt1_1
-                           ->  Sort
-                                 Sort Key: p2_1.a, p2_1.b
-                                 ->  Seq Scan on prt2_p2 p2_1
-         ->  Group
-               Group Key: (COALESCE(prt1_2.a, p2_2.a)), (COALESCE(prt1_2.b, p2_2.b))
-               ->  Sort
-                     Sort Key: (COALESCE(prt1_2.a, p2_2.a)), (COALESCE(prt1_2.b, p2_2.b))
-                     ->  Merge Full Join
-                           Merge Cond: ((prt1_2.a = p2_2.a) AND (prt1_2.b = p2_2.b))
-                           Filter: ((COALESCE(prt1_2.a, p2_2.a) >= 490) AND (COALESCE(prt1_2.a, p2_2.a) <= 510))
-                           ->  Sort
-                                 Sort Key: prt1_2.a, prt1_2.b
-                                 ->  Seq Scan on prt1_p3 prt1_2
-                           ->  Sort
-                                 Sort Key: p2_2.a, p2_2.b
-                                 ->  Seq Scan on prt2_p3 p2_2
-(43 rows)
+         ->  Append
+               ->  Merge Full Join
+                     Merge Cond: ((prt1_1.a = p2_1.a) AND (prt1_1.b = p2_1.b))
+                     Filter: ((COALESCE(prt1_1.a, p2_1.a) >= 490) AND (COALESCE(prt1_1.a, p2_1.a) <= 510))
+                     ->  Sort
+                           Sort Key: prt1_1.a, prt1_1.b
+                           ->  Seq Scan on prt1_p1 prt1_1
+                     ->  Sort
+                           Sort Key: p2_1.a, p2_1.b
+                           ->  Seq Scan on prt2_p1 p2_1
+               ->  Merge Full Join
+                     Merge Cond: ((prt1_2.a = p2_2.a) AND (prt1_2.b = p2_2.b))
+                     Filter: ((COALESCE(prt1_2.a, p2_2.a) >= 490) AND (COALESCE(prt1_2.a, p2_2.a) <= 510))
+                     ->  Sort
+                           Sort Key: prt1_2.a, prt1_2.b
+                           ->  Seq Scan on prt1_p2 prt1_2
+                     ->  Sort
+                           Sort Key: p2_2.a, p2_2.b
+                           ->  Seq Scan on prt2_p2 p2_2
+               ->  Merge Full Join
+                     Merge Cond: ((prt1_3.a = p2_3.a) AND (prt1_3.b = p2_3.b))
+                     Filter: ((COALESCE(prt1_3.a, p2_3.a) >= 490) AND (COALESCE(prt1_3.a, p2_3.a) <= 510))
+                     ->  Sort
+                           Sort Key: prt1_3.a, prt1_3.b
+                           ->  Seq Scan on prt1_p3 prt1_3
+                     ->  Sort
+                           Sort Key: p2_3.a, p2_3.b
+                           ->  Seq Scan on prt2_p3 p2_3
+(32 rows)
 
 SELECT a, b FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b)
   WHERE a BETWEEN 490 AND 510
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 96962817ed4..d0977b05f98 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1334,24 +1334,22 @@ select distinct q1 from
    union all
    select distinct * from int8_tbl i82) ss
 where q2 = q2;
-                        QUERY PLAN                        
-----------------------------------------------------------
- Unique
-   ->  Merge Append
-         Sort Key: "*SELECT* 1".q1
+                     QUERY PLAN                     
+----------------------------------------------------
+ HashAggregate
+   Group Key: "*SELECT* 1".q1
+   ->  Append
          ->  Subquery Scan on "*SELECT* 1"
-               ->  Unique
-                     ->  Sort
-                           Sort Key: i81.q1, i81.q2
-                           ->  Seq Scan on int8_tbl i81
-                                 Filter: (q2 IS NOT NULL)
+               ->  HashAggregate
+                     Group Key: i81.q1, i81.q2
+                     ->  Seq Scan on int8_tbl i81
+                           Filter: (q2 IS NOT NULL)
          ->  Subquery Scan on "*SELECT* 2"
-               ->  Unique
-                     ->  Sort
-                           Sort Key: i82.q1, i82.q2
-                           ->  Seq Scan on int8_tbl i82
-                                 Filter: (q2 IS NOT NULL)
-(15 rows)
+               ->  HashAggregate
+                     Group Key: i82.q1, i82.q2
+                     ->  Seq Scan on int8_tbl i82
+                           Filter: (q2 IS NOT NULL)
+(13 rows)
 
 select distinct q1 from
   (select distinct * from int8_tbl i81
@@ -1370,24 +1368,25 @@ select distinct q1 from
    union all
    select distinct * from int8_tbl i82) ss
 where -q1 = q2;
-                       QUERY PLAN                       
---------------------------------------------------------
+                          QUERY PLAN                          
+--------------------------------------------------------------
  Unique
-   ->  Merge Append
+   ->  Sort
          Sort Key: "*SELECT* 1".q1
-         ->  Subquery Scan on "*SELECT* 1"
-               ->  Unique
-                     ->  Sort
-                           Sort Key: i81.q1, i81.q2
-                           ->  Seq Scan on int8_tbl i81
-                                 Filter: ((- q1) = q2)
-         ->  Subquery Scan on "*SELECT* 2"
-               ->  Unique
-                     ->  Sort
-                           Sort Key: i82.q1, i82.q2
-                           ->  Seq Scan on int8_tbl i82
-                                 Filter: ((- q1) = q2)
-(15 rows)
+         ->  Append
+               ->  Subquery Scan on "*SELECT* 1"
+                     ->  Unique
+                           ->  Sort
+                                 Sort Key: i81.q1, i81.q2
+                                 ->  Seq Scan on int8_tbl i81
+                                       Filter: ((- q1) = q2)
+               ->  Subquery Scan on "*SELECT* 2"
+                     ->  Unique
+                           ->  Sort
+                                 Sort Key: i82.q1, i82.q2
+                                 ->  Seq Scan on int8_tbl i82
+                                       Filter: ((- q1) = q2)
+(16 rows)
 
 select distinct q1 from
   (select distinct * from int8_tbl i81
-- 
2.39.5 (Apple Git-154)

v8-0002-Consider-an-explicit-sort-of-the-MergeAppend-subp.patch (application/octet-stream)

From 5068658e93c50496a95b909db46cf2d1d5334ea7 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Tue, 3 Jun 2025 11:37:23 +0200
Subject: [PATCH v8 2/2] Consider an explicit sort of the MergeAppend subpaths.

Broaden the optimiser's search scope slightly: when retrieving optimal subpaths
that match pathkeys for the planning MergeAppend, also consider the case of
an overall optimal path that includes an explicit Sort node at the top.

It may provide a more effective plan in both full and fractional scan cases:
1. The Sort node may be pushed down to subpaths under a parallel or async Append.
2. The case when a minor set of subpaths doesn't have a proper index, and it
is profitable to sort them instead of switching to plain Append.

Having implemented that strategy, it became clear that the cost model rates
multiple small sorts merged by a single MergeAppend node as cheaper than a
single Sort operation over a plain Append. The code and benchmarks demonstrate
that this estimate is incorrect, because the Sort operator has optimisations
that make it faster than MergeAppend.
To rebalance the cost model, change the merge cost multiplier, considering that
heap rebuilding needs two comparison operations.
---
 .../postgres_fdw/expected/postgres_fdw.out    |   6 +-
 src/backend/optimizer/path/allpaths.c         |  47 +++----
 src/backend/optimizer/path/pathkeys.c         | 115 ++++++++++++++++
 src/include/optimizer/paths.h                 |  10 ++
 src/test/regress/expected/inherit.out         |  19 ++-
 src/test/regress/expected/partition_join.out  | 128 +++++++++++-------
 src/test/regress/expected/partition_prune.out |  16 ++-
 src/test/regress/expected/union.out           |   6 +-
 src/test/regress/sql/inherit.sql              |   4 +-
 9 files changed, 250 insertions(+), 101 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 4b6e49a5d95..acaa5d713d8 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10373,13 +10373,15 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
    ->  Nested Loop
          Join Filter: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t1_1.a
+                     ->  Foreign Scan on ftprt1_p1 t1_1
                ->  Foreign Scan on ftprt1_p2 t1_2
          ->  Materialize
                ->  Append
                      ->  Foreign Scan on ftprt2_p1 t2_1
                      ->  Foreign Scan on ftprt2_p2 t2_2
-(10 rows)
+(12 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
   a  |  b  
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 6cc6966b060..65ea9d477b0 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1855,29 +1855,18 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 
 			/* Locate the right paths, if they are available. */
 			cheapest_startup =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   STARTUP_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, STARTUP_COST, false);
 			cheapest_total =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   TOTAL_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, TOTAL_COST, false);
 
 			/*
-			 * If we can't find any paths with the right order just use the
-			 * cheapest-total path; we'll have to sort it later.
+			 * In accordance with current planning logic there are no
+			 * parameterised paths under a merge append.
 			 */
-			if (cheapest_startup == NULL || cheapest_total == NULL)
-			{
-				cheapest_startup = cheapest_total =
-					childrel->cheapest_total_path;
-				/* Assert we do have an unparameterized path for this child */
-				Assert(cheapest_total->param_info == NULL);
-			}
+			Assert(cheapest_startup != NULL && cheapest_total != NULL);
+			Assert(cheapest_total->param_info == NULL);
 
 			/*
 			 * When building a fractional path, determine a cheapest
@@ -1904,21 +1893,17 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 					path_fraction /= childrel->rows;
 
 				cheapest_fractional =
-					get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
-															  pathkeys,
-															  NULL,
-															  path_fraction);
+					get_cheapest_fractional_path_for_pathkeys_ext(root,
+																  childrel,
+																  pathkeys,
+																  NULL,
+																  path_fraction);
 
 				/*
-				 * If we found no path with matching pathkeys, use the
-				 * cheapest total path instead.
-				 *
-				 * XXX We might consider partially sorted paths too (with an
-				 * incremental sort on top). But we'd have to build all the
-				 * incremental paths, do the costing etc.
+				 * In accordance with current planning logic there are no
+				 * parameterised fractional paths under a merge append.
 				 */
-				if (!cheapest_fractional)
-					cheapest_fractional = cheapest_total;
+				Assert(cheapest_fractional != NULL);
 			}
 
 			/*
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 8b04d40d36d..3a13b3d02ee 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -19,11 +19,13 @@
 
 #include "access/stratnum.h"
 #include "catalog/pg_opfamily.h"
+#include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "optimizer/planner.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -648,6 +650,64 @@ get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_path_for_pathkeys_ext
+ *	  Calls get_cheapest_path_for_pathkeys to obtain the cheapest path
+ *	  that satisfies the given criteria, and considers one more option:
+ *	  choose the overall-optimal path (according to the criterion) and
+ *	  explicitly sort its output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_path_for_pathkeys_ext(PlannerInfo *root, RelOptInfo *rel,
+								   List *pathkeys, Relids required_outer,
+								   CostSelector cost_criterion,
+								   bool require_parallel_safe)
+{
+	Path		sort_path;
+	Path	   *base_path = rel->cheapest_total_path;
+	Path	   *path;
+
+	/* In generate_orderedappend_paths() all childrels do have some paths */
+	Assert(base_path);
+
+	path = get_cheapest_path_for_pathkeys(rel->pathlist, pathkeys,
+										  required_outer, cost_criterion,
+										  require_parallel_safe);
+
+	/*
+	 * Stop here if the cheapest total path doesn't satisfy necessary
+	 * conditions
+	 */
+	if ((require_parallel_safe && !base_path->parallel_safe) ||
+		!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path == NULL)
+
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return base_path;
+
+	/* Consider the cheapest total path with extra sort */
+	if (path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (compare_path_costs(&sort_path, path, cost_criterion) < 0)
+			return base_path;
+	}
+	return path;
+}
+
 /*
  * get_cheapest_fractional_path_for_pathkeys
  *	  Find the cheapest path (for retrieving a specified fraction of all
@@ -690,6 +750,61 @@ get_cheapest_fractional_path_for_pathkeys(List *paths,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_fractional_path_for_pathkeys_ext
+ *	  Obtain the cheapest fractional path that satisfies the given criteria
+ *	  other than pathkeys, and explicitly sort its output to satisfy them.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  List *pathkeys,
+											  Relids required_outer,
+											  double fraction)
+{
+	Path		sort_path;
+	Path	   *base_path = rel->cheapest_total_path;
+	Path	   *path;
+
+	/* In generate_orderedappend_paths() all childrels do have some paths */
+	Assert(base_path);
+
+	path = get_cheapest_fractional_path_for_pathkeys(rel->pathlist, pathkeys,
+													 required_outer, fraction);
+
+	/*
+	 * Stop here if the cheapest total path doesn't satisfy necessary
+	 * conditions
+	 */
+	if (!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path == NULL)
+
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return base_path;
+
+	/* Consider the cheapest total path with extra sort */
+	if (path != base_path)
+	{
+		cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+				  base_path->total_cost, base_path->rows,
+				  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+
+		if (compare_fractional_path_costs(&sort_path, path, fraction) <= 0)
+			return base_path;
+	}
+
+	return path;
+}
 
 /*
  * get_cheapest_parallel_safe_total_inner
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 8410531f2d6..26cd4585c97 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -222,10 +222,20 @@ extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
 											bool require_parallel_safe);
+extern Path *get_cheapest_path_for_pathkeys_ext(PlannerInfo *root,
+												RelOptInfo *rel, List *pathkeys,
+												Relids required_outer,
+												CostSelector cost_criterion,
+												bool require_parallel_safe);
 extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 													   List *pathkeys,
 													   Relids required_outer,
 													   double fraction);
+extern Path *get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+														   RelOptInfo *rel,
+														   List *pathkeys,
+														   Relids required_outer,
+														   double fraction);
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 								  ScanDirection scandir);
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index b82da190534..bb6125090bd 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1665,11 +1665,13 @@ insert into matest2 (name) values ('Test 3');
 insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
                          QUERY PLAN                         
 ------------------------------------------------------------
  Sort
+   Disabled: true
    Output: matest0.id, matest0.name, ((1 - matest0.id))
    Sort Key: ((1 - matest0.id))
    ->  Result
@@ -1683,7 +1685,7 @@ explain (verbose, costs off) select * from matest0 order by 1-id;
                      Output: matest0_3.id, matest0_3.name
                ->  Seq Scan on public.matest3 matest0_4
                      Output: matest0_4.id, matest0_4.name
-(14 rows)
+(15 rows)
 
 select * from matest0 order by 1-id;
  id |  name  
@@ -1719,6 +1721,7 @@ select min(1-id) from matest0;
 (1 row)
 
 reset enable_indexscan;
+reset enable_sort;
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
 explain (verbose, costs off) select * from matest0 order by 1-id;
@@ -1844,16 +1847,20 @@ order by t1.b limit 10;
          Merge Cond: (t1.b = t2.b)
          ->  Merge Append
                Sort Key: t1.b
-               ->  Index Scan using matest0i on matest0 t1_1
+               ->  Sort
+                     Sort Key: t1_1.b
+                     ->  Seq Scan on matest0 t1_1
                ->  Index Scan using matest1i on matest1 t1_2
          ->  Materialize
                ->  Merge Append
                      Sort Key: t2.b
-                     ->  Index Scan using matest0i on matest0 t2_1
-                           Filter: (c = d)
+                     ->  Sort
+                           Sort Key: t2_1.b
+                           ->  Seq Scan on matest0 t2_1
+                                 Filter: (c = d)
                      ->  Index Scan using matest1i on matest1 t2_2
                            Filter: (c = d)
-(14 rows)
+(18 rows)
 
 reset enable_nestloop;
 drop table matest0 cascade;
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 0431e183783..601b951d1e7 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -65,31 +65,34 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
 -- inner join with partially-redundant join clauses
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
-                          QUERY PLAN                           
----------------------------------------------------------------
- Sort
+                       QUERY PLAN                        
+---------------------------------------------------------
+ Merge Append
    Sort Key: t1.a
-   ->  Append
-         ->  Merge Join
-               Merge Cond: (t1_1.a = t2_1.a)
-               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
-               ->  Sort
-                     Sort Key: t2_1.b
-                     ->  Seq Scan on prt2_p1 t2_1
-                           Filter: (a = b)
+   ->  Merge Join
+         Merge Cond: (t1_1.a = t2_1.a)
+         ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t2_1.b
+               ->  Seq Scan on prt2_p1 t2_1
+                     Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Join
                Hash Cond: (t1_2.a = t2_2.a)
                ->  Seq Scan on prt1_p2 t1_2
                ->  Hash
                      ->  Seq Scan on prt2_p2 t2_2
                            Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Join
                Hash Cond: (t1_3.a = t2_3.a)
                ->  Seq Scan on prt1_p3 t1_3
                ->  Hash
                      ->  Seq Scan on prt2_p3 t2_3
                            Filter: (a = b)
-(22 rows)
+(25 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
  a  |  c   | b  |  c   
@@ -1372,28 +1375,32 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
 -- This should generate a partitionwise join, but currently fails to
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-                        QUERY PLAN                         
------------------------------------------------------------
- Incremental Sort
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Sort
    Sort Key: prt1.a, prt2.b
-   Presorted Key: prt1.a
-   ->  Merge Left Join
-         Merge Cond: (prt1.a = prt2.b)
-         ->  Sort
-               Sort Key: prt1.a
-               ->  Append
-                     ->  Seq Scan on prt1_p1 prt1_1
-                           Filter: ((a < 450) AND (b = 0))
-                     ->  Seq Scan on prt1_p2 prt1_2
-                           Filter: ((a < 450) AND (b = 0))
-         ->  Sort
-               Sort Key: prt2.b
-               ->  Append
+   ->  Merge Right Join
+         Merge Cond: (prt2.b = prt1.a)
+         ->  Append
+               ->  Sort
+                     Sort Key: prt2_1.b
                      ->  Seq Scan on prt2_p2 prt2_1
                            Filter: (b > 250)
+               ->  Sort
+                     Sort Key: prt2_2.b
                      ->  Seq Scan on prt2_p3 prt2_2
                            Filter: (b > 250)
-(19 rows)
+         ->  Materialize
+               ->  Append
+                     ->  Sort
+                           Sort Key: prt1_1.a
+                           ->  Seq Scan on prt1_p1 prt1_1
+                                 Filter: ((a < 450) AND (b = 0))
+                     ->  Sort
+                           Sort Key: prt1_2.a
+                           ->  Seq Scan on prt1_p2 prt1_2
+                                 Filter: ((a < 450) AND (b = 0))
+(23 rows)
 
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
   a  |  b  
@@ -1413,25 +1420,33 @@ SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT *
 -- partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
-                                       QUERY PLAN                                        
------------------------------------------------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Merge Join
-   Merge Cond: ((t1.a = t2.b) AND (((((t1.*)::prt1))::text) = ((((t2.*)::prt2))::text)))
-   ->  Sort
-         Sort Key: t1.a, ((((t1.*)::prt1))::text)
-         ->  Result
-               ->  Append
-                     ->  Seq Scan on prt1_p1 t1_1
-                     ->  Seq Scan on prt1_p2 t1_2
-                     ->  Seq Scan on prt1_p3 t1_3
-   ->  Sort
-         Sort Key: t2.b, ((((t2.*)::prt2))::text)
-         ->  Result
-               ->  Append
+   Merge Cond: (t1.a = t2.b)
+   Join Filter: ((((t2.*)::prt2))::text = (((t1.*)::prt1))::text)
+   ->  Append
+         ->  Sort
+               Sort Key: t1_1.a
+               ->  Seq Scan on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t1_2.a
+               ->  Seq Scan on prt1_p2 t1_2
+         ->  Sort
+               Sort Key: t1_3.a
+               ->  Seq Scan on prt1_p3 t1_3
+   ->  Materialize
+         ->  Append
+               ->  Sort
+                     Sort Key: t2_1.b
                      ->  Seq Scan on prt2_p1 t2_1
+               ->  Sort
+                     Sort Key: t2_2.b
                      ->  Seq Scan on prt2_p2 t2_2
+               ->  Sort
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3
-(16 rows)
+(24 rows)
 
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
  a  | b  
@@ -4497,9 +4512,10 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
                                    QUERY PLAN                                   
 --------------------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a
          ->  Hash Right Join
                Hash Cond: ((t3_1.a = t1_1.a) AND (t3_1.c = t1_1.c))
                ->  Seq Scan on plt1_adv_p1 t3_1
@@ -4510,6 +4526,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p1 t1_1
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Right Join
                Hash Cond: ((t3_2.a = t1_2.a) AND (t3_2.c = t1_2.c))
                ->  Seq Scan on plt1_adv_p2 t3_2
@@ -4520,6 +4538,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p2 t1_2
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Right Join
                Hash Cond: ((t3_3.a = t1_3.a) AND (t3_3.c = t1_3.c))
                ->  Seq Scan on plt1_adv_p3 t3_3
@@ -4530,15 +4550,19 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p3 t1_3
                                        Filter: (b < 10)
-         ->  Nested Loop Left Join
-               Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
-               ->  Nested Loop Left Join
-                     Join Filter: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+   ->  Nested Loop Left Join
+         Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
+         ->  Merge Left Join
+               Merge Cond: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+               ->  Sort
+                     Sort Key: t1_4.a, t1_4.c
                      ->  Seq Scan on plt1_adv_extra t1_4
                            Filter: (b < 10)
+               ->  Sort
+                     Sort Key: t2_4.a, t2_4.c
                      ->  Seq Scan on plt2_adv_extra t2_4
-               ->  Seq Scan on plt1_adv_extra t3_4
-(41 rows)
+         ->  Seq Scan on plt1_adv_extra t3_4
+(50 rows)
 
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
  a  |  c   | a |  c   | a |  c   
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index d1966cd7d82..43e2962009e 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -4768,9 +4768,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc.a ORDER BY part_abc.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_1
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d <= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_1.a
+                           ->  Seq Scan on part_abc_2 part_abc_1
+                                 Filter: ((d <= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: part_abc_3.a
                            Subplans Removed: 1
@@ -4785,9 +4786,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc_5.a ORDER BY part_abc_5.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_6
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d >= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_6.a
+                           ->  Seq Scan on part_abc_2 part_abc_6
+                                 Filter: ((d >= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: a
                            Subplans Removed: 1
@@ -4797,7 +4799,7 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                            ->  Index Scan using part_abc_3_3_a_idx on part_abc_3_3 part_abc_9
                                  Index Cond: (a >= (stable_one() + 1))
                                  Filter: (d >= stable_one())
-(35 rows)
+(37 rows)
 
 drop view part_abc_view;
 drop table part_abc;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index d0977b05f98..e6ba4880b9c 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1207,12 +1207,14 @@ select event_id
 ----------------------------------------------------------
  Merge Append
    Sort Key: events.event_id
-   ->  Index Scan using events_pkey on events
+   ->  Sort
+         Sort Key: events.event_id
+         ->  Seq Scan on events
    ->  Sort
          Sort Key: events_1.event_id
          ->  Seq Scan on events_child events_1
    ->  Index Scan using other_events_pkey on other_events
-(7 rows)
+(9 rows)
 
 drop table events_child, events, other_events;
 reset enable_indexonlyscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 699e8ac09c8..c58beebbd1e 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -641,12 +641,14 @@ insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
 
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
 select * from matest0 order by 1-id;
 explain (verbose, costs off) select min(1-id) from matest0;
 select min(1-id) from matest0;
 reset enable_indexscan;
+reset enable_sort;
 
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
-- 
2.39.5 (Apple Git-154)

#28

Andrei Lepikhov

lepihov@gmail.com

5 months ago

In reply to: Alexander Korotkov (#27)

1 attachment(s)

Re: MergeAppend could consider sorting cheapest child path

On 27/7/2025 00:51, Alexander Korotkov wrote:

On Tue, Jul 22, 2025 at 2:13 PM Andrei Lepikhov <lepihov@gmail.com
I've another idea. cost_tuplesort() puts 2.0 under logarithm to prefer
tuplesort over heapsort. I think we can adjust cost_gather_merge() and
cost_merge_append() to do the same. 0001 patch implements that. I
think the plan changes of 0001 might be reasonable since most cases deal
with small rowsets. One thing concerns me: 0002 still affects one of
the postgres_fdw checks. Could you, please, take a look?

Thanks for the idea!
I analysed your approach a little bit.
Initially, I ran the test script I had created previously [1] and
discovered that on a large scale (1e6 - 1e7 tuples), the planner still
chooses MergeAppend, which contradicts the measured execution times (7190 ms
for Sort+Append and 8450 ms for MergeAppend+Sort).

Attempting to find out the reason, I combined all the costs into a
single formula for each strategy:

MergeAppend+Sort:
total_cost = CO*ntuples*(1 + 2*log(ntuples)) + Ccput*0.5*ntuples + 2*CO*N*log(N) + A
Sort+Append:
total_cost = CO*ntuples*(1 + 2*log(ntuples)) + Ccput*0.5*ntuples + A

Terms:
- A - sum of total costs of underlying subtrees
- CO - cpu_operator_cost
- Ccput - cpu_tuple_cost
- N - number of subpaths (streams)

Given the significant gap in total execution time between these
strategies, I believe it would be reasonable to add a coefficient to the
'ntuples'-dependent term of the equation, so that the cost difference
between one big quicksort and MergeAppend's heapsort stays larger than the
planner's fuzzy factor.

Looking through papers on the value of the constant in quicksort [2] and
heapsort [3], I realised that there is a difference. The constant's
value varies in a wide range: 1.3-1.5 for quicksort and 2-3 for
heapsort. Considering that we should change the current cost model as
little as possible, so as not to break the balance, we may just increase
the constant for the heap sort to maintain a bare minimum gap between the
strategies that falls outside the fuzzy factor. In this case, the merge
append constant should be around 3.8 - 4.0.

With this minor change, we see a shift in the regression tests. Most of
these changes were introduced by the new append strategy. Although I
haven't analysed these changes in depth yet, I believe they are all
related to the small data sets and should fade out on a larger scale.

See this minor correction in the attachment. postgres_fdw tests are
stable now.

[1]: https://github.com/danolivo/conf/blob/main/Scripts/sort-vs-mergeappend-3.sql
[2]: https://arxiv.org/abs/1504.01459

--
regards, Andrei Lepikhov

Attachments:

v9-0001-Sketch.patch (text/plain; charset=UTF-8)

From cca6ed05cf8128a1e88ea07021ba21953cbc1a6b Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Thu, 31 Jul 2025 14:53:08 +0200
Subject: [PATCH v9 1/2] Sketch

---
 src/backend/optimizer/path/costsize.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 344a3188317..c353001c581 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -512,7 +512,7 @@ cost_gather_merge(GatherMergePath *path, PlannerInfo *root,
 	logN = LOG2(N);
 
 	/* Assumed cost per tuple comparison */
-	comparison_cost = 2.0 * cpu_operator_cost;
+	comparison_cost = 3.9 * cpu_operator_cost;
 
 	/* Heap creation cost */
 	startup_cost += comparison_cost * N * logN;
@@ -2474,7 +2474,7 @@ cost_merge_append(Path *path, PlannerInfo *root,
 	logN = LOG2(N);
 
 	/* Assumed cost per tuple comparison */
-	comparison_cost = 2.0 * cpu_operator_cost;
+	comparison_cost = 3.9 * cpu_operator_cost;
 
 	/* Heap creation cost */
 	startup_cost += comparison_cost * N * logN;
-- 
2.50.1

#29

Alexander Korotkov

aekorotkov@gmail.com

4 months ago

In reply to: Andrei Lepikhov (#28)

1 attachment(s)

Re: MergeAppend could consider sorting cheapest child path

On Thu, Jul 31, 2025 at 5:20 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

On 27/7/2025 00:51, Alexander Korotkov wrote:

On Tue, Jul 22, 2025 at 2:13 PM Andrei Lepikhov <lepihov@gmail.com
I've another idea. cost_tuplesort() puts 2.0 under logarithm to prefer
tuplesort over heapsort. I think we can adjust cost_gather_merge() and
cost_merge_append() to do the same. 0001 patch implements that. I
think the plan changes of 0001 might be reasonable since most cases deal
with small rowsets. One thing concerns me: 0002 still affects one of
the postgres_fdw checks. Could you, please, take a look?

Thanks for the idea!
I analysed your approach a little bit.
Initially, I ran the test script I had created previously [1] and
discovered that on a large scale (1e6 - 1e7 tuples), the plan still
defaults to MergeAppend, which deviates from the execution time (7190 ms
for Sort+Append and 8450 ms for MergeAppend+Sort).

Attempting to find out the reason, I combined all the costs into a
single formula for each strategy:

MergeAppend+Sort:
total_cost =CO*ntuples*(1+2*log(ntuples)) + Ccput * 0.5 * ntuples+
2*CO*N*log(N) + A
Sort+Append:
total_cost = CO*ntuples*(1+2*log(ntuples))+ Ccput * 0.5 * ntuples + A

Terms:
- A - sum of total costs of underlying subtrees
- CO - cpu_operator_cost
- Ccput - cpu_tuple_cost
- N - number of subpaths (streams)

Given the significant gap in total execution time between these
strategies, I believe it would be reasonable to introduce a coefficient
to the equation's 'ntuples' variable component that will keep the gap
between big quicksort and MergeAppend's heapsort out of the fuzzy factor
gap.

Discovering papers on the value of constant in quicksort [2] and
heapsort [3], I realised that there is a difference. The constant's
value varies in a wide range: 1.3-1.5 for quicksort and 2-3 for
heapsort. Considering that we should change the current cost model as
little as possible, not to break the balance, we may just increase the
constant value for the heap sort to maintain a bare minimum gap between
strategies out of the fuzzy factor. In this case, the merge append
constant should be around 3.8 - 4.0.

With this minor change, we see a shift in the regression tests. Most of
these changes were introduced by the new append strategy. Although I
haven't analysed these changes in depth yet, I believe they are all
related to the small data sets and should fade out on a larger scale.

See this minor correction in the attachment. postgres_fdw tests are
stable now.

I have another idea. What if we allow MergeAppend paths only when at
least one subpath is preordered? This trick also allows us to exclude
MergeAppend(Sort) dominating Sort(Append). I see that the regression test
changes are now much smaller and look more reasonable. What do
you think?
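
The rule proposed here can be sketched in isolation. This is a simplified illustration, not the patch itself: the function name and the boolean array are hypothetical stand-ins for the per-child pathkeys_contained_in() checks that the patch folds into flags such as total_has_ordered.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: child_is_ordered[i] is true when the path chosen for
 * child i already delivers the required ordering without an explicit Sort.
 * MergeAppend is only worth proposing if at least one child is like that;
 * otherwise every child would need a Sort, and a single Sort over a plain
 * Append covers that case.
 */
static bool
merge_append_allowed(const bool *child_is_ordered, size_t nchildren)
{
    for (size_t i = 0; i < nchildren; i++)
    {
        if (child_is_ordered[i])
            return true;
    }
    return false;
}
```

With this gate, a set of children that all need an explicit Sort never produces a MergeAppend path, so MergeAppend(Sort) cannot win against Sort(Append) on cost estimates alone.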

Also, do you think get_cheapest_fractional_path_for_pathkeys_ext() and
get_cheapest_path_for_pathkeys_ext() should consider incremental sort?
The revised patch teaches them to do so.
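
The choice between a full and an incremental sort can be illustrated with a small standalone sketch. It assumes pathkeys can be modelled as arrays of integer key identifiers, which is a simplification; the type and function names are hypothetical and only loosely mirror what pathkeys_count_contained_in() and the enable_incremental_sort setting do in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    SORT_NONE,                  /* input already satisfies the required order */
    SORT_INCREMENTAL,           /* a leading prefix of the keys is presorted */
    SORT_FULL                   /* nothing usable is presorted */
} SortChoice;

/* Count how many leading required keys the input ordering already provides */
static size_t
count_presorted_keys(const int *required, size_t nrequired,
                     const int *provided, size_t nprovided)
{
    size_t      n = 0;

    while (n < nrequired && n < nprovided && required[n] == provided[n])
        n++;
    return n;
}

/* Decide which kind of sort, if any, has to be costed on top of a path */
static SortChoice
choose_sort(const int *required, size_t nrequired,
            const int *provided, size_t nprovided,
            bool incremental_sort_enabled)
{
    size_t      presorted = count_presorted_keys(required, nrequired,
                                                 provided, nprovided);

    if (presorted == nrequired)
        return SORT_NONE;
    if (incremental_sort_enabled && presorted > 0)
        return SORT_INCREMENTAL;
    return SORT_FULL;
}
```

For example, a path ordered by key 1 that must be ordered by keys (1, 2) gets an incremental sort when that is enabled and a full sort otherwise.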

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v10-0001-Consider-an-explicit-sort-of-the-MergeAppend-sub.patch (application/octet-stream)

From 46e26346a159c6881f9484b80f274e8417396e66 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Tue, 3 Jun 2025 11:37:23 +0200
Subject: [PATCH v10] Consider an explicit sort of the MergeAppend subpaths.

Broaden the optimiser's search scope slightly: when retrieving optimal subpaths
that match pathkeys for the planning MergeAppend, also consider the case of
an overall optimal path that includes an explicit Sort node at the top.

It may provide a more effective plan in both full and fractional scan cases:
1. The Sort node may be pushed down to subpaths under a parallel or async Append.
2. The case when a minor set of subpaths doesn't have a proper index, and it
is profitable to sort them instead of switching to plain Append.

Having implemented that strategy, it became clear that the cost of multiple
small sortings merged by a single MergeAppend node exceeds that of a single Sort
operation over a plain Append. The code and benchmarks demonstrate that such
an assumption is incorrect because the Sort operator has optimisations that work
faster than a MergeAppend.
To arrange the cost model, change the merge cost multiplier, considering that
heap rebuilding needs two comparison operations.
---
 .../postgres_fdw/expected/postgres_fdw.out    |   6 +-
 src/backend/optimizer/path/allpaths.c         |  74 ++++---
 src/backend/optimizer/path/pathkeys.c         | 183 ++++++++++++++++++
 src/include/optimizer/paths.h                 |  10 +
 src/test/regress/expected/inherit.out         |  27 +--
 src/test/regress/expected/partition_join.out  | 124 +++++++-----
 src/test/regress/expected/partition_prune.out |  16 +-
 src/test/regress/expected/union.out           |   6 +-
 src/test/regress/sql/inherit.sql              |   4 +-
 9 files changed, 337 insertions(+), 113 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index d3323b04676..ba924bb3cbd 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10486,13 +10486,15 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
    ->  Nested Loop
          Join Filter: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t1_1.a
+                     ->  Foreign Scan on ftprt1_p1 t1_1
                ->  Foreign Scan on ftprt1_p2 t1_2
          ->  Materialize
                ->  Append
                      ->  Foreign Scan on ftprt2_p1 t2_1
                      ->  Foreign Scan on ftprt2_p2 t2_2
-(10 rows)
+(12 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
   a  |  b  
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 6cc6966b060..35c85d7be97 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1792,6 +1792,9 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		List	   *fractional_subpaths = NIL;
 		bool		startup_neq_total = false;
+		bool		total_has_ordered = false;
+		bool		startup_has_ordered = false;
+		bool		fractional_has_ordered = false;
 		bool		match_partition_order;
 		bool		match_partition_order_desc;
 		int			end_index;
@@ -1855,29 +1858,24 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 
 			/* Locate the right paths, if they are available. */
 			cheapest_startup =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   STARTUP_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, STARTUP_COST, false);
 			cheapest_total =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   TOTAL_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, TOTAL_COST, false);
+
+			if (pathkeys_contained_in(pathkeys, cheapest_startup->pathkeys))
+				startup_has_ordered = true;
+
+			if (pathkeys_contained_in(pathkeys, cheapest_total->pathkeys))
+				total_has_ordered = true;
 
 			/*
-			 * If we can't find any paths with the right order just use the
-			 * cheapest-total path; we'll have to sort it later.
+			 * In accordance to current planning logic there are no
+			 * parameterised paths under a merge append.
 			 */
-			if (cheapest_startup == NULL || cheapest_total == NULL)
-			{
-				cheapest_startup = cheapest_total =
-					childrel->cheapest_total_path;
-				/* Assert we do have an unparameterized path for this child */
-				Assert(cheapest_total->param_info == NULL);
-			}
+			Assert(cheapest_startup != NULL && cheapest_total != NULL);
+			Assert(cheapest_total->param_info == NULL);
 
 			/*
 			 * When building a fractional path, determine a cheapest
@@ -1904,21 +1902,20 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 					path_fraction /= childrel->rows;
 
 				cheapest_fractional =
-					get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
-															  pathkeys,
-															  NULL,
-															  path_fraction);
+					get_cheapest_fractional_path_for_pathkeys_ext(root,
+																  childrel,
+																  pathkeys,
+																  NULL,
+																  path_fraction);
+
+				if (pathkeys_contained_in(pathkeys, cheapest_fractional->pathkeys))
+					fractional_has_ordered = true;
 
 				/*
-				 * If we found no path with matching pathkeys, use the
-				 * cheapest total path instead.
-				 *
-				 * XXX We might consider partially sorted paths too (with an
-				 * incremental sort on top). But we'd have to build all the
-				 * incremental paths, do the costing etc.
+				 * In accordance to current planning logic there are no
+				 * parameterised fractional paths under a merge append.
 				 */
-				if (!cheapest_fractional)
-					cheapest_fractional = cheapest_total;
+				Assert(cheapest_fractional != NULL);
 			}
 
 			/*
@@ -2009,19 +2006,20 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		else
 		{
 			/* We need MergeAppend */
-			add_path(rel, (Path *) create_merge_append_path(root,
-															rel,
-															startup_subpaths,
-															pathkeys,
-															NULL));
-			if (startup_neq_total)
+			if (total_has_ordered)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																startup_subpaths,
+																pathkeys,
+																NULL));
+			if (startup_neq_total && startup_has_ordered)
 				add_path(rel, (Path *) create_merge_append_path(root,
 																rel,
 																total_subpaths,
 																pathkeys,
 																NULL));
 
-			if (fractional_subpaths)
+			if (fractional_subpaths && fractional_has_ordered)
 				add_path(rel, (Path *) create_merge_append_path(root,
 																rel,
 																fractional_subpaths,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 8b04d40d36d..0eb618f304c 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -19,11 +19,13 @@
 
 #include "access/stratnum.h"
 #include "catalog/pg_opfamily.h"
+#include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "optimizer/planner.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -648,6 +650,98 @@ get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_path_for_pathkeys_ext
+ *	  Calls get_cheapest_path_for_pathkeys to obtain cheapest path that
+ *	  satisfies defined criteria and considers one more option: choose
+ *	  overall-optimal path (according to the criterion) and explicitly sort its
+ *	  output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_path_for_pathkeys_ext(PlannerInfo *root, RelOptInfo *rel,
+								   List *pathkeys, Relids required_outer,
+								   CostSelector cost_criterion,
+								   bool require_parallel_safe)
+{
+	Path		sort_path;
+	Path	   *base_path = rel->cheapest_total_path;
+	Path	   *path;
+
+	/* In generate_orderedappend_paths() all childrels do have some paths */
+	Assert(base_path);
+
+	path = get_cheapest_path_for_pathkeys(rel->pathlist, pathkeys,
+										  required_outer, cost_criterion,
+										  require_parallel_safe);
+
+	/*
+	 * Stop here if the cheapest total path doesn't satisfy necessary
+	 * conditions
+	 */
+	if ((require_parallel_safe && !base_path->parallel_safe) ||
+		!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path == NULL)
+
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return base_path;
+
+	/* Consider the cheapest total path with extra sort */
+	if (path != base_path)
+	{
+		int			presorted_keys;
+
+		if (!pathkeys_count_contained_in(pathkeys, base_path->pathkeys,
+										 &presorted_keys))
+		{
+			/*
+			 * We'll need to insert a Sort node, so include costs for that.
+			 * We choose to use incremental sort if it is enabled and there
+			 * are presorted keys; otherwise we use full sort.
+			 *
+			 * We can use the parent's LIMIT if any, since we certainly won't
+			 * pull more than that many tuples from any child.
+			 */
+			if (enable_incremental_sort && presorted_keys > 0)
+			{
+				cost_incremental_sort(&sort_path, root, pathkeys,
+									  presorted_keys,
+									  base_path->disabled_nodes,
+									  base_path->startup_cost,
+									  base_path->total_cost, base_path->rows,
+									  base_path->pathtarget->width, 0.0,
+									  work_mem, -1.0);
+			}
+			else
+			{
+				cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+						  base_path->total_cost, base_path->rows,
+						  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+			}
+		}
+		else
+		{
+			sort_path.rows = base_path->rows;
+			sort_path.disabled_nodes = base_path->disabled_nodes;
+			sort_path.startup_cost = base_path->startup_cost;
+			sort_path.total_cost = base_path->total_cost;
+		}
+
+		if (compare_path_costs(&sort_path, path, cost_criterion) < 0)
+			return base_path;
+	}
+	return path;
+}
+
 /*
  * get_cheapest_fractional_path_for_pathkeys
  *	  Find the cheapest path (for retrieving a specified fraction of all
@@ -690,6 +784,95 @@ get_cheapest_fractional_path_for_pathkeys(List *paths,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_fractional_path_for_pathkeys_ext
+ *	  Obtain the cheapest fractional path that satisfies the given criteria,
+ *	  excluding pathkeys, and explicitly sort its output to satisfy the pathkeys.
+ *
+ *	  The caller is responsible for inserting the corresponding sort path on
+ *	  top of the returned path if it is chosen.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  List *pathkeys,
+											  Relids required_outer,
+											  double fraction)
+{
+	Path		sort_path;
+	Path	   *base_path = rel->cheapest_total_path;
+	Path	   *path;
+
+	/* In generate_orderedappend_paths() all childrels do have some paths */
+	Assert(base_path);
+
+	path = get_cheapest_fractional_path_for_pathkeys(rel->pathlist, pathkeys,
+													 required_outer, fraction);
+
+	/*
+	 * Stop here if the cheapest total path doesn't satisfy necessary
+	 * conditions
+	 */
+	if (!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path == NULL)
+
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return base_path;
+
+	/* Consider the cheapest total path with extra sort */
+	if (path != base_path)
+	{
+		int			presorted_keys;
+
+		if (!pathkeys_count_contained_in(pathkeys, base_path->pathkeys,
+										 &presorted_keys))
+		{
+			/*
+			 * We'll need to insert a Sort node, so include costs for that.
+			 * We choose to use incremental sort if it is enabled and there
+			 * are presorted keys; otherwise we use full sort.
+			 *
+			 * We can use the parent's LIMIT if any, since we certainly won't
+			 * pull more than that many tuples from any child.
+			 */
+			if (enable_incremental_sort && presorted_keys > 0)
+			{
+				cost_incremental_sort(&sort_path, root, pathkeys,
+									  presorted_keys,
+									  base_path->disabled_nodes,
+									  base_path->startup_cost,
+									  base_path->total_cost, base_path->rows,
+									  base_path->pathtarget->width, 0.0,
+									  work_mem, -1.0);
+			}
+			else
+			{
+				cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+						  base_path->total_cost, base_path->rows,
+						  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+			}
+		}
+		else
+		{
+			sort_path.rows = base_path->rows;
+			sort_path.disabled_nodes = base_path->disabled_nodes;
+			sort_path.startup_cost = base_path->startup_cost;
+			sort_path.total_cost = base_path->total_cost;
+		}
+
+		if (compare_fractional_path_costs(&sort_path, path, fraction) <= 0)
+			return base_path;
+	}
+
+	return path;
+}
 
 /*
  * get_cheapest_parallel_safe_total_inner
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index cbade77b717..fd7f6f115b3 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -219,10 +219,20 @@ extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
 											bool require_parallel_safe);
+extern Path *get_cheapest_path_for_pathkeys_ext(PlannerInfo *root,
+												RelOptInfo *rel, List *pathkeys,
+												Relids required_outer,
+												CostSelector cost_criterion,
+												bool require_parallel_safe);
 extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 													   List *pathkeys,
 													   Relids required_outer,
 													   double fraction);
+extern Path *get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+														   RelOptInfo *rel,
+														   List *pathkeys,
+														   Relids required_outer,
+														   double fraction);
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 								  ScanDirection scandir);
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 5b5055babdc..a00d606d9e7 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1665,11 +1665,13 @@ insert into matest2 (name) values ('Test 3');
 insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
                          QUERY PLAN                         
 ------------------------------------------------------------
  Sort
+   Disabled: true
    Output: matest0.id, matest0.name, ((1 - matest0.id))
    Sort Key: ((1 - matest0.id))
    ->  Result
@@ -1683,7 +1685,7 @@ explain (verbose, costs off) select * from matest0 order by 1-id;
                      Output: matest0_3.id, matest0_3.name
                ->  Seq Scan on public.matest3 matest0_4
                      Output: matest0_4.id, matest0_4.name
-(14 rows)
+(15 rows)
 
 select * from matest0 order by 1-id;
  id |  name  
@@ -1719,6 +1721,7 @@ select min(1-id) from matest0;
 (1 row)
 
 reset enable_indexscan;
+reset enable_sort;
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
 explain (verbose, costs off) select * from matest0 order by 1-id;
@@ -1837,23 +1840,25 @@ explain (costs off)
 select t1.* from matest0 t1, matest0 t2
 where t1.b = t2.b and t2.c = t2.d
 order by t1.b limit 10;
-                            QUERY PLAN                             
--------------------------------------------------------------------
+                         QUERY PLAN                          
+-------------------------------------------------------------
  Limit
    ->  Merge Join
          Merge Cond: (t1.b = t2.b)
          ->  Merge Append
                Sort Key: t1.b
-               ->  Index Scan using matest0i on matest0 t1_1
+               ->  Sort
+                     Sort Key: t1_1.b
+                     ->  Seq Scan on matest0 t1_1
                ->  Index Scan using matest1i on matest1 t1_2
-         ->  Materialize
-               ->  Merge Append
-                     Sort Key: t2.b
-                     ->  Index Scan using matest0i on matest0 t2_1
+         ->  Sort
+               Sort Key: t2.b
+               ->  Append
+                     ->  Seq Scan on matest0 t2_1
                            Filter: (c = d)
-                     ->  Index Scan using matest1i on matest1 t2_2
+                     ->  Seq Scan on matest1 t2_2
                            Filter: (c = d)
-(14 rows)
+(16 rows)
 
 reset enable_nestloop;
 drop table matest0 cascade;
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 24e06845f92..00e4a805b93 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -65,31 +65,34 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
 -- inner join with partially-redundant join clauses
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
-                          QUERY PLAN                           
----------------------------------------------------------------
- Sort
+                       QUERY PLAN                        
+---------------------------------------------------------
+ Merge Append
    Sort Key: t1.a
-   ->  Append
-         ->  Merge Join
-               Merge Cond: (t1_1.a = t2_1.a)
-               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
-               ->  Sort
-                     Sort Key: t2_1.b
-                     ->  Seq Scan on prt2_p1 t2_1
-                           Filter: (a = b)
+   ->  Merge Join
+         Merge Cond: (t1_1.a = t2_1.a)
+         ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t2_1.b
+               ->  Seq Scan on prt2_p1 t2_1
+                     Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Join
                Hash Cond: (t1_2.a = t2_2.a)
                ->  Seq Scan on prt1_p2 t1_2
                ->  Hash
                      ->  Seq Scan on prt2_p2 t2_2
                            Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Join
                Hash Cond: (t1_3.a = t2_3.a)
                ->  Seq Scan on prt1_p3 t1_3
                ->  Hash
                      ->  Seq Scan on prt2_p3 t2_3
                            Filter: (a = b)
-(22 rows)
+(25 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
  a  |  c   | b  |  c   
@@ -371,9 +374,10 @@ EXPLAIN (COSTS OFF)
 SELECT t1.* FROM prt1 t1 WHERE t1.a IN (SELECT t2.b FROM prt2 t2 WHERE t2.a = 0) AND t1.b = 0 ORDER BY t1.a;
                     QUERY PLAN                    
 --------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a
          ->  Hash Semi Join
                Hash Cond: (t1_1.a = t2_1.b)
                ->  Seq Scan on prt1_p1 t1_1
@@ -381,6 +385,8 @@ SELECT t1.* FROM prt1 t1 WHERE t1.a IN (SELECT t2.b FROM prt2 t2 WHERE t2.a = 0)
                ->  Hash
                      ->  Seq Scan on prt2_p1 t2_1
                            Filter: (a = 0)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Semi Join
                Hash Cond: (t1_2.a = t2_2.b)
                ->  Seq Scan on prt1_p2 t1_2
@@ -388,14 +394,16 @@ SELECT t1.* FROM prt1 t1 WHERE t1.a IN (SELECT t2.b FROM prt2 t2 WHERE t2.a = 0)
                ->  Hash
                      ->  Seq Scan on prt2_p2 t2_2
                            Filter: (a = 0)
-         ->  Nested Loop Semi Join
-               Join Filter: (t1_3.a = t2_3.b)
-               ->  Seq Scan on prt1_p3 t1_3
-                     Filter: (b = 0)
-               ->  Materialize
+   ->  Nested Loop
+         Join Filter: (t1_3.a = t2_3.b)
+         ->  Unique
+               ->  Sort
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3
                            Filter: (a = 0)
-(24 rows)
+         ->  Seq Scan on prt1_p3 t1_3
+               Filter: (b = 0)
+(29 rows)
 
 SELECT t1.* FROM prt1 t1 WHERE t1.a IN (SELECT t2.b FROM prt2 t2 WHERE t2.a = 0) AND t1.b = 0 ORDER BY t1.a;
   a  | b |  c   
@@ -1387,28 +1395,32 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
 -- This should generate a partitionwise join, but currently fails to
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-                        QUERY PLAN                         
------------------------------------------------------------
- Incremental Sort
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Sort
    Sort Key: prt1.a, prt2.b
-   Presorted Key: prt1.a
-   ->  Merge Left Join
-         Merge Cond: (prt1.a = prt2.b)
-         ->  Sort
-               Sort Key: prt1.a
-               ->  Append
-                     ->  Seq Scan on prt1_p1 prt1_1
-                           Filter: ((a < 450) AND (b = 0))
-                     ->  Seq Scan on prt1_p2 prt1_2
-                           Filter: ((a < 450) AND (b = 0))
-         ->  Sort
-               Sort Key: prt2.b
-               ->  Append
+   ->  Merge Right Join
+         Merge Cond: (prt2.b = prt1.a)
+         ->  Append
+               ->  Sort
+                     Sort Key: prt2_1.b
                      ->  Seq Scan on prt2_p2 prt2_1
                            Filter: (b > 250)
+               ->  Sort
+                     Sort Key: prt2_2.b
                      ->  Seq Scan on prt2_p3 prt2_2
                            Filter: (b > 250)
-(19 rows)
+         ->  Materialize
+               ->  Append
+                     ->  Sort
+                           Sort Key: prt1_1.a
+                           ->  Seq Scan on prt1_p1 prt1_1
+                                 Filter: ((a < 450) AND (b = 0))
+                     ->  Sort
+                           Sort Key: prt1_2.a
+                           ->  Seq Scan on prt1_p2 prt1_2
+                                 Filter: ((a < 450) AND (b = 0))
+(23 rows)
 
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
   a  |  b  
@@ -1428,25 +1440,33 @@ SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT *
 -- partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
-                                       QUERY PLAN                                        
------------------------------------------------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Merge Join
-   Merge Cond: ((t1.a = t2.b) AND (((((t1.*)::prt1))::text) = ((((t2.*)::prt2))::text)))
-   ->  Sort
-         Sort Key: t1.a, ((((t1.*)::prt1))::text)
-         ->  Result
-               ->  Append
-                     ->  Seq Scan on prt1_p1 t1_1
-                     ->  Seq Scan on prt1_p2 t1_2
-                     ->  Seq Scan on prt1_p3 t1_3
-   ->  Sort
-         Sort Key: t2.b, ((((t2.*)::prt2))::text)
-         ->  Result
-               ->  Append
+   Merge Cond: (t1.a = t2.b)
+   Join Filter: ((((t2.*)::prt2))::text = (((t1.*)::prt1))::text)
+   ->  Append
+         ->  Sort
+               Sort Key: t1_1.a
+               ->  Seq Scan on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t1_2.a
+               ->  Seq Scan on prt1_p2 t1_2
+         ->  Sort
+               Sort Key: t1_3.a
+               ->  Seq Scan on prt1_p3 t1_3
+   ->  Materialize
+         ->  Append
+               ->  Sort
+                     Sort Key: t2_1.b
                      ->  Seq Scan on prt2_p1 t2_1
+               ->  Sort
+                     Sort Key: t2_2.b
                      ->  Seq Scan on prt2_p2 t2_2
+               ->  Sort
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3
-(16 rows)
+(24 rows)
 
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
  a  | b  
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index d1966cd7d82..43e2962009e 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -4768,9 +4768,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc.a ORDER BY part_abc.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_1
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d <= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_1.a
+                           ->  Seq Scan on part_abc_2 part_abc_1
+                                 Filter: ((d <= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: part_abc_3.a
                            Subplans Removed: 1
@@ -4785,9 +4786,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc_5.a ORDER BY part_abc_5.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_6
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d >= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_6.a
+                           ->  Seq Scan on part_abc_2 part_abc_6
+                                 Filter: ((d >= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: a
                            Subplans Removed: 1
@@ -4797,7 +4799,7 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                            ->  Index Scan using part_abc_3_3_a_idx on part_abc_3_3 part_abc_9
                                  Index Cond: (a >= (stable_one() + 1))
                                  Filter: (d >= stable_one())
-(35 rows)
+(37 rows)
 
 drop view part_abc_view;
 drop table part_abc;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 96962817ed4..0ccadea910c 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1207,12 +1207,14 @@ select event_id
 ----------------------------------------------------------
  Merge Append
    Sort Key: events.event_id
-   ->  Index Scan using events_pkey on events
+   ->  Sort
+         Sort Key: events.event_id
+         ->  Seq Scan on events
    ->  Sort
          Sort Key: events_1.event_id
          ->  Seq Scan on events_child events_1
    ->  Index Scan using other_events_pkey on other_events
-(7 rows)
+(9 rows)
 
 drop table events_child, events, other_events;
 reset enable_indexonlyscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 699e8ac09c8..c58beebbd1e 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -641,12 +641,14 @@ insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
 
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
 select * from matest0 order by 1-id;
 explain (verbose, costs off) select min(1-id) from matest0;
 select min(1-id) from matest0;
 reset enable_indexscan;
+reset enable_sort;
 
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
-- 
2.39.5 (Apple Git-154)

#30

Richard Guo

guofenglinux@gmail.com

4 months ago

In reply to: Alexander Korotkov (#29)

Re: MergeAppend could consider sorting cheapest child path

On Tue, Sep 2, 2025 at 5:26 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

I have another idea. What if we allow MergeAppend paths only when at
least one subpath is preordered? This trick also allows us to exclude
MergeAppend(Sort) dominating Sort(Append). I see the regression test
changes are now much smaller and look more reasonable. What do
you think?
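
To make the proposed rule concrete, here is a minimal standalone sketch in C. The ChildPath struct and the function name are hypothetical illustrations, not planner code; the patch itself tracks this with flags such as total_has_ordered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a child subpath (not PostgreSQL's Path). */
typedef struct ChildPath
{
	double		total_cost;
	bool		presorted;		/* child already delivers the required order */
} ChildPath;

/*
 * Return true only if at least one child is already ordered.  When every
 * child would need an explicit Sort, MergeAppend(Sort) is not offered at
 * all, so it cannot dominate Sort(Append).
 */
bool
merge_append_worth_considering(const ChildPath *children, size_t nchildren)
{
	assert(children != NULL || nchildren == 0);

	for (size_t i = 0; i < nchildren; i++)
	{
		if (children[i].presorted)
			return true;
	}
	return false;
}
```

Under this assumption, a set of children that all need an explicit Sort yields false, while a mix of presorted and unsorted children yields true.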

I skimmed through the test case changes, and I'm not sure all of them
are actual improvements. For example:

          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t1_1.a
+                     ->  Foreign Scan on ftprt1_p1 t1_1
                ->  Foreign Scan on ftprt1_p2 t1_2

It seems that this patch moves the sort operation for ftprt1_p1 from
the remote server to local. I'm not sure if this is an improvement,
or why it applies only to ftprt1_p1 and not to ftprt1_p2 (they have
very similar statistics).

Besides, I noticed that some plans have changed from an "Index Scan
with Index Cond" to a "Seq Scan with Filter + Sort". I'm also not
sure whether this change results in better performance.

- Richard

#31

Andrei Lepikhov

lepihov@gmail.com

4 months ago

In reply to: Richard Guo (#30)

1 attachment(s)

Re: MergeAppend could consider sorting cheapest child path

On 2/9/2025 03:27, Richard Guo wrote:

On Tue, Sep 2, 2025 at 5:26 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

I have another idea. What if we allow MergeAppend paths only when at
least one subpath is preordered? This trick also allows us to exclude
MergeAppend(Sort) dominating Sort(Append). I see the regression test
changes are now much smaller and look more reasonable. What do
you think?

I skimmed through the test case changes, and I'm not sure all of them
are actual improvements. For example:
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t1_1.a
+                     ->  Foreign Scan on ftprt1_p1 t1_1
                ->  Foreign Scan on ftprt1_p2 t1_2
It seems that this patch moves the sort operation for ftprt1_p1 from
the remote server to local. I'm not sure if this is an improvement,
or why it applies only to ftprt1_p1 and not to ftprt1_p2 (they have
very similar statistics).

I had a look at this case. Here is what happens.
Initially, within generate_orderedappend_paths, the planner creates an
Append according to the 'match_partition_order' strategy, which
dominates the others.
Next, the pathlists of 'Foreign Scan on ftprt1_p1' and 'Foreign Scan on
ftprt1_p2' differ: the first one contains two paths:
1. startup_cost: 100.000, total_cost: 103.090, pathkeys: false
2. startup_cost: 102.880, total_cost: 103.110, pathkeys: true

And the second subpath has only one option to scan:
startup_cost: 100.000, total_cost: 103.660, pathkeys: true

Before, the optimiser always chose the path with pathkeys. However, this
patch attempts to do its best by comparing ForeignScan+Sort with the
presorted ForeignScan.
Comparing the total-cost path with an explicit Sort against the presorted
one, we have:
- ForeignScan+Sort: startup_cost: 103.100, total_cost: 103.105
- Presorted: startup_cost: 102.880, total_cost: 103.110
And here is the issue: a difference in the third digit after the decimal
point. Let's check the remote estimations with and without Sort:

With:
LockRows (cost=2.88..2.90 rows=1 width=25)
-> Sort (cost=2.88..2.89 rows=1 width=25)
Sort Key: t1.a
-> Seq Scan on public.fprt1_p1 t1 (cost=0.00..2.88 ...

Without:
LockRows (cost=0.00..2.88 rows=1 width=25)
-> Seq Scan on public.fprt1_p1 t1 (cost=0.00..2.88 ...

As you can see, according to these estimations, LockRows costs nothing
without sorting and 0.01 with Sort. So the fluctuation comes from
EXPLAIN's rounding.
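
The flip can be reproduced with a small standalone sketch. It assumes a comparison on total cost with startup cost as the tie-breaker, loosely modelled on what compare_path_costs() does for TOTAL_COST. The CostPair struct and the function name are made up for illustration; the real function works on Path structs and looks at more than these two fields.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pair of cost estimates; not PostgreSQL's Path struct. */
typedef struct CostPair
{
	double		startup_cost;
	double		total_cost;
} CostPair;

/*
 * Return -1 if a is cheaper, +1 if b is cheaper, 0 if they tie.
 * Total cost decides first; startup cost only breaks ties.
 */
int
compare_total_then_startup(const CostPair *a, const CostPair *b)
{
	assert(a != NULL && b != NULL);

	if (a->total_cost < b->total_cost)
		return -1;
	if (a->total_cost > b->total_cost)
		return +1;
	if (a->startup_cost < b->startup_cost)
		return -1;
	if (a->startup_cost > b->startup_cost)
		return +1;
	return 0;
}
```

With the numbers above, the explicit Sort wins by 0.005 on total cost; raising its total by a single rounding step of 0.01 would reverse the outcome, which is why the plan is so fragile.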

What to do? First, we could do nothing and just update the expected
output. But I don't like unstable tests. We could adjust the query
slightly to increase the estimations, or improve the estimation using
extended statistics. I prefer the more elegant variant with extended
statistics. See the attachment for a sketch of how to stabilise the
output. With this patch applied before this feature, the test output
stays the same.

Besides, I noticed that some plans have changed from an "Index Scan
with Index Cond" to a "Seq Scan with Filter + Sort". I'm also not
sure whether this change results in better performance.

As you know, according to the cost model, SeqScan looks better on scans
of tiny tables and on full scans. I haven't delved into these cases as
deeply as into the previous one yet, but it's clear that we're still
seeing the issue with tiny tables.

--
regards, Andrei Lepikhov

Attachments:

extstat.diff (text/plain; charset=UTF-8)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index d3323b04676..cad0f35801e 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10478,21 +10478,25 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
 (14 rows)
 
 -- test FOR UPDATE; partitionwise join does not apply
+CREATE STATISTICS stat1 ON (a % 25) FROM fprt1_p1;
+ANALYZE fprt1_p1;
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
-                       QUERY PLAN                       
---------------------------------------------------------
+                          QUERY PLAN                          
+--------------------------------------------------------------
  LockRows
-   ->  Nested Loop
-         Join Filter: (t1.a = t2.b)
-         ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
-               ->  Foreign Scan on ftprt1_p2 t1_2
-         ->  Materialize
+   ->  Sort
+         Sort Key: t1.a
+         ->  Hash Join
+               Hash Cond: (t2.b = t1.a)
                ->  Append
                      ->  Foreign Scan on ftprt2_p1 t2_1
                      ->  Foreign Scan on ftprt2_p2 t2_2
-(10 rows)
+               ->  Hash
+                     ->  Append
+                           ->  Foreign Scan on ftprt1_p1 t1_1
+                           ->  Foreign Scan on ftprt1_p2 t1_2
+(12 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
   a  |  b  
@@ -10503,6 +10507,7 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
  400 | 400
 (4 rows)
 
+DROP STATISTICS stat1;
 RESET enable_partitionwise_join;
 -- ===================================================================
 -- test partitionwise aggregates
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 2c609e060b7..fbcec9dfb71 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3308,9 +3308,12 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
 SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE a % 25 = 0) t1 FULL JOIN (SELECT 't2_phv' phv, * FROM fprt2 WHERE b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY t1.a, t2.b;
 
 -- test FOR UPDATE; partitionwise join does not apply
+CREATE STATISTICS stat1 ON (a % 25) FROM fprt1_p1;
+ANALYZE fprt1_p1;
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
+DROP STATISTICS stat1;
 
 RESET enable_partitionwise_join;

#32

Andrei Lepikhov

lepihov@gmail.com

4 months ago

In reply to: Alexander Korotkov (#29)

1 attachment(s)

Re: MergeAppend could consider sorting cheapest child path

On 1/9/2025 22:26, Alexander Korotkov wrote:

On Thu, Jul 31, 2025 at 5:20 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

See this minor correction in the attachment. postgres_fdw tests are
stable now.

I have another idea. What if we allow MergeAppend paths only when at
least one subpath is preordered? This trick also allows us to exclude
MergeAppend(Sort) dominating Sort(Append). I see the regression test
changes are now much smaller and look more reasonable. What do
you think?

I believe a slight mistake has been made with the total_has_ordered /
startup_has_ordered parameters, which has caused unnecessary test
changes in inherit.out (see the updated patch in the attachment). Although
not the best test in general (it depends on autovacuum), it
highlighted the case where a startup-optimal strategy is necessary, even
when a fractional-optimal path is available, which may lead to a
continuation of the discussion [1].

Also, do you think get_cheapest_fractional_path_for_pathkeys_ext() and
get_cheapest_path_for_pathkeys_ext() should consider incremental sort?
The revised patch teaches them to do so.
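
As a rough illustration of that decision, the sketch below picks a sort strategy from the number of presorted keys. It mirrors the branch structure visible in the patch (no sort when the required pathkeys are already contained, incremental sort when it is enabled and some leading keys are presorted, full sort otherwise), but the enum and function names are hypothetical and no costing is done here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three outcomes; not PostgreSQL identifiers. */
typedef enum SortChoice
{
	SORT_NOT_NEEDED,
	SORT_INCREMENTAL,
	SORT_FULL
} SortChoice;

/*
 * required_keys:  number of pathkeys the MergeAppend needs
 * presorted_keys: how many leading keys the child path already provides
 * incremental_ok: stand-in for the enable_incremental_sort setting
 */
SortChoice
choose_child_sort(int required_keys, int presorted_keys, bool incremental_ok)
{
	assert(required_keys >= 0);
	assert(presorted_keys >= 0 && presorted_keys <= required_keys);

	if (presorted_keys == required_keys)
		return SORT_NOT_NEEDED;		/* child order already satisfies the keys */
	if (incremental_ok && presorted_keys > 0)
		return SORT_INCREMENTAL;	/* reuse the presorted prefix */
	return SORT_FULL;
}
```

For example, a child presorted on one of two required keys would get an incremental sort when the setting is on and a full sort when it is off.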

Following 55a780e9476 [2], it should be considered, of course.

[1]: /messages/by-id/CAPpHfduicoMCJ0b0mMvMQ5KVqLimJ7pKdxajciSF+P7JF31v+A@mail.gmail.com
[2]: /messages/by-id/CAMbWs49wSNPPD=FOQqzjPNZ_N9EGDv=7-ou0dFgd0HSwP3fTAg@mail.gmail.com

--
regards, Andrei Lepikhov

Attachments:

v11-0001-Consider-an-explicit-sort-of-the-MergeAppend-sub.patch (text/plain; charset=UTF-8)

From 4451c69ee8bbdfd1f5973e822b35b17266fa0995 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Tue, 3 Jun 2025 11:37:23 +0200
Subject: [PATCH v11] Consider an explicit sort of the MergeAppend subpaths.

Broaden the optimiser's search scope slightly: when retrieving optimal subpaths
that match pathkeys for the planning MergeAppend, also consider the case of
an overall optimal path that includes an explicit Sort node at the top.

It may provide a more effective plan in both full and fractional scan cases:
1. The Sort node may be pushed down to subpaths under a parallel or async Append.
2. The case when a minor set of subpaths doesn't have a proper index, and it
is profitable to sort them instead of switching to plain Append.

Having implemented that strategy, it became clear that the cost of multiple
small sortings merged by a single MergeAppend node exceeds that of a single Sort
operation over a plain Append. The code and benchmarks demonstrate that such
an assumption is incorrect because the Sort operator has optimisations that work
faster than a MergeAppend.
To arrange the cost model, change the merge cost multiplier, considering that
heap rebuilding needs two comparison operations.
---
 .../postgres_fdw/expected/postgres_fdw.out    |   6 +-
 src/backend/optimizer/path/allpaths.c         |  74 ++++---
 src/backend/optimizer/path/pathkeys.c         | 183 ++++++++++++++++++
 src/include/optimizer/paths.h                 |  10 +
 src/test/regress/expected/inherit.out         |  19 +-
 src/test/regress/expected/partition_join.out  | 149 ++++++++------
 src/test/regress/expected/partition_prune.out |  16 +-
 src/test/regress/expected/union.out           |   6 +-
 src/test/regress/sql/inherit.sql              |   4 +-
 9 files changed, 351 insertions(+), 116 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 78b8367d289..beaa9df7024 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10474,13 +10474,15 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
    ->  Nested Loop
          Join Filter: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t1_1.a
+                     ->  Foreign Scan on ftprt1_p1 t1_1
                ->  Foreign Scan on ftprt1_p2 t1_2
          ->  Materialize
                ->  Append
                      ->  Foreign Scan on ftprt2_p1 t2_1
                      ->  Foreign Scan on ftprt2_p2 t2_2
-(10 rows)
+(12 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
   a  |  b  
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 6cc6966b060..8a16d9200b7 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1792,6 +1792,9 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		List	   *fractional_subpaths = NIL;
 		bool		startup_neq_total = false;
+		bool		total_has_ordered = false;
+		bool		startup_has_ordered = false;
+		bool		fractional_has_ordered = false;
 		bool		match_partition_order;
 		bool		match_partition_order_desc;
 		int			end_index;
@@ -1855,29 +1858,24 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 
 			/* Locate the right paths, if they are available. */
 			cheapest_startup =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   STARTUP_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, STARTUP_COST, false);
 			cheapest_total =
-				get_cheapest_path_for_pathkeys(childrel->pathlist,
-											   pathkeys,
-											   NULL,
-											   TOTAL_COST,
-											   false);
+				get_cheapest_path_for_pathkeys_ext(root, childrel, pathkeys,
+												   NULL, TOTAL_COST, false);
+
+			if (pathkeys_contained_in(pathkeys, cheapest_startup->pathkeys))
+				startup_has_ordered = true;
+
+			if (pathkeys_contained_in(pathkeys, cheapest_total->pathkeys))
+				total_has_ordered = true;
 
 			/*
-			 * If we can't find any paths with the right order just use the
-			 * cheapest-total path; we'll have to sort it later.
+			 * In accordance to current planning logic there are no
+			 * parameterised paths under a merge append.
 			 */
-			if (cheapest_startup == NULL || cheapest_total == NULL)
-			{
-				cheapest_startup = cheapest_total =
-					childrel->cheapest_total_path;
-				/* Assert we do have an unparameterized path for this child */
-				Assert(cheapest_total->param_info == NULL);
-			}
+			Assert(cheapest_startup != NULL && cheapest_total != NULL);
+			Assert(cheapest_total->param_info == NULL);
 
 			/*
 			 * When building a fractional path, determine a cheapest
@@ -1904,21 +1902,20 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 					path_fraction /= childrel->rows;
 
 				cheapest_fractional =
-					get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
-															  pathkeys,
-															  NULL,
-															  path_fraction);
+					get_cheapest_fractional_path_for_pathkeys_ext(root,
+																  childrel,
+																  pathkeys,
+																  NULL,
+																  path_fraction);
+
+				if (pathkeys_contained_in(pathkeys, cheapest_fractional->pathkeys))
+					fractional_has_ordered = true;
 
 				/*
-				 * If we found no path with matching pathkeys, use the
-				 * cheapest total path instead.
-				 *
-				 * XXX We might consider partially sorted paths too (with an
-				 * incremental sort on top). But we'd have to build all the
-				 * incremental paths, do the costing etc.
+				 * In accordance to current planning logic there are no
+				 * parameterised fractional paths under a merge append.
 				 */
-				if (!cheapest_fractional)
-					cheapest_fractional = cheapest_total;
+				Assert(cheapest_fractional != NULL);
 			}
 
 			/*
@@ -2009,19 +2006,20 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		else
 		{
 			/* We need MergeAppend */
-			add_path(rel, (Path *) create_merge_append_path(root,
-															rel,
-															startup_subpaths,
-															pathkeys,
-															NULL));
-			if (startup_neq_total)
+			if (startup_has_ordered)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																startup_subpaths,
+																pathkeys,
+																NULL));
+			if (startup_neq_total && total_has_ordered)
 				add_path(rel, (Path *) create_merge_append_path(root,
 																rel,
 																total_subpaths,
 																pathkeys,
 																NULL));
 
-			if (fractional_subpaths)
+			if (fractional_subpaths && fractional_has_ordered)
 				add_path(rel, (Path *) create_merge_append_path(root,
 																rel,
 																fractional_subpaths,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 8b04d40d36d..0eb618f304c 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -19,11 +19,13 @@
 
 #include "access/stratnum.h"
 #include "catalog/pg_opfamily.h"
+#include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "optimizer/planner.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -648,6 +650,98 @@ get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_path_for_pathkeys_ext
+ *	  Calls get_cheapest_path_for_pathkeys to obtain the cheapest path that
+ *	  satisfies the defined criteria and considers one more option: choose the
+ *	  overall-optimal path (according to the criterion) and explicitly sort its
+ *	  output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_path_for_pathkeys_ext(PlannerInfo *root, RelOptInfo *rel,
+								   List *pathkeys, Relids required_outer,
+								   CostSelector cost_criterion,
+								   bool require_parallel_safe)
+{
+	Path		sort_path;
+	Path	   *base_path = rel->cheapest_total_path;
+	Path	   *path;
+
+	/* In generate_orderedappend_paths() all childrels do have some paths */
+	Assert(base_path);
+
+	path = get_cheapest_path_for_pathkeys(rel->pathlist, pathkeys,
+										  required_outer, cost_criterion,
+										  require_parallel_safe);
+
+	/*
+	 * Stop here if the cheapest total path doesn't satisfy necessary
+	 * conditions
+	 */
+	if ((require_parallel_safe && !base_path->parallel_safe) ||
+		!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path == NULL)
+
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return base_path;
+
+	/* Consider the cheapest total path with extra sort */
+	if (path != base_path)
+	{
+		int			presorted_keys;
+
+		if (!pathkeys_count_contained_in(pathkeys, base_path->pathkeys,
+										 &presorted_keys))
+		{
+			/*
+			 * We'll need to insert a Sort node, so include costs for that.
+			 * We choose to use incremental sort if it is enabled and there
+			 * are presorted keys; otherwise we use full sort.
+			 *
+			 * We can use the parent's LIMIT if any, since we certainly won't
+			 * pull more than that many tuples from any child.
+			 */
+			if (enable_incremental_sort && presorted_keys > 0)
+			{
+				cost_incremental_sort(&sort_path, root, pathkeys,
+									  presorted_keys,
+									  base_path->disabled_nodes,
+									  base_path->startup_cost,
+									  base_path->total_cost, base_path->rows,
+									  base_path->pathtarget->width, 0.0,
+									  work_mem, -1.0);
+			}
+			else
+			{
+				cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+						  base_path->total_cost, base_path->rows,
+						  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+			}
+		}
+		else
+		{
+			sort_path.rows = base_path->rows;
+			sort_path.disabled_nodes = base_path->disabled_nodes;
+			sort_path.startup_cost = base_path->startup_cost;
+			sort_path.total_cost = base_path->total_cost;
+		}
+
+		if (compare_path_costs(&sort_path, path, cost_criterion) < 0)
+			return base_path;
+	}
+	return path;
+}
+
 /*
  * get_cheapest_fractional_path_for_pathkeys
  *	  Find the cheapest path (for retrieving a specified fraction of all
@@ -690,6 +784,95 @@ get_cheapest_fractional_path_for_pathkeys(List *paths,
 	return matched_path;
 }
 
+/*
+ * get_cheapest_fractional_path_for_pathkeys_ext
+ *	  Obtain the cheapest fractional path that satisfies the defined criteria,
+ *	  excluding pathkeys, and explicitly sort its output to satisfy the pathkeys.
+ *
+ *	  Caller is responsible to insert corresponding sort path at the top of
+ *	  returned path if it will be chosen to be used.
+ *
+ *	  Return NULL if no such path.
+ */
+Path *
+get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+											  RelOptInfo *rel,
+											  List *pathkeys,
+											  Relids required_outer,
+											  double fraction)
+{
+	Path		sort_path;
+	Path	   *base_path = rel->cheapest_total_path;
+	Path	   *path;
+
+	/* In generate_orderedappend_paths() all childrels do have some paths */
+	Assert(base_path);
+
+	path = get_cheapest_fractional_path_for_pathkeys(rel->pathlist, pathkeys,
+													 required_outer, fraction);
+
+	/*
+	 * Stop here if the cheapest total path doesn't satisfy necessary
+	 * conditions
+	 */
+	if (!bms_is_subset(PATH_REQ_OUTER(base_path), required_outer))
+		return path;
+
+	if (path == NULL)
+
+		/*
+		 * Current pathlist doesn't fit the pathkeys. No need to check extra
+		 * sort path ways.
+		 */
+		return base_path;
+
+	/* Consider the cheapest total path with extra sort */
+	if (path != base_path)
+	{
+		int			presorted_keys;
+
+		if (!pathkeys_count_contained_in(pathkeys, base_path->pathkeys,
+										 &presorted_keys))
+		{
+			/*
+			 * We'll need to insert a Sort node, so include costs for that.
+			 * We choose to use incremental sort if it is enabled and there
+			 * are presorted keys; otherwise we use full sort.
+			 *
+			 * We can use the parent's LIMIT if any, since we certainly won't
+			 * pull more than that many tuples from any child.
+			 */
+			if (enable_incremental_sort && presorted_keys > 0)
+			{
+				cost_incremental_sort(&sort_path, root, pathkeys,
+									  presorted_keys,
+									  base_path->disabled_nodes,
+									  base_path->startup_cost,
+									  base_path->total_cost, base_path->rows,
+									  base_path->pathtarget->width, 0.0,
+									  work_mem, -1.0);
+			}
+			else
+			{
+				cost_sort(&sort_path, root, pathkeys, base_path->disabled_nodes,
+						  base_path->total_cost, base_path->rows,
+						  base_path->pathtarget->width, 0.0, work_mem, -1.0);
+			}
+		}
+		else
+		{
+			sort_path.rows = base_path->rows;
+			sort_path.disabled_nodes = base_path->disabled_nodes;
+			sort_path.startup_cost = base_path->startup_cost;
+			sort_path.total_cost = base_path->total_cost;
+		}
+
+		if (compare_fractional_path_costs(&sort_path, path, fraction) <= 0)
+			return base_path;
+	}
+
+	return path;
+}
 
 /*
  * get_cheapest_parallel_safe_total_inner
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index cbade77b717..fd7f6f115b3 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -219,10 +219,20 @@ extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
 											bool require_parallel_safe);
+extern Path *get_cheapest_path_for_pathkeys_ext(PlannerInfo *root,
+												RelOptInfo *rel, List *pathkeys,
+												Relids required_outer,
+												CostSelector cost_criterion,
+												bool require_parallel_safe);
 extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 													   List *pathkeys,
 													   Relids required_outer,
 													   double fraction);
+extern Path *get_cheapest_fractional_path_for_pathkeys_ext(PlannerInfo *root,
+														   RelOptInfo *rel,
+														   List *pathkeys,
+														   Relids required_outer,
+														   double fraction);
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 								  ScanDirection scandir);
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 5b5055babdc..a5be7789fd7 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1665,11 +1665,13 @@ insert into matest2 (name) values ('Test 3');
 insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
                          QUERY PLAN                         
 ------------------------------------------------------------
  Sort
+   Disabled: true
    Output: matest0.id, matest0.name, ((1 - matest0.id))
    Sort Key: ((1 - matest0.id))
    ->  Result
@@ -1683,7 +1685,7 @@ explain (verbose, costs off) select * from matest0 order by 1-id;
                      Output: matest0_3.id, matest0_3.name
                ->  Seq Scan on public.matest3 matest0_4
                      Output: matest0_4.id, matest0_4.name
-(14 rows)
+(15 rows)
 
 select * from matest0 order by 1-id;
  id |  name  
@@ -1719,6 +1721,7 @@ select min(1-id) from matest0;
 (1 row)
 
 reset enable_indexscan;
+reset enable_sort;
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
 explain (verbose, costs off) select * from matest0 order by 1-id;
@@ -1844,16 +1847,20 @@ order by t1.b limit 10;
          Merge Cond: (t1.b = t2.b)
          ->  Merge Append
                Sort Key: t1.b
-               ->  Index Scan using matest0i on matest0 t1_1
+               ->  Sort
+                     Sort Key: t1_1.b
+                     ->  Seq Scan on matest0 t1_1
                ->  Index Scan using matest1i on matest1 t1_2
          ->  Materialize
                ->  Merge Append
                      Sort Key: t2.b
-                     ->  Index Scan using matest0i on matest0 t2_1
-                           Filter: (c = d)
+                     ->  Sort
+                           Sort Key: t2_1.b
+                           ->  Seq Scan on matest0 t2_1
+                                 Filter: (c = d)
                      ->  Index Scan using matest1i on matest1 t2_2
                            Filter: (c = d)
-(14 rows)
+(18 rows)
 
 reset enable_nestloop;
 drop table matest0 cascade;
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 24e06845f92..3de11957f18 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -65,31 +65,34 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
 -- inner join with partially-redundant join clauses
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
-                          QUERY PLAN                           
----------------------------------------------------------------
- Sort
+                       QUERY PLAN                        
+---------------------------------------------------------
+ Merge Append
    Sort Key: t1.a
-   ->  Append
-         ->  Merge Join
-               Merge Cond: (t1_1.a = t2_1.a)
-               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
-               ->  Sort
-                     Sort Key: t2_1.b
-                     ->  Seq Scan on prt2_p1 t2_1
-                           Filter: (a = b)
+   ->  Merge Join
+         Merge Cond: (t1_1.a = t2_1.a)
+         ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t2_1.b
+               ->  Seq Scan on prt2_p1 t2_1
+                     Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Join
                Hash Cond: (t1_2.a = t2_2.a)
                ->  Seq Scan on prt1_p2 t1_2
                ->  Hash
                      ->  Seq Scan on prt2_p2 t2_2
                            Filter: (a = b)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Join
                Hash Cond: (t1_3.a = t2_3.a)
                ->  Seq Scan on prt1_p3 t1_3
                ->  Hash
                      ->  Seq Scan on prt2_p3 t2_3
                            Filter: (a = b)
-(22 rows)
+(25 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
  a  |  c   | b  |  c   
@@ -371,9 +374,10 @@ EXPLAIN (COSTS OFF)
 SELECT t1.* FROM prt1 t1 WHERE t1.a IN (SELECT t2.b FROM prt2 t2 WHERE t2.a = 0) AND t1.b = 0 ORDER BY t1.a;
                     QUERY PLAN                    
 --------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a
          ->  Hash Semi Join
                Hash Cond: (t1_1.a = t2_1.b)
                ->  Seq Scan on prt1_p1 t1_1
@@ -381,6 +385,8 @@ SELECT t1.* FROM prt1 t1 WHERE t1.a IN (SELECT t2.b FROM prt2 t2 WHERE t2.a = 0)
                ->  Hash
                      ->  Seq Scan on prt2_p1 t2_1
                            Filter: (a = 0)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Semi Join
                Hash Cond: (t1_2.a = t2_2.b)
                ->  Seq Scan on prt1_p2 t1_2
@@ -388,14 +394,16 @@ SELECT t1.* FROM prt1 t1 WHERE t1.a IN (SELECT t2.b FROM prt2 t2 WHERE t2.a = 0)
                ->  Hash
                      ->  Seq Scan on prt2_p2 t2_2
                            Filter: (a = 0)
-         ->  Nested Loop Semi Join
-               Join Filter: (t1_3.a = t2_3.b)
-               ->  Seq Scan on prt1_p3 t1_3
-                     Filter: (b = 0)
-               ->  Materialize
+   ->  Nested Loop
+         Join Filter: (t1_3.a = t2_3.b)
+         ->  Unique
+               ->  Sort
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3
                            Filter: (a = 0)
-(24 rows)
+         ->  Seq Scan on prt1_p3 t1_3
+               Filter: (b = 0)
+(29 rows)
 
 SELECT t1.* FROM prt1 t1 WHERE t1.a IN (SELECT t2.b FROM prt2 t2 WHERE t2.a = 0) AND t1.b = 0 ORDER BY t1.a;
   a  | b |  c   
@@ -1387,28 +1395,32 @@ SELECT t1.a, t1.c, t2.b, t2.c, t3.a + t3.b, t3.c FROM (prt1 t1 LEFT JOIN prt2 t2
 -- This should generate a partitionwise join, but currently fails to
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-                        QUERY PLAN                         
------------------------------------------------------------
- Incremental Sort
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Sort
    Sort Key: prt1.a, prt2.b
-   Presorted Key: prt1.a
-   ->  Merge Left Join
-         Merge Cond: (prt1.a = prt2.b)
-         ->  Sort
-               Sort Key: prt1.a
-               ->  Append
-                     ->  Seq Scan on prt1_p1 prt1_1
-                           Filter: ((a < 450) AND (b = 0))
-                     ->  Seq Scan on prt1_p2 prt1_2
-                           Filter: ((a < 450) AND (b = 0))
-         ->  Sort
-               Sort Key: prt2.b
-               ->  Append
+   ->  Merge Right Join
+         Merge Cond: (prt2.b = prt1.a)
+         ->  Append
+               ->  Sort
+                     Sort Key: prt2_1.b
                      ->  Seq Scan on prt2_p2 prt2_1
                            Filter: (b > 250)
+               ->  Sort
+                     Sort Key: prt2_2.b
                      ->  Seq Scan on prt2_p3 prt2_2
                            Filter: (b > 250)
-(19 rows)
+         ->  Materialize
+               ->  Append
+                     ->  Sort
+                           Sort Key: prt1_1.a
+                           ->  Seq Scan on prt1_p1 prt1_1
+                                 Filter: ((a < 450) AND (b = 0))
+                     ->  Sort
+                           Sort Key: prt1_2.a
+                           ->  Seq Scan on prt1_p2 prt1_2
+                                 Filter: ((a < 450) AND (b = 0))
+(23 rows)
 
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
   a  |  b  
@@ -1428,25 +1440,33 @@ SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT *
 -- partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
-                                       QUERY PLAN                                        
------------------------------------------------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Merge Join
-   Merge Cond: ((t1.a = t2.b) AND (((((t1.*)::prt1))::text) = ((((t2.*)::prt2))::text)))
-   ->  Sort
-         Sort Key: t1.a, ((((t1.*)::prt1))::text)
-         ->  Result
-               ->  Append
-                     ->  Seq Scan on prt1_p1 t1_1
-                     ->  Seq Scan on prt1_p2 t1_2
-                     ->  Seq Scan on prt1_p3 t1_3
-   ->  Sort
-         Sort Key: t2.b, ((((t2.*)::prt2))::text)
-         ->  Result
-               ->  Append
+   Merge Cond: (t1.a = t2.b)
+   Join Filter: ((((t2.*)::prt2))::text = (((t1.*)::prt1))::text)
+   ->  Append
+         ->  Sort
+               Sort Key: t1_1.a
+               ->  Seq Scan on prt1_p1 t1_1
+         ->  Sort
+               Sort Key: t1_2.a
+               ->  Seq Scan on prt1_p2 t1_2
+         ->  Sort
+               Sort Key: t1_3.a
+               ->  Seq Scan on prt1_p3 t1_3
+   ->  Materialize
+         ->  Append
+               ->  Sort
+                     Sort Key: t2_1.b
                      ->  Seq Scan on prt2_p1 t2_1
+               ->  Sort
+                     Sort Key: t2_2.b
                      ->  Seq Scan on prt2_p2 t2_2
+               ->  Sort
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3
-(16 rows)
+(24 rows)
 
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
  a  | b  
@@ -4512,9 +4532,10 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
                                    QUERY PLAN                                   
 --------------------------------------------------------------------------------
- Sort
+ Merge Append
    Sort Key: t1.a
-   ->  Append
+   ->  Sort
+         Sort Key: t1_1.a
          ->  Hash Right Join
                Hash Cond: ((t3_1.a = t1_1.a) AND (t3_1.c = t1_1.c))
                ->  Seq Scan on plt1_adv_p1 t3_1
@@ -4525,6 +4546,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p1 t1_1
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_2.a
          ->  Hash Right Join
                Hash Cond: ((t3_2.a = t1_2.a) AND (t3_2.c = t1_2.c))
                ->  Seq Scan on plt1_adv_p2 t3_2
@@ -4535,6 +4558,8 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p2 t1_2
                                        Filter: (b < 10)
+   ->  Sort
+         Sort Key: t1_3.a
          ->  Hash Right Join
                Hash Cond: ((t3_3.a = t1_3.a) AND (t3_3.c = t1_3.c))
                ->  Seq Scan on plt1_adv_p3 t3_3
@@ -4545,15 +4570,19 @@ SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2
                            ->  Hash
                                  ->  Seq Scan on plt1_adv_p3 t1_3
                                        Filter: (b < 10)
-         ->  Nested Loop Left Join
-               Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
-               ->  Nested Loop Left Join
-                     Join Filter: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+   ->  Nested Loop Left Join
+         Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
+         ->  Merge Left Join
+               Merge Cond: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+               ->  Sort
+                     Sort Key: t1_4.a, t1_4.c
                      ->  Seq Scan on plt1_adv_extra t1_4
                            Filter: (b < 10)
+               ->  Sort
+                     Sort Key: t2_4.a, t2_4.c
                      ->  Seq Scan on plt2_adv_extra t2_4
-               ->  Seq Scan on plt1_adv_extra t3_4
-(41 rows)
+         ->  Seq Scan on plt1_adv_extra t3_4
+(50 rows)
 
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
  a  |  c   | a |  c   | a |  c   
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index d1966cd7d82..43e2962009e 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -4768,9 +4768,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc.a ORDER BY part_abc.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_1
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d <= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_1.a
+                           ->  Seq Scan on part_abc_2 part_abc_1
+                                 Filter: ((d <= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: part_abc_3.a
                            Subplans Removed: 1
@@ -4785,9 +4786,10 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                Window: w1 AS (PARTITION BY part_abc_5.a ORDER BY part_abc_5.a)
                ->  Append
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_6
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d >= stable_one())
+                     ->  Sort
+                           Sort Key: part_abc_6.a
+                           ->  Seq Scan on part_abc_2 part_abc_6
+                                 Filter: ((d >= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append
                            Sort Key: a
                            Subplans Removed: 1
@@ -4797,7 +4799,7 @@ select min(a) over (partition by a order by a) from part_abc where a >= stable_o
                            ->  Index Scan using part_abc_3_3_a_idx on part_abc_3_3 part_abc_9
                                  Index Cond: (a >= (stable_one() + 1))
                                  Filter: (d >= stable_one())
-(35 rows)
+(37 rows)
 
 drop view part_abc_view;
 drop table part_abc;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 96962817ed4..0ccadea910c 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1207,12 +1207,14 @@ select event_id
 ----------------------------------------------------------
  Merge Append
    Sort Key: events.event_id
-   ->  Index Scan using events_pkey on events
+   ->  Sort
+         Sort Key: events.event_id
+         ->  Seq Scan on events
    ->  Sort
          Sort Key: events_1.event_id
          ->  Seq Scan on events_child events_1
    ->  Index Scan using other_events_pkey on other_events
-(7 rows)
+(9 rows)
 
 drop table events_child, events, other_events;
 reset enable_indexonlyscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 699e8ac09c8..c58beebbd1e 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -641,12 +641,14 @@ insert into matest2 (name) values ('Test 4');
 insert into matest3 (name) values ('Test 5');
 insert into matest3 (name) values ('Test 6');
 
-set enable_indexscan = off;  -- force use of seqscan/sort, so no merge
+set enable_indexscan = off;  -- force use of seqscan/sort
+set enable_sort = off; -- since merge append may employ sort in children we need to disable sort
 explain (verbose, costs off) select * from matest0 order by 1-id;
 select * from matest0 order by 1-id;
 explain (verbose, costs off) select min(1-id) from matest0;
 select min(1-id) from matest0;
 reset enable_indexscan;
+reset enable_sort;
 
 set enable_seqscan = off;  -- plan with fewest seqscans should be merge
 set enable_parallel_append = off; -- Don't let parallel-append interfere
-- 
2.51.0
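
For readers skimming the patch, the selection rule in the `_ext` helpers can be sketched in isolation. This is a simplified illustration under stated assumptions, not the patch itself: `SketchPath`, `SketchCriterion` and `sketch_pick_child()` are hypothetical names, and the sort is modelled as a single caller-supplied `sort_cost` that must be paid in full before the first row is returned, whereas the real code delegates costing to `cost_sort()` or `cost_incremental_sort()`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for a planner path. */
typedef struct SketchPath
{
	double		startup_cost;
	double		total_cost;
	bool		ordered;		/* already satisfies the required ordering */
} SketchPath;

typedef enum SketchCriterion
{
	SKETCH_STARTUP_COST,
	SKETCH_TOTAL_COST
} SketchCriterion;

static double
sketch_cost(const SketchPath *p, SketchCriterion c)
{
	return (c == SKETCH_STARTUP_COST) ? p->startup_cost : p->total_cost;
}

/*
 * ordered_path: cheapest path that already has the ordering, or NULL.
 * base_path:    cheapest-total path of the child, ordered or not.
 * sort_cost:    assumed extra cost of sorting base_path's whole output.
 *
 * Returns the path to use as the child; if the result is not ordered,
 * the caller is expected to put a sort on top of it.
 */
static const SketchPath *
sketch_pick_child(const SketchPath *ordered_path,
				  const SketchPath *base_path,
				  double sort_cost,
				  SketchCriterion criterion)
{
	SketchPath	sorted;

	if (ordered_path == NULL)
		return base_path;		/* nothing is presorted: sort the cheapest */
	if (ordered_path == base_path)
		return ordered_path;	/* the cheapest path is already ordered */

	if (base_path->ordered)
		sorted = *base_path;	/* no sort needed, compare costs as they are */
	else
	{
		/* Assumption: a full sort consumes all input before emitting a row. */
		sorted.startup_cost = base_path->total_cost + sort_cost;
		sorted.total_cost = sorted.startup_cost;
		sorted.ordered = true;
	}

	if (sketch_cost(&sorted, criterion) < sketch_cost(ordered_path, criterion))
		return base_path;
	return ordered_path;
}
```

In this model, an index path with a low startup cost but a high total cost keeps winning under the startup criterion, while a cheap sequential scan plus sort can win under the total criterion.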

#33

Alexander Korotkov

aekorotkov@gmail.com

4 months ago

In reply to: Andrei Lepikhov (#32)

1 attachment(s)

Re: MergeAppend could consider sorting cheapest child path

On Fri, Sep 5, 2025 at 11:45 AM Andrei Lepikhov <lepihov@gmail.com> wrote:

On 1/9/2025 22:26, Alexander Korotkov wrote:

On Thu, Jul 31, 2025 at 5:20 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

See this minor correction in the attachment. postgres_fdw tests are
stable now.

I have another idea. What if we allow MergeAppend paths only when at
least one subpath is preordered? This trick also allows us to exclude
MergeAppend(Sort) dominating Sort(Append). I see the regression test
changes now have much less volume and look more reasonable. What do
you think?
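
A toy model of that rule, for illustration only. This is not the patch
code: the Child type and choose_subpaths() are hypothetical names, and
"at least one chosen subpath is preordered" is one reading of the
proposal. The events and events_child costs are taken from the
union.out plan in costs.diff; the explicit-sort cost for other_events
is not shown in any plan and is a made-up placeholder.

```python
from typing import List, NamedTuple, Optional, Tuple


class Child(NamedTuple):
    name: str
    # Total cost of the cheapest path already in the required order, if any.
    presorted_total: Optional[float]
    # Total cost of the cheapest path of any order plus an explicit Sort.
    sorted_cheapest_total: float


def choose_subpaths(children: List[Child]) -> Optional[List[Tuple[str, str, float]]]:
    """Pick the cheaper option per child; give up on MergeAppend if every
    chosen subpath needs an explicit sort (Sort(Append) covers that case)."""
    chosen = []
    for c in children:
        if c.presorted_total is not None and c.presorted_total <= c.sorted_cheapest_total:
            chosen.append((c.name, "presorted", c.presorted_total))
        else:
            chosen.append((c.name, "sort", c.sorted_cheapest_total))
    if not any(kind == "presorted" for _, kind, _ in chosen):
        return None
    return chosen


union_children = [
    Child("events", 8.14, 0.02),           # Index Scan vs Sort(Seq Scan)
    Child("events_child", None, 186.16),   # no suitable index
    Child("other_events", 82.41, 186.16),  # sort cost here is hypothetical
]

if __name__ == "__main__":
    for name, kind, cost in choose_subpaths(union_children):
        print(f"{name}: {kind} ({cost})")
```

With these inputs the model keeps the index scan on other_events and
sorts the other two children, which matches the shape of the new
union.out plan.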

I believe a slight mistake has been made with the total_has_ordered /
startup_has_ordered parameters, which has caused unnecessary test
changes in inherit.out (See updated patch in the attachment). Although
not the best test in general (it depends on the autovacuum), it
highlighted the case where a startup-optimal strategy is necessary, even
when a fractional-optimal path is available, which may lead to a
continuation of the discussion [1].

Also, do you think get_cheapest_fractional_path_for_pathkeys_ext() and
get_cheapest_path_for_pathkeys_ext() should consider incremental sort?
The revised patch teaches them to do so.

Following 55a780e9476 [2] it should be considered, of course.

Great, thank you for catching this. The diff of costs is attached. I
see the costs now are better or within the fuzz factor.
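
As a rough cross-check of that statement (not part of the patch), the
sketch below compares the total cost of the top plan node before and
after for each changed plan in costs.diff. The 1.01 factor is an
assumption: it is the standard fuzz factor PostgreSQL uses when
comparing path costs, and it is assumed here to be the "fuzz factor"
meant above. Only total costs are compared; startup costs are ignored.

```python
# Assumed fuzz factor (PostgreSQL compares path costs with a 1% fuzz).
FUZZ_FACTOR = 1.01

# (regression test, top-node total cost before, after), copied from costs.diff.
TOP_LEVEL_TOTAL_COSTS = [
    ("inherit", 20.88, 18.92),
    ("union", 342.66, 334.53),
    ("partition_join", 11.53, 11.58),
    ("partition_join", 12.44, 12.50),
    ("partition_join", 15.02, 14.25),
    ("partition_join", 42.25, 37.28),
    ("partition_join", 20.83, 20.94),
    ("partition_prune", 1271.00, 1256.77),
]


def classify(before, after, fuzz=FUZZ_FACTOR):
    """Label a cost change as better, within the fuzz factor, or worse."""
    if after < before:
        return "better"
    if after <= before * fuzz:
        return "within fuzz"
    return "worse"


def summarize(rows=TOP_LEVEL_TOTAL_COSTS):
    """Return (test, before, after, label) for every row."""
    return [(name, b, a, classify(b, a)) for name, b, a in rows]


if __name__ == "__main__":
    for name, before, after, label in summarize():
        print(f"{name}: {before} -> {after} ({label})")
```

Under that assumption, five of the eight plans get cheaper and the
three that get slightly more expensive stay within 1%.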

------
Regards,
Alexander Korotkov
Supabase

Attachments:

costs.diff (application/octet-stream)

diff -U3 /Users/smagen/projects/postgresql/env/master/src/src/test/regress/expected/inherit.out /Users/smagen/projects/postgresql/env/master/src/src/test/regress/results/inherit.out
--- /Users/smagen/projects/postgresql/env/master/src/src/test/regress/expected/inherit.out	2025-09-07 11:27:22
+++ /Users/smagen/projects/postgresql/env/master/src/src/test/regress/results/inherit.out	2025-09-07 14:22:44
@@ -1842,21 +1842,25 @@
 order by t1.b limit 10;
                                               QUERY PLAN                                              
 ------------------------------------------------------------------------------------------------------
- Limit  (cost=0.57..20.88 rows=10 width=16)
-   ->  Merge Join  (cost=0.57..189.37 rows=93 width=16)
+ Limit  (cost=0.35..18.92 rows=10 width=16)
+   ->  Merge Join  (cost=0.35..173.11 rows=93 width=16)
          Merge Cond: (t1.b = t2.b)
-         ->  Merge Append  (cost=0.29..98.56 rows=1851 width=16)
+         ->  Merge Append  (cost=0.17..90.44 rows=1851 width=16)
                Sort Key: t1.b
-               ->  Index Scan using matest0i on matest0 t1_1  (cost=0.12..8.14 rows=1 width=16)
-               ->  Index Scan using matest1i on matest1 t1_2  (cost=0.15..71.90 rows=1850 width=16)
-         ->  Materialize  (cost=0.29..84.81 rows=10 width=4)
-               ->  Merge Append  (cost=0.29..84.78 rows=10 width=4)
+               ->  Sort  (cost=0.01..0.02 rows=1 width=16)
+                     Sort Key: t1_1.b
+                     ->  Seq Scan on matest0 t1_1  (cost=0.00..0.00 rows=1 width=16)
+               ->  Index Scan using matest1i on matest1 t1_2  (cost=0.15..71.90 rows=1850 width=16)
+         ->  Materialize  (cost=0.17..76.68 rows=10 width=4)
+               ->  Merge Append  (cost=0.17..76.65 rows=10 width=4)
                      Sort Key: t2.b
-                     ->  Index Scan using matest0i on matest0 t2_1  (cost=0.12..8.14 rows=1 width=4)
-                           Filter: (c = d)
+                     ->  Sort  (cost=0.01..0.02 rows=1 width=4)
+                           Sort Key: t2_1.b
+                           ->  Seq Scan on matest0 t2_1  (cost=0.00..0.00 rows=1 width=4)
+                                 Filter: (c = d)
                      ->  Index Scan using matest1i on matest1 t2_2  (cost=0.15..76.53 rows=9 width=4)
                            Filter: (c = d)
-(14 rows)
+(18 rows)
 
 reset enable_nestloop;
 drop table matest0 cascade;
diff -U3 /Users/smagen/projects/postgresql/env/master/src/src/test/regress/expected/union.out /Users/smagen/projects/postgresql/env/master/src/src/test/regress/results/union.out
--- /Users/smagen/projects/postgresql/env/master/src/src/test/regress/expected/union.out	2025-09-07 11:27:30
+++ /Users/smagen/projects/postgresql/env/master/src/src/test/regress/results/union.out	2025-09-07 14:22:46
@@ -1205,14 +1205,16 @@
  order by event_id;
                                            QUERY PLAN                                           
 ------------------------------------------------------------------------------------------------
- Merge Append  (cost=180.09..342.66 rows=5101 width=4)
+ Merge Append  (cost=179.97..334.53 rows=5101 width=4)
    Sort Key: events.event_id
-   ->  Index Scan using events_pkey on events  (cost=0.12..8.14 rows=1 width=4)
+   ->  Sort  (cost=0.01..0.02 rows=1 width=4)
+         Sort Key: events.event_id
+         ->  Seq Scan on events  (cost=0.00..0.00 rows=1 width=4)
    ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
          Sort Key: events_1.event_id
          ->  Seq Scan on events_child events_1  (cost=0.00..35.50 rows=2550 width=4)
    ->  Index Scan using other_events_pkey on other_events  (cost=0.15..82.41 rows=2550 width=4)
-(7 rows)
+(9 rows)
 
 drop table events_child, events, other_events;
 reset enable_indexonlyscan;
diff -U3 /Users/smagen/projects/postgresql/env/master/src/src/test/regress/expected/partition_join.out /Users/smagen/projects/postgresql/env/master/src/src/test/regress/results/partition_join.out
--- /Users/smagen/projects/postgresql/env/master/src/src/test/regress/expected/partition_join.out	2025-09-07 11:27:39
+++ /Users/smagen/projects/postgresql/env/master/src/src/test/regress/results/partition_join.out	2025-09-07 14:23:04
@@ -65,31 +65,34 @@
 -- inner join with partially-redundant join clauses
 EXPLAIN
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
-                                             QUERY PLAN                                             
-----------------------------------------------------------------------------------------------------
- Sort  (cost=11.52..11.53 rows=3 width=18)
+                                          QUERY PLAN                                          
+----------------------------------------------------------------------------------------------
+ Merge Append  (cost=10.15..11.58 rows=3 width=18)
    Sort Key: t1.a
-   ->  Append  (cost=2.20..11.50 rows=3 width=18)
-         ->  Merge Join  (cost=2.20..3.58 rows=1 width=18)
-               Merge Cond: (t1_1.a = t2_1.a)
-               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1  (cost=0.14..14.02 rows=125 width=9)
-               ->  Sort  (cost=2.06..2.06 rows=1 width=13)
-                     Sort Key: t2_1.b
-                     ->  Seq Scan on prt2_p1 t2_1  (cost=0.00..2.05 rows=1 width=13)
-                           Filter: (a = b)
+   ->  Merge Join  (cost=2.20..3.58 rows=1 width=18)
+         Merge Cond: (t1_1.a = t2_1.a)
+         ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1  (cost=0.14..14.02 rows=125 width=9)
+         ->  Sort  (cost=2.06..2.06 rows=1 width=13)
+               Sort Key: t2_1.b
+               ->  Seq Scan on prt2_p1 t2_1  (cost=0.00..2.05 rows=1 width=13)
+                     Filter: (a = b)
+   ->  Sort  (cost=4.79..4.79 rows=1 width=18)
+         Sort Key: t1_2.a
          ->  Hash Join  (cost=2.05..4.78 rows=1 width=18)
                Hash Cond: (t1_2.a = t2_2.a)
                ->  Seq Scan on prt1_p2 t1_2  (cost=0.00..2.25 rows=125 width=9)
                ->  Hash  (cost=2.04..2.04 rows=1 width=13)
                      ->  Seq Scan on prt2_p2 t2_2  (cost=0.00..2.04 rows=1 width=13)
                            Filter: (a = b)
+   ->  Sort  (cost=3.13..3.14 rows=1 width=18)
+         Sort Key: t1_3.a
          ->  Hash Join  (cost=1.43..3.12 rows=1 width=18)
                Hash Cond: (t1_3.a = t2_3.a)
                ->  Seq Scan on prt1_p3 t1_3  (cost=0.00..1.50 rows=50 width=9)
                ->  Hash  (cost=1.41..1.41 rows=1 width=13)
                      ->  Seq Scan on prt2_p3 t2_3  (cost=0.00..1.41 rows=1 width=13)
                            Filter: (a = b)
-(22 rows)
+(25 rows)
 
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
  a  |  c   | b  |  c   
@@ -371,9 +374,10 @@
 SELECT t1.* FROM prt1 t1 WHERE t1.a IN (SELECT t2.b FROM prt2 t2 WHERE t2.a = 0) AND t1.b = 0 ORDER BY t1.a;
                                      QUERY PLAN                                     
 ------------------------------------------------------------------------------------
- Sort  (cost=12.44..12.44 rows=3 width=13)
+ Merge Append  (cost=10.79..12.50 rows=3 width=13)
    Sort Key: t1.a
-   ->  Append  (cost=2.10..12.41 rows=3 width=13)
+   ->  Sort  (cost=4.69..4.69 rows=1 width=13)
+         Sort Key: t1_1.a
          ->  Hash Semi Join  (cost=2.10..4.68 rows=1 width=13)
                Hash Cond: (t1_1.a = t2_1.b)
                ->  Seq Scan on prt1_p1 t1_1  (cost=0.00..2.56 rows=5 width=13)
@@ -381,6 +385,8 @@
                ->  Hash  (cost=2.05..2.05 rows=4 width=4)
                      ->  Seq Scan on prt2_p1 t2_1  (cost=0.00..2.05 rows=4 width=4)
                            Filter: (a = 0)
+   ->  Sort  (cost=4.66..4.67 rows=1 width=13)
+         Sort Key: t1_2.a
          ->  Hash Semi Join  (cost=2.08..4.65 rows=1 width=13)
                Hash Cond: (t1_2.a = t2_2.b)
                ->  Seq Scan on prt1_p2 t1_2  (cost=0.00..2.56 rows=5 width=13)
@@ -388,14 +394,16 @@
                ->  Hash  (cost=2.04..2.04 rows=3 width=4)
                      ->  Seq Scan on prt2_p2 t2_2  (cost=0.00..2.04 rows=3 width=4)
                            Filter: (a = 0)
-         ->  Nested Loop Semi Join  (cost=0.00..3.07 rows=1 width=13)
-               Join Filter: (t1_3.a = t2_3.b)
-               ->  Seq Scan on prt1_p3 t1_3  (cost=0.00..1.62 rows=2 width=13)
-                     Filter: (b = 0)
-               ->  Materialize  (cost=0.00..1.42 rows=1 width=4)
+   ->  Nested Loop  (cost=1.42..3.08 rows=1 width=13)
+         Join Filter: (t1_3.a = t2_3.b)
+         ->  Unique  (cost=1.42..1.43 rows=1 width=4)
+               ->  Sort  (cost=1.42..1.43 rows=1 width=4)
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3  (cost=0.00..1.41 rows=1 width=4)
                            Filter: (a = 0)
-(24 rows)
+         ->  Seq Scan on prt1_p3 t1_3  (cost=0.00..1.62 rows=2 width=13)
+               Filter: (b = 0)
+(29 rows)
 
 SELECT t1.* FROM prt1 t1 WHERE t1.a IN (SELECT t2.b FROM prt2 t2 WHERE t2.a = 0) AND t1.b = 0 ORDER BY t1.a;
   a  | b |  c   
@@ -1387,28 +1395,32 @@
 -- This should generate a partitionwise join, but currently fails to
 EXPLAIN
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-                                      QUERY PLAN                                       
----------------------------------------------------------------------------------------
- Incremental Sort  (cost=14.03..15.02 rows=9 width=8)
+                                         QUERY PLAN                                         
+--------------------------------------------------------------------------------------------
+ Sort  (cost=14.23..14.25 rows=9 width=8)
    Sort Key: prt1.a, prt2.b
-   Presorted Key: prt1.a
-   ->  Merge Left Join  (cost=13.95..14.61 rows=9 width=8)
-         Merge Cond: (prt1.a = prt2.b)
-         ->  Sort  (cost=5.94..5.96 rows=9 width=4)
-               Sort Key: prt1.a
-               ->  Append  (cost=0.00..5.79 rows=9 width=4)
-                     ->  Seq Scan on prt1_p1 prt1_1  (cost=0.00..2.88 rows=5 width=4)
-                           Filter: ((a < 450) AND (b = 0))
-                     ->  Seq Scan on prt1_p2 prt1_2  (cost=0.00..2.88 rows=4 width=4)
-                           Filter: ((a < 450) AND (b = 0))
-         ->  Sort  (cost=8.01..8.30 rows=116 width=4)
-               Sort Key: prt2.b
-               ->  Append  (cost=0.00..4.03 rows=116 width=4)
+   ->  Merge Right Join  (cost=12.78..14.09 rows=9 width=8)
+         Merge Cond: (prt2.b = prt1.a)
+         ->  Append  (cost=6.93..7.80 rows=116 width=4)
+               ->  Sort  (cost=4.68..4.89 rows=83 width=4)
+                     Sort Key: prt2_1.b
                      ->  Seq Scan on prt2_p2 prt2_1  (cost=0.00..2.04 rows=83 width=4)
                            Filter: (b > 250)
+               ->  Sort  (cost=2.24..2.33 rows=33 width=4)
+                     Sort Key: prt2_2.b
                      ->  Seq Scan on prt2_p3 prt2_2  (cost=0.00..1.41 rows=33 width=4)
                            Filter: (b > 250)
-(19 rows)
+         ->  Materialize  (cost=5.85..5.94 rows=9 width=4)
+               ->  Append  (cost=5.85..5.92 rows=9 width=4)
+                     ->  Sort  (cost=2.93..2.95 rows=5 width=4)
+                           Sort Key: prt1_1.a
+                           ->  Seq Scan on prt1_p1 prt1_1  (cost=0.00..2.88 rows=5 width=4)
+                                 Filter: ((a < 450) AND (b = 0))
+                     ->  Sort  (cost=2.92..2.92 rows=4 width=4)
+                           Sort Key: prt1_2.a
+                           ->  Seq Scan on prt1_p2 prt1_2  (cost=0.00..2.88 rows=4 width=4)
+                                 Filter: ((a < 450) AND (b = 0))
+(23 rows)
 
 SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
   a  |  b  
@@ -1428,25 +1440,33 @@
 -- partitionwise join does not apply
 EXPLAIN
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
-                                       QUERY PLAN                                        
------------------------------------------------------------------------------------------
- Merge Join  (cost=33.49..42.25 rows=1 width=8)
-   Merge Cond: ((t1.a = t2.b) AND (((((t1.*)::prt1))::text) = ((((t2.*)::prt2))::text)))
-   ->  Sort  (cost=19.84..20.59 rows=300 width=36)
-         Sort Key: t1.a, ((((t1.*)::prt1))::text)
-         ->  Result  (cost=0.00..7.50 rows=300 width=36)
-               ->  Append  (cost=0.00..7.50 rows=300 width=36)
-                     ->  Seq Scan on prt1_p1 t1_1  (cost=0.00..2.25 rows=125 width=36)
-                     ->  Seq Scan on prt1_p2 t1_2  (cost=0.00..2.25 rows=125 width=36)
-                     ->  Seq Scan on prt1_p3 t1_3  (cost=0.00..1.50 rows=50 width=36)
-   ->  Sort  (cost=13.64..14.14 rows=200 width=36)
-         Sort Key: t2.b, ((((t2.*)::prt2))::text)
-         ->  Result  (cost=0.00..6.00 rows=200 width=36)
-               ->  Append  (cost=0.00..6.00 rows=200 width=36)
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Merge Join  (cost=27.28..37.28 rows=1 width=8)
+   Merge Cond: (t1.a = t2.b)
+   Join Filter: ((((t2.*)::prt2))::text = (((t1.*)::prt1))::text)
+   ->  Append  (cost=16.12..18.37 rows=300 width=36)
+         ->  Sort  (cost=6.60..6.92 rows=125 width=36)
+               Sort Key: t1_1.a
+               ->  Seq Scan on prt1_p1 t1_1  (cost=0.00..2.25 rows=125 width=36)
+         ->  Sort  (cost=6.60..6.92 rows=125 width=36)
+               Sort Key: t1_2.a
+               ->  Seq Scan on prt1_p2 t1_2  (cost=0.00..2.25 rows=125 width=36)
+         ->  Sort  (cost=2.91..3.04 rows=50 width=36)
+               Sort Key: t1_3.a
+               ->  Seq Scan on prt1_p3 t1_3  (cost=0.00..1.50 rows=50 width=36)
+   ->  Materialize  (cost=11.16..13.16 rows=200 width=36)
+         ->  Append  (cost=11.16..12.66 rows=200 width=36)
+               ->  Sort  (cost=4.52..4.73 rows=84 width=36)
+                     Sort Key: t2_1.b
                      ->  Seq Scan on prt2_p1 t2_1  (cost=0.00..1.84 rows=84 width=36)
+               ->  Sort  (cost=4.48..4.68 rows=83 width=36)
+                     Sort Key: t2_2.b
                      ->  Seq Scan on prt2_p2 t2_2  (cost=0.00..1.83 rows=83 width=36)
+               ->  Sort  (cost=2.16..2.24 rows=33 width=36)
+                     Sort Key: t2_3.b
                      ->  Seq Scan on prt2_p3 t2_3  (cost=0.00..1.33 rows=33 width=36)
-(16 rows)
+(24 rows)
 
 SELECT t1.a, t2.b FROM prt1 t1, prt2 t2 WHERE t1::text = t2::text AND t1.a = t2.b ORDER BY t1.a;
  a  | b  
@@ -4512,9 +4532,10 @@
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
                                              QUERY PLAN                                             
 ----------------------------------------------------------------------------------------------------
- Sort  (cost=20.82..20.83 rows=4 width=34)
+ Merge Append  (cost=19.81..20.94 rows=4 width=34)
    Sort Key: t1.a
-   ->  Append  (cost=3.84..20.78 rows=4 width=34)
+   ->  Sort  (cost=5.91..5.92 rows=1 width=34)
+         Sort Key: t1_1.a
          ->  Hash Right Join  (cost=3.84..5.90 rows=1 width=34)
                Hash Cond: ((t3_1.a = t1_1.a) AND (t3_1.c = t1_1.c))
                ->  Seq Scan on plt1_adv_p1 t3_1  (cost=0.00..1.60 rows=60 width=9)
@@ -4525,6 +4546,8 @@
                            ->  Hash  (cost=1.75..1.75 rows=1 width=9)
                                  ->  Seq Scan on plt1_adv_p1 t1_1  (cost=0.00..1.75 rows=1 width=9)
                                        Filter: (b < 10)
+   ->  Sort  (cost=5.91..5.92 rows=1 width=34)
+         Sort Key: t1_2.a
          ->  Hash Right Join  (cost=3.84..5.90 rows=1 width=34)
                Hash Cond: ((t3_2.a = t1_2.a) AND (t3_2.c = t1_2.c))
                ->  Seq Scan on plt1_adv_p2 t3_2  (cost=0.00..1.60 rows=60 width=9)
@@ -4535,6 +4558,8 @@
                            ->  Hash  (cost=1.75..1.75 rows=1 width=9)
                                  ->  Seq Scan on plt1_adv_p2 t1_2  (cost=0.00..1.75 rows=1 width=9)
                                        Filter: (b < 10)
+   ->  Sort  (cost=5.91..5.92 rows=1 width=34)
+         Sort Key: t1_3.a
          ->  Hash Right Join  (cost=3.84..5.90 rows=1 width=34)
                Hash Cond: ((t3_3.a = t1_3.a) AND (t3_3.c = t1_3.c))
                ->  Seq Scan on plt1_adv_p3 t3_3  (cost=0.00..1.60 rows=60 width=9)
@@ -4545,15 +4570,19 @@
                            ->  Hash  (cost=1.75..1.75 rows=1 width=9)
                                  ->  Seq Scan on plt1_adv_p3 t1_3  (cost=0.00..1.75 rows=1 width=9)
                                        Filter: (b < 10)
-         ->  Nested Loop Left Join  (cost=0.00..3.06 rows=1 width=34)
-               Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
-               ->  Nested Loop Left Join  (cost=0.00..2.04 rows=1 width=25)
-                     Join Filter: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+   ->  Nested Loop Left Join  (cost=2.04..3.10 rows=1 width=34)
+         Join Filter: ((t1_4.a = t3_4.a) AND (t1_4.c = t3_4.c))
+         ->  Merge Left Join  (cost=2.04..2.07 rows=1 width=25)
+               Merge Cond: ((t1_4.a = t2_4.a) AND (t1_4.c = t2_4.c))
+               ->  Sort  (cost=1.02..1.03 rows=1 width=36)
+                     Sort Key: t1_4.a, t1_4.c
                      ->  Seq Scan on plt1_adv_extra t1_4  (cost=0.00..1.01 rows=1 width=36)
                            Filter: (b < 10)
+               ->  Sort  (cost=1.02..1.02 rows=1 width=36)
+                     Sort Key: t2_4.a, t2_4.c
                      ->  Seq Scan on plt2_adv_extra t2_4  (cost=0.00..1.01 rows=1 width=36)
-               ->  Seq Scan on plt1_adv_extra t3_4  (cost=0.00..1.01 rows=1 width=36)
-(41 rows)
+         ->  Seq Scan on plt1_adv_extra t3_4  (cost=0.00..1.01 rows=1 width=36)
+(50 rows)
 
 SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM plt1_adv t1 LEFT JOIN plt2_adv t2 ON (t1.a = t2.a AND t1.c = t2.c) LEFT JOIN plt1_adv t3 ON (t1.a = t3.a AND t1.c = t3.c) WHERE t1.b < 10 ORDER BY t1.a;
  a  |  c   | a |  c   | a |  c   
diff -U3 /Users/smagen/projects/postgresql/env/master/src/src/test/regress/expected/partition_prune.out /Users/smagen/projects/postgresql/env/master/src/src/test/regress/results/partition_prune.out
--- /Users/smagen/projects/postgresql/env/master/src/src/test/regress/expected/partition_prune.out	2025-09-07 11:29:17
+++ /Users/smagen/projects/postgresql/env/master/src/src/test/regress/results/partition_prune.out	2025-09-07 14:23:04
@@ -4762,15 +4762,16 @@
 select min(a) over (partition by a order by a) from part_abc where a >= stable_one() + 1 and d >= stable_one();
                                                              QUERY PLAN                                                             
 ------------------------------------------------------------------------------------------------------------------------------------
- Append  (cost=3.21..1271.00 rows=1050 width=4)
-   ->  Subquery Scan on "*SELECT* 1_1"  (cost=3.21..632.87 rows=525 width=4)
-         ->  WindowAgg  (cost=3.21..627.62 rows=525 width=8)
+ Append  (cost=4.35..1256.77 rows=1050 width=4)
+   ->  Subquery Scan on "*SELECT* 1_1"  (cost=4.35..625.76 rows=525 width=4)
+         ->  WindowAgg  (cost=4.35..620.51 rows=525 width=8)
                Window: w1 AS (PARTITION BY part_abc.a ORDER BY part_abc.a)
-               ->  Append  (cost=2.02..618.44 rows=525 width=4)
+               ->  Append  (cost=3.17..611.32 rows=525 width=4)
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_1  (cost=0.38..8.65 rows=1 width=4)
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d <= stable_one())
+                     ->  Sort  (cost=1.53..1.53 rows=1 width=4)
+                           Sort Key: part_abc_1.a
+                           ->  Seq Scan on part_abc_2 part_abc_1  (cost=0.00..1.52 rows=1 width=4)
+                                 Filter: ((d <= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append  (cost=1.24..456.65 rows=393 width=4)
                            Sort Key: part_abc_3.a
                            Subplans Removed: 1
@@ -4780,14 +4781,15 @@
                            ->  Index Scan using part_abc_3_2_a_idx on part_abc_3_2 part_abc_4  (cost=0.41..150.52 rows=131 width=4)
                                  Index Cond: (a >= (stable_one() + 1))
                                  Filter: (d <= stable_one())
-   ->  Subquery Scan on "*SELECT* 2"  (cost=3.21..632.87 rows=525 width=4)
-         ->  WindowAgg  (cost=3.21..627.62 rows=525 width=8)
+   ->  Subquery Scan on "*SELECT* 2"  (cost=4.35..625.76 rows=525 width=4)
+         ->  WindowAgg  (cost=4.35..620.51 rows=525 width=8)
                Window: w1 AS (PARTITION BY part_abc_5.a ORDER BY part_abc_5.a)
-               ->  Append  (cost=2.02..618.44 rows=525 width=4)
+               ->  Append  (cost=3.17..611.32 rows=525 width=4)
                      Subplans Removed: 1
-                     ->  Index Scan using part_abc_2_a_idx on part_abc_2 part_abc_6  (cost=0.38..8.65 rows=1 width=4)
-                           Index Cond: (a >= (stable_one() + 1))
-                           Filter: (d >= stable_one())
+                     ->  Sort  (cost=1.53..1.53 rows=1 width=4)
+                           Sort Key: part_abc_6.a
+                           ->  Seq Scan on part_abc_2 part_abc_6  (cost=0.00..1.52 rows=1 width=4)
+                                 Filter: ((d >= stable_one()) AND (a >= (stable_one() + 1)))
                      ->  Merge Append  (cost=1.24..456.65 rows=393 width=4)
                            Sort Key: a
                            Subplans Removed: 1
@@ -4797,7 +4799,7 @@
                            ->  Index Scan using part_abc_3_3_a_idx on part_abc_3_3 part_abc_9  (cost=0.41..150.52 rows=131 width=4)
                                  Index Cond: (a >= (stable_one() + 1))
                                  Filter: (d >= stable_one())
-(35 rows)
+(37 rows)
 
 drop view part_abc_view;
 drop table part_abc;

#34

Richard Guo

guofenglinux@gmail.com

4 months ago

In reply to: Alexander Korotkov (#33)

Re: MergeAppend could consider sorting cheapest child path

On Sun, Sep 7, 2025 at 8:26 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Great, thank you for catching this. The diff of costs is attached. I
see the costs now are better or within the fuzz factor.

Will have a review by the end of this commitfest.

- Richard

#35

Alexander Korotkov

aekorotkov@gmail.com

4 months ago

In reply to: Richard Guo (#34)

Re: MergeAppend could consider sorting cheapest child path

On Mon, Sep 8, 2025 at 11:39 AM Richard Guo <guofenglinux@gmail.com> wrote:

On Sun, Sep 7, 2025 at 8:26 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Great, thank you for catching this. The diff of costs is attached. I
see the costs now are better or within the fuzz factor.

Will have a review by the end of this commitfest.

Great, thank you, Richard!

------
Regards,
Alexander Korotkov
Supabase

#36

Alexander Korotkov

aekorotkov@gmail.com

3 months ago

In reply to: Richard Guo (#34)

Re: MergeAppend could consider sorting cheapest child path

On Mon, Sep 8, 2025 at 11:39 AM Richard Guo <guofenglinux@gmail.com> wrote:

On Sun, Sep 7, 2025 at 8:26 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Great, thank you for catching this. The diff of costs is attached. I
see the costs now are better or within the fuzz factor.

Will have a review by the end of this commitfest.

Did you manage to take a look at this patch?

------
Regards,
Alexander Korotkov
Supabase

#37

Alena Rybakina

a.rybakina@postgrespro.ru

3 months ago

In reply to: Alexander Korotkov (#33)

Re: MergeAppend could consider sorting cheapest child path

On 07.09.2025 14:26, Alexander Korotkov wrote:

On Fri, Sep 5, 2025 at 11:45 AM Andrei Lepikhov <lepihov@gmail.com> wrote:

On 1/9/2025 22:26, Alexander Korotkov wrote:

On Thu, Jul 31, 2025 at 5:20 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

See this minor correction in the attachment. postgres_fdw tests are
stable now.

I have another idea. What if we allow MergeAppend paths only when at
least one subpath is preordered? This trick also allows us to exclude
MergeAppend(Sort) dominating Sort(Append). I see the regression test
changes now have much less volume and look more reasonable. What do
you think?

I believe a slight mistake has been made with the total_has_ordered /
startup_has_ordered parameters, which has caused unnecessary test
changes in inherit.out (See updated patch in the attachment). Although
not the best test in general (it depends on the autovacuum), it
highlighted the case where a startup-optimal strategy is necessary, even
when a fractional-optimal path is available, which may lead to a
continuation of the discussion [1].

Also, do you think get_cheapest_fractional_path_for_pathkeys_ext() and
get_cheapest_path_for_pathkeys_ext() should consider incremental sort?
The revised patch teaches them to do so.

Following 55a780e9476 [2] it should be considered, of course.

Great, thank you for catching this. The diff of costs is attached. I
see the costs now are better or within the fuzz factor.

Hi! I looked at regression test changes but one of them confused me.

Example where the plan shape changed:

EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b
FROM (SELECT * FROM prt1 WHERE a < 450) t1
LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2
ON t1.a = t2.b
WHERE t1.b = 0
ORDER BY t1.a, t2.b;

-- before
                        QUERY PLAN
-----------------------------------------------------------
Incremental Sort
   Sort Key: prt1.a, prt2.b
   Presorted Key: prt1.a
   -> Merge Left Join
         Merge Cond: (prt1.a = prt2.b)
         -> Sort
               Sort Key: prt1.a
               -> Append
                     -> Seq Scan on prt1_p1 prt1_1
                           Filter: ((a < 450) AND (b = 0))
                     -> Seq Scan on prt1_p2 prt1_2
                           Filter: ((a < 450) AND (b = 0))
         -> Sort
               Sort Key: prt2.b
               -> Append
                     -> Seq Scan on prt2_p2 prt2_1
                           Filter: (b > 250)
                     -> Seq Scan on prt2_p3 prt2_2
                           Filter: (b > 250)
(19 rows)

-- now
                           QUERY PLAN
-----------------------------------------------------------------
Sort
   Sort Key: prt1.a, prt2.b
   -> Merge Right Join
         Merge Cond: (prt2.b = prt1.a)
         -> Append
               -> Sort
                     Sort Key: prt2_1.b
                     -> Seq Scan on prt2_p2 prt2_1
                           Filter: (b > 250)
               -> Sort
                     Sort Key: prt2_2.b
                     -> Seq Scan on prt2_p3 prt2_2
                           Filter: (b > 250)
         -> Materialize
               -> Append
                     -> Sort
                           Sort Key: prt1_1.a
                           -> Seq Scan on prt1_p1 prt1_1
                                 Filter: ((a < 450) AND (b = 0))
                     -> Sort
                           Sort Key: prt1_2.a
                           -> Seq Scan on prt1_p2 prt1_2
                                 Filter: ((a < 450) AND (b = 0))
(23 rows)

Previously we had an incremental sort on (t1.a, t2.b) with prt1.a already
presorted; now we sort on both t1.a and t2.b after a merge right join.
This looks inefficient, or did I miss something?

Other tests looked fine to me.

#38

Alexander Korotkov

aekorotkov@gmail.com

3 months ago

In reply to: Alena Rybakina (#37)

Re: MergeAppend could consider sorting cheapest child path

Hi, Alena!

On Fri, Oct 3, 2025 at 1:42 AM Alena Rybakina <a.rybakina@postgrespro.ru> wrote:

On 07.09.2025 14:26, Alexander Korotkov wrote:

On Fri, Sep 5, 2025 at 11:45 AM Andrei Lepikhov <lepihov@gmail.com> wrote:

On 1/9/2025 22:26, Alexander Korotkov wrote:

On Thu, Jul 31, 2025 at 5:20 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

See this minor correction in the attachment. postgres_fdw tests are
stable now.

I have another idea. What if we allow MergeAppend paths only when at
least one subpath is preordered? This trick also allows us to exclude
MergeAppend(Sort) dominating Sort(Append). I see the regression test
changes now have much less volume and look more reasonable. What do
you think?

I believe a slight mistake has been made with the total_has_ordered /
startup_has_ordered parameters, which has caused unnecessary test
changes in inherit.out (See updated patch in the attachment). Although
not the best test in general (it depends on the autovacuum), it
highlighted the case where a startup-optimal strategy is necessary, even
when a fractional-optimal path is available, which may lead to a
continuation of the discussion [1].

Also, do you think get_cheapest_fractional_path_for_pathkeys_ext() and
get_cheapest_path_for_pathkeys_ext() should consider incremental sort?
The revised patch teaches them to do so.

Following 55a780e9476 [2] it should be considered, of course.

Great, thank you for catching this. The diff of costs is attached. I
see the costs now are better or within the fuzz factor.

Hi! I looked at regression test changes but one of them confused me.

Example where the plan shape changed:

EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b
FROM (SELECT * FROM prt1 WHERE a < 450) t1
LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2
ON t1.a = t2.b
WHERE t1.b = 0
ORDER BY t1.a, t2.b;

-- before
QUERY PLAN
-----------------------------------------------------------
 Incremental Sort
   Sort Key: prt1.a, prt2.b
   Presorted Key: prt1.a
   ->  Merge Left Join
         Merge Cond: (prt1.a = prt2.b)
         ->  Sort
               Sort Key: prt1.a
               ->  Append
                     ->  Seq Scan on prt1_p1 prt1_1
                           Filter: ((a < 450) AND (b = 0))
                     ->  Seq Scan on prt1_p2 prt1_2
                           Filter: ((a < 450) AND (b = 0))
         ->  Sort
               Sort Key: prt2.b
               ->  Append
                     ->  Seq Scan on prt2_p2 prt2_1
                           Filter: (b > 250)
                     ->  Seq Scan on prt2_p3 prt2_2
                           Filter: (b > 250)
(19 rows)

-- now
QUERY PLAN
-----------------------------------------------------------------
 Sort
   Sort Key: prt1.a, prt2.b
   ->  Merge Right Join
         Merge Cond: (prt2.b = prt1.a)
         ->  Append
               ->  Sort
                     Sort Key: prt2_1.b
                     ->  Seq Scan on prt2_p2 prt2_1
                           Filter: (b > 250)
               ->  Sort
                     Sort Key: prt2_2.b
                     ->  Seq Scan on prt2_p3 prt2_2
                           Filter: (b > 250)
         ->  Materialize
               ->  Append
                     ->  Sort
                           Sort Key: prt1_1.a
                           ->  Seq Scan on prt1_p1 prt1_1
                                 Filter: ((a < 450) AND (b = 0))
                     ->  Sort
                           Sort Key: prt1_2.a
                           ->  Seq Scan on prt1_p2 prt1_2
                                 Filter: ((a < 450) AND (b = 0))
(23 rows)

Previously we had incremental sort on (t1.a, t2.b) with prt1.a already
presorted; now we sort both t1.a and t2.b after a merge right join. This
looks inefficient, or did I miss something?

Other tests looked fine to me.

Thank you for taking a look at this.

According to our cost model incremental sort has additional overhead
but gives huge wins on large row sets. The row sets here are very
small. You can see from [1] that the total cost became smaller.

Links.
1. /messages/by-id/CAPpHfdsn_mPy1v6Gf8rmdkBDsDLU+=J4M4sBzgaFv21cWruZFA@mail.gmail.com

------
Regards,
Alexander Korotkov
Supabase

#39

Richard Guo

guofenglinux@gmail.com

3 months ago

In reply to: Alexander Korotkov (#36)

Re: MergeAppend could consider sorting cheapest child path

On Thu, Oct 2, 2025 at 11:49 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Sep 8, 2025 at 11:39 AM Richard Guo <guofenglinux@gmail.com> wrote:

Will have a review by the end of this commitfest.

Did you manage to take a look at this patch?

Sorry, I haven't had a chance to review it yet, but it's on my to-do
list. I'll get to it as soon as I can.

- Richard

#40

Richard Guo

guofenglinux@gmail.com

3 months ago

In reply to: Richard Guo (#39)

Re: MergeAppend could consider sorting cheapest child path

On Fri, Oct 3, 2025 at 4:05 PM Richard Guo <guofenglinux@gmail.com> wrote:

Sorry, I haven't had a chance to review it yet, but it's on my to-do
list. I'll get to it as soon as I can.

I've had some time to review this patch, but I have a few concerns.
From the changes in the test cases, it seems that this patch
encourages using MergeAppend+Sort over Sort+Append. However, I'm not
sure MergeAppend+Sort is always more efficient than Sort+Append.
While it is in some cases, there are likely cases where it isn't.

I also noticed the hack you added to avoid using MergeAppend+Sort when
none of the chosen subpaths are ordered. It seems to me that this
contradicts the idea of this patch. If MergeAppend+Sort is indeed a
better plan, why wouldn't it apply in cases where no chosen subpaths
are ordered?

For example, imagine a table with 1000 child tables, where only one
child has a chosen subpath that is ordered, and the other 999 do not.
In this case, this patch would consider using MergeAppend+Sort, but I
don't think there's much practical difference between this case and
one where none of the chosen subpaths are ordered.

Moreover, I think this hack may cause us to miss some paths that the
current master is able to explore. When child pathkeys exist, the
master can generate MergeAppend paths. However, with the hack in
this patch, if none of the chosen subpaths for the child tables are
ordered, the MergeAppend paths will be missed. I think this is a
regression.

Regarding the code, for the newly added function
get_cheapest_path_for_pathkeys_ext(), I think it's a reasonable
expectation from the function name that the returned path satisfies
the given pathkeys. However, this function can return a path that is
not ordered according to those pathkeys, which I think is not a good
design choice.

Also, I'm not sure about this coding style:

+   if (path == NULL)
+
+       /*
+        * Current pathlist doesn't fit the pathkeys. No need to check extra
+        * sort path ways.
+        */
+       return base_path;

On one hand, I don't see this style often in our codebase. On the
other hand, I have noticed commits that try to fix this style by
adding braces (cf. commit aadf7db66). So I wonder if we can avoid
this style altogether from the start.
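
For illustration, the brace-wrapped layout could look like the sketch below. `Path` is a stand-in struct and `pick_path_sketch` is a hypothetical name, not the patch's function; only the comment text is taken from the quoted hunk. The sketch shows nothing more than how braces keep a multi-line comment and its single statement visibly inside the `if`.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the planner's Path node (assumption: only a cost field). */
typedef struct Path
{
    double      total_cost;
} Path;

/*
 * Hypothetical helper: return sorted_path when one exists, else base_path.
 * Only the layout of the early return matters here.
 */
Path *
pick_path_sketch(Path *sorted_path, Path *base_path)
{
    assert(base_path != NULL);

    if (sorted_path == NULL)
    {
        /*
         * Current pathlist doesn't fit the pathkeys. No need to check extra
         * sort path ways.
         */
        return base_path;
    }

    return sorted_path;
}
```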

The commit message states that "To arrange the cost model, change the
merge cost multiplier". However, I didn't find any related changes in
the patch. Am I missing something? Additionally, if you did change
some cost model multiplier, I think it's better to support this change
with benchmark results.

- Richard

#41

David Rowley

dgrowleyml@gmail.com

3 months ago

In reply to: Richard Guo (#40)

Re: MergeAppend could consider sorting cheapest child path

On Wed, 15 Oct 2025 at 19:45, Richard Guo <guofenglinux@gmail.com> wrote:

I also noticed the hack you added to avoid using MergeAppend+Sort when
none of the chosen subpaths are ordered. It seems to me that this
contradicts the idea of this patch. If MergeAppend+Sort is indeed a
better plan, why wouldn't it apply in cases where no chosen subpaths
are ordered?

FWIW, I've not really followed this closely, but from the parts I have
read it seems the patch could cause a Sort -> unsorted path to be used
over a path that's already correctly sorted. This reminds me of a
patch I proposed in [1] and then subsequently decided it was a bad
idea in [2] because of concerns of having too many Sorts in a single
plan. Sort only calls tuplesort_end() at executor shutdown, so that
means possibly using up to work_mem per sort node. If you have 1000x
Sort nodes, then that's up to 1000x work_mem. Since the planner
doesn't have any abilities to consider the overall memory consumption,
I thought it was a bad idea due to increased OOM risk. If I'm not
mistaken it looks like this could suffer from the same problem.
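
The arithmetic behind this concern can be sketched as follows. The function name is hypothetical, and the bound assumes the worst case described above: every Sort node holds up to work_mem at the same time. It is an upper bound, not a prediction of actual usage.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Upper bound, in kilobytes, on memory held by Sort nodes in one plan,
 * assuming each node keeps up to work_mem_kb allocated until executor
 * shutdown. The name and signature are illustrative only.
 */
uint64_t
worst_case_sort_mem_kb(uint32_t n_sort_nodes, uint32_t work_mem_kb)
{
    /* Widen before multiplying so large inputs cannot overflow 32 bits. */
    return (uint64_t) n_sort_nodes * (uint64_t) work_mem_kb;
}
```

With the default work_mem of 4MB (4096 kB), 1000 Sort nodes give a bound of 4,096,000 kB, roughly 3.9 GB, for a single query.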

David

[1]: /messages/by-id/CAApHDvojKdBR3MR59JXmaCYbyHB6Q_5qPRU+dy93En8wm+XiDA@mail.gmail.com
[2]: /messages/by-id/CAApHDvohAZLQSW4AiHUKmLGNuHYbi0pves+9_9ik3cAYevc2GQ@mail.gmail.com

#42

Andrei Lepikhov

lepihov@gmail.com

3 months ago

In reply to: David Rowley (#41)

Re: MergeAppend could consider sorting cheapest child path

On 15/10/2025 09:59, David Rowley wrote:

On Wed, 15 Oct 2025 at 19:45, Richard Guo <guofenglinux@gmail.com> wrote:

I also noticed the hack you added to avoid using MergeAppend+Sort when
none of the chosen subpaths are ordered. It seems to me that this
contradicts the idea of this patch. If MergeAppend+Sort is indeed a
better plan, why wouldn't it apply in cases where no chosen subpaths
are ordered?

FWIW, I've not really followed this closely, but from the parts I have
read it seems the patch could cause a Sort -> unsorted path to be used
over a path that's already correctly sorted. This reminds me of a
patch I proposed in [1] and then subsequently decided it was a bad
idea in [2] because of concerns of having too many Sorts in a single
plan. Sort only calls tuplesort_end() at executor shutdown, so that
means possibly using up to work_mem per sort node. If you have 1000x
Sort nodes, then that's up to 1000x work_mem. Since the planner
doesn't have any abilities to consider the overall memory consumption,
I thought it was a bad idea due to increased OOM risk. If I'm not
mistaken it looks like this could suffer from the same problem.

Thanks for your feedback!
This patch originated from the practice of how table partitioning can
severely impact query execution. It is a rare case in our experience
when all partitions have symmetrical indexes, especially those located
remotely. People adopt a set of indexes according to the current load
profile on hot partitions. In fact, it is a typical case when a
timestamp orders partitions, and most of them are rarely updated.

So, the goal is to use MergeAppend when only a few partitions lack a
proper index.
The concern about memory consumption makes sense, of course. However, we
choose to sort based on cost estimations that usually work when the
optimiser decides between fetching many tuples during an Index Scan,
compared to only a few tuples to fetch with subsequent sorting.
Additionally, scan estimation typically yields good predictions
(compared to JOIN), and I personally estimate the OOM risk to be low.

Additionally, this patch revealed an issue with the cost model: there is
no significant difference between a single massive Sort and multiple
sorts followed by MergeAppend. Our experiments show that it is incorrect
(one Sort operator demonstrates more efficacy) and may be corrected.
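
A back-of-the-envelope comparison count suggests why a model built mainly on comparisons would rate the two strategies about equally. This is a simplification and not PostgreSQL's actual cost functions: it assumes N tuples split evenly over k children, counts only comparisons, and ignores constant factors, memory effects and spilling.

```latex
% Comparison counts only; N tuples split evenly across k children (assumption).
\begin{align*}
C_{\text{single sort}} &= N \log_2 N \\
C_{\text{child sorts}} &= k \cdot \frac{N}{k} \log_2 \frac{N}{k}
                        = N \log_2 N - N \log_2 k \\
C_{\text{merge}}       &= N \log_2 k \\
C_{\text{child sorts}} + C_{\text{merge}} &= N \log_2 N = C_{\text{single sort}}
\end{align*}
```

Under these assumptions the totals coincide, so any real difference between one large Sort and many Sorts under MergeAppend has to come from effects this count leaves out.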

--
regards, Andrei Lepikhov,
pgEdge

#43

David Rowley

dgrowleyml@gmail.com

3 months ago

In reply to: Andrei Lepikhov (#42)

Re: MergeAppend could consider sorting cheapest child path

On Wed, 15 Oct 2025 at 22:26, Andrei Lepikhov <lepihov@gmail.com> wrote:

This patch originated from the practice of how table partitioning can
severely impact query execution. It is a rare case in our experience
when all partitions have symmetrical indexes, especially those located
remotely. People adopt a set of indexes according to the current load
profile on hot partitions. In fact, it is a typical case when a
timestamp orders partitions, and most of them are rarely updated.

So, the goal is to use MergeAppend when only a few partitions lack a
proper index.

hmm... doesn't that already work?

create table hp (a int ) partition by hash(a);
create table hp0 partition of hp for values with(modulus 2, remainder 0);
create table hp1 partition of hp for values with(modulus 2, remainder 1);
create index on hp0(a);

explain (costs off) select * from hp order by a;

 Merge Append
   Sort Key: hp.a
   ->  Index Only Scan using hp0_a_idx on hp0 hp_1
   ->  Sort
         Sort Key: hp_2.a
         ->  Seq Scan on hp1 hp_2

Or is this a case where you want to also consider Seq Scan on hp0 ->
Sort if it's cheaper than Index Scan on hp0_a_idx just in case that's
enough to make Merge Append cheap enough to beat Append -> Sort?

The concern about memory consumption makes sense, of course. However, we
choose to sort based on cost estimations that usually work when the
optimiser decides between fetching many tuples during an Index Scan,
compared to only a few tuples to fetch with subsequent sorting.
Additionally, scan estimation typically yields good predictions
(compared to JOIN), and I personally estimate the OOM risk to be low.

Additionally, this patch revealed an issue with the cost model: there is
no significant difference between a single massive Sort and multiple
sorts followed by MergeAppend. Our experiments show that it is incorrect
(one Sort operator demonstrates more efficacy) and may be corrected.

Do you mean "no significant difference [in the costings] between"?

Not sure if I follow you here. You've said "one Sort operator
demonstrates more efficacy", do you mean Sort atop of Append is
better? If so, why does the patch try to encourage plans with Merge
Append with many Sorts?

David

#44

Andrei Lepikhov

lepihov@gmail.com

2 months ago

In reply to: David Rowley (#43)

Re: MergeAppend could consider sorting cheapest child path

On 15/10/2025 14:35, David Rowley wrote:

On Wed, 15 Oct 2025 at 22:26, Andrei Lepikhov <lepihov@gmail.com> wrote:
Or is this a case where you want to also consider Seq Scan on hp0 ->
Sort if it's cheaper than Index Scan on hp0_a_idx just in case that's
enough to make Merge Append cheap enough to beat Append -> Sort?

I spent some time reviewing the original user complaints. However, after
switching employers, I no longer have direct access to the reports :(((
- it was the main benefit of working for a company that handles massive
migrations from Oracle and SQL Server.
I recall the problem being raised with multiple foreign partitions, where
MergeAppend by X is a preferable strategy (due to the need for ORDER BY
X, or MergeJoin, etc). For some partitions, IndexScan(X) fetches too
many tuples from disk. In this case, IndexScan(Y) + Sort (X) drastically
improves the situation. That's why we proposed to look into the
cheaper_total path + sort, not only the path that fits pathkeys.
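
The per-child choice described here can be sketched as a simple cost comparison. The struct and function below are hypothetical and are not taken from the patch; the three costs are assumed to come from the planner's existing estimates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-child cost summary; not a PostgreSQL structure. */
typedef struct ChildCosts
{
    double      ordered_path_cost;      /* e.g. IndexScan(X), already sorted by X */
    double      cheapest_total_cost;    /* e.g. IndexScan(Y), not sorted by X */
    double      sort_cost;              /* estimated cost of sorting that output by X */
} ChildCosts;

/* Return true when the cheapest total path plus a Sort beats the presorted path. */
bool
prefer_explicit_sort(const ChildCosts *c)
{
    assert(c != NULL);

    return c->cheapest_total_cost + c->sort_cost < c->ordered_path_cost;
}
```

A child whose IndexScan(X) reads far more tuples than IndexScan(Y) returns would take the explicit-sort branch, while the other children keep their presorted paths under the same MergeAppend.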

Additionally, this patch revealed an issue with the cost model: there is
no significant difference between a single massive Sort and multiple
sorts followed by MergeAppend. Our experiments show that it is incorrect
(one Sort operator demonstrates more efficacy) and may be corrected.

Do you mean "no significant difference [in the costings] between"?

Yes.

Not sure if I follow you here. You've said "one Sort operator
demonstrates more efficacy", do you mean Sort atop of Append is
better? If so, why does the patch try to encourage plans with Merge
Append with many Sorts?

Sort-Append is definitely better than MergeAppend-IndexScan.
This patch just reveals that the current cost model doesn't
differentiate between these two strategies. In the corner case it
triggers a suboptimal plan.

--
regards, Andrei Lepikhov,
pgEdge