Support loser tree for k-way merge

Started by cca5507, about 1 month ago, 12 messages
#1 cca5507
cca5507@qq.com
1 attachment(s)

Hi hackers,

Background
==========
Currently we use a 'heap' during the k-way merge, which is O(n log k). The 'loser tree' is also O(n log k), but
it usually needs fewer comparisons than the 'heap'. When the tuple comparator is complex, the
'loser tree' can significantly speed up the k-way merge.
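
To make the comparison-count difference concrete, here is a minimal sketch of the loser tree's "replay"
step using plain integer keys (purely illustrative, not the attached patch): after the winner's slot is
refilled with the next element from its run, we walk from that leaf to the root and do exactly one
comparison per level.

```
/*
 * keys[0..k-1] hold the current front element of each run; losers[1..k-1]
 * store the loser of each internal match and losers[0] the overall winner.
 * Leaf i conceptually sits at tree position i + k, so its parent is
 * (i + k) / 2.  One key comparison per level, no early exit.
 */
static void
loser_tree_replay(int *losers, const int *keys, int k)
{
	int		winner = losers[0];		/* the slot that was just refilled */
	int		node;

	for (node = (winner + k) / 2; node > 0; node /= 2)
	{
		if (keys[losers[node]] < keys[winner])
		{
			/* the stored loser beats the candidate: swap roles */
			int		tmp = losers[node];

			losers[node] = winner;
			winner = tmp;
		}
	}
	losers[0] = winner;				/* new overall winner */
}
```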

Test
====
With the WIP patch (v1-0001), I got a 3% ~ 13% speedup (depending on work_mem) in the following test case:

SET max_parallel_workers_per_gather = 0;
CREATE UNLOGGED TABLE t AS SELECT generate_series(1, 20000000) AS a, md5(random()::text) AS b;
create extension if not exists pg_prewarm;
select pg_prewarm('t');

SET enable_loser_tree = OFF;
# SET work_mem = '4MB'; ('8MB' '16MB' '32MB' '64MB' ...)
explain analyze select * from t order by b;

SET enable_loser_tree = ON;
# SET work_mem = '4MB'; ('8MB' '16MB' '32MB' '64MB' ...)
explain analyze select * from t order by b;

Open questions
==============
1) For now I added a GUC 'enable_loser_tree' to control the use of the loser tree. Maybe we should
instead decide whether to use the 'loser tree' based on the value of 'k' or the complexity of the
tuple comparators, or just always use the 'loser tree'?

Looking forward to your replies and comments.

--
Regards,
ChangAo Chen

Attachments:

v1-0001-Support-loser-tree-for-k-way-merge.patch (application/octet-stream)
From 2981406dbf5bb738273b3ff4c26854ce7d715dd1 Mon Sep 17 00:00:00 2001
From: ChangAo Chen <cca5507@qq.com>
Date: Wed, 3 Dec 2025 18:58:28 +0800
Subject: [PATCH v1] Support loser tree for k-way merge.

The loser tree usually has fewer comparisons than the heap during
k-way merge.
---
 src/backend/utils/misc/guc_parameters.dat |   7 +
 src/backend/utils/sort/tuplesort.c        | 160 +++++++++++++++++++++-
 src/include/utils/guc.h                   |   1 +
 3 files changed, 167 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 3b9d8349078..123ece8ea75 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -882,6 +882,13 @@
   boot_val => 'true',
 },
 
+{ name => 'enable_loser_tree', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
+  short_desc => 'Enables loser tree for k-way merge.',
+  flags => 'GUC_NOT_IN_SAMPLE',
+  variable => 'enable_loser_tree',
+  boot_val => 'false',
+},
+
 { name => 'enable_material', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
   short_desc => 'Enables the planner\'s use of materialization.',
   flags => 'GUC_EXPLAIN',
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 5d4411dc33f..381f5909094 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -122,6 +122,9 @@
 
 /* GUC variables */
 bool		trace_sort = false;
+bool        enable_loser_tree = false;
+
+#define LOSER_TREE_EOF -1
 
 #ifdef DEBUG_BOUNDED_SORT
 bool		optimize_bounded_sort = true;
@@ -219,6 +222,9 @@ struct Tuplesortstate
 	int			memtupsize;		/* allocated length of memtuples array */
 	bool		growmemtuples;	/* memtuples' growth still underway? */
 
+	bool        useLoserTree;   /* use loser tree for k-way merge if true */
+	int        *losers;         /* array of losers, losers[0] is the winner */
+
 	/*
 	 * Memory for tuples is sometimes allocated using a simple slab allocator,
 	 * rather than with palloc().  Currently, we switch to slab allocation
@@ -468,6 +474,8 @@ static void tuplesort_sort_memtuples(Tuplesortstate *state);
 static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple);
 static void tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple);
 static void tuplesort_heap_delete_top(Tuplesortstate *state);
+static void tuplesort_loser_tree_build(Tuplesortstate *state);
+static void tuplesort_loser_tree_adjust(Tuplesortstate *state, int start);
 static void reversedirection(Tuplesortstate *state);
 static unsigned int getlen(LogicalTape *tape, bool eofOK);
 static void markrunend(LogicalTape *tape);
@@ -681,6 +689,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate, int sortopt)
 	state->base.sortopt = sortopt;
 	state->base.tuples = true;
 	state->abbrevNext = 10;
+	state->useLoserTree = enable_loser_tree;
+	state->losers = NULL;
 
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
@@ -1647,6 +1657,37 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 				state->lastReturnedTuple = NULL;
 			}
 
+			if (state->useLoserTree && state->memtupcount > 0)
+			{
+				int          i = state->losers[0];
+				int          start = i + state->memtupcount;
+				int			 srcTapeIndex = state->memtuples[i].srctape;
+				LogicalTape *srcTape = state->inputTapes[srcTapeIndex];
+				SortTuple	 newtup;
+
+				*stup = state->memtuples[i];
+
+				state->lastReturnedTuple = stup->tuple;
+
+				if (mergereadnext(state, srcTape, &newtup))
+				{
+					newtup.srctape = srcTapeIndex;
+					state->memtuples[i] = newtup;
+				}
+				else
+				{
+					state->losers[start] = LOSER_TREE_EOF;
+					state->nInputRuns--;
+					LogicalTapeClose(srcTape);
+				}
+				tuplesort_loser_tree_adjust(state, start);
+
+				if (state->losers[0] == LOSER_TREE_EOF)
+					state->memtupcount = 0;
+
+				return true;
+			}
+
 			/*
 			 * This code should match the inner loop of mergeonerun().
 			 */
@@ -2206,6 +2247,41 @@ mergeonerun(Tuplesortstate *state)
 
 	Assert(state->slabAllocatorUsed);
 
+	if (state->useLoserTree && state->memtupcount > 0)
+	{
+		while (state->losers[0] != LOSER_TREE_EOF)
+		{
+			int         i = state->losers[0];
+			int         start = i + state->memtupcount;
+			SortTuple	stup;
+
+			/* write the tuple to destTape */
+			srcTapeIndex = state->memtuples[i].srctape;
+			srcTape = state->inputTapes[srcTapeIndex];
+			WRITETUP(state, state->destTape, &state->memtuples[i]);
+
+			/* recycle the slot of the tuple we just wrote out, for the next read */
+			if (state->memtuples[i].tuple)
+				RELEASE_SLAB_SLOT(state, state->memtuples[i].tuple);
+
+			if (mergereadnext(state, srcTape, &stup))
+			{
+				stup.srctape = srcTapeIndex;
+				state->memtuples[i] = stup;
+			}
+			else
+			{
+				state->losers[start] = LOSER_TREE_EOF;
+				state->nInputRuns--;
+			}
+
+			tuplesort_loser_tree_adjust(state, start);
+		}
+		state->memtupcount = 0;
+		markrunend(state->destTape);
+		return;
+	}
+
 	/*
 	 * Execute merge by repeatedly extracting lowest tuple in heap, writing it
 	 * out, and replacing it with next tuple from same tape (if there is
@@ -2270,9 +2346,16 @@ beginmerge(Tuplesortstate *state)
 		if (mergereadnext(state, state->inputTapes[srcTapeIndex], &tup))
 		{
 			tup.srctape = srcTapeIndex;
-			tuplesort_heap_insert(state, &tup);
+
+			if (state->useLoserTree)
+				state->memtuples[state->memtupcount++] = tup;
+			else
+				tuplesort_heap_insert(state, &tup);
 		}
 	}
+
+	if (state->useLoserTree && state->memtupcount > 0)
+		tuplesort_loser_tree_build(state);
 }
 
 /*
@@ -2823,6 +2906,81 @@ tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple)
 	memtuples[i] = *tuple;
 }
 
+static void
+tuplesort_loser_tree_build(Tuplesortstate *state)
+{
+	int k = state->memtupcount; /* k-way */
+	int winner[MAXORDER * 2];
+	int i;
+
+	Assert(k > 0 && k <= MAXORDER);
+
+	if (state->losers == NULL)
+		state->losers = (int *) MemoryContextAlloc(state->base.maincontext,
+												   MAXORDER * 2 * sizeof(int));
+
+	for (i = k; i < 2 * k; i++)
+	{
+		winner[i] = i - k;
+		state->losers[i] = i - k;
+	}
+
+	for (i = k - 1; i >= 1; i--)
+	{
+		int l = i * 2;
+		int r = l + 1;
+
+		if (COMPARETUP(state,
+					   &state->memtuples[winner[l]],
+					   &state->memtuples[winner[r]]) < 0)
+		{
+			winner[i] = winner[l];
+			state->losers[i] = winner[r];
+		}
+		else
+		{
+			winner[i] = winner[r];
+			state->losers[i] = winner[l];
+		}
+	}
+
+	/* set the final winner to state->losers[0] */
+	state->losers[0] = winner[1];
+}
+
+static void
+tuplesort_loser_tree_adjust(Tuplesortstate *state, int start)
+{
+	int winner;
+	int parent;
+
+	Assert(state->losers != NULL);
+	Assert(start >= state->memtupcount && start < state->memtupcount * 2);
+
+	winner = state->losers[start];
+
+	for (parent = start / 2; parent > 0; parent /= 2)
+	{
+		int loser = state->losers[parent];
+
+		/* eof is always the loser */
+		if (loser == LOSER_TREE_EOF)
+			continue;
+
+		if (winner == LOSER_TREE_EOF ||
+			COMPARETUP(state,
+					   &state->memtuples[winner],
+					   &state->memtuples[loser]) > 0)
+		{
+			state->losers[parent] = winner;
+			winner = loser;
+		}
+	}
+
+	/* set the final winner to state->losers[0] */
+	state->losers[0] = winner;
+}
+
 /*
  * Function to reverse the sort direction from its current state
  *
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index f21ec37da89..8c22eed716b 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -324,6 +324,7 @@ extern PGDLLIMPORT int tcp_user_timeout;
 extern PGDLLIMPORT char *role_string;
 extern PGDLLIMPORT bool in_hot_standby_guc;
 extern PGDLLIMPORT bool trace_sort;
+extern PGDLLIMPORT bool enable_loser_tree;
 
 #ifdef DEBUG_BOUNDED_SORT
 extern PGDLLIMPORT bool optimize_bounded_sort;
-- 
2.34.1

#2 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: cca5507 (#1)
Re: Support loser tree for k-way merge

On 03/12/2025 13:48, cca5507 wrote:

With the WIP patch (v1-0001), I got a 3% ~ 13% speedup (depending on work_mem) in the following test case:

Nice speedup!

1) For now I added a GUC 'enable_loser_tree' to control the use of the loser tree. Maybe we should
instead decide whether to use the 'loser tree' based on the value of 'k' or the complexity of the
tuple comparators, or just always use the 'loser tree'?

What is the worst case scenario for the loser tree, where the heap is
faster? How big is the difference?

- Heikki

#3 cca5507
cca5507@qq.com
In reply to: Heikki Linnakangas (#2)
Re: Support loser tree for k-way merge

Hi Heikki,

What is the worst case scenario for the loser tree, where the heap is 
faster? How big is the difference?

In tuplesort_heap_replace_top(), there are 2 comparisons per level, but it can return early
if the parent is less than both children.

In tuplesort_loser_tree_adjust(), there is 1 comparison per level, but it can't return early.

So on specific data, the heap may be better than the loser tree. But I think the possibility
is very small.
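
For reference, a corresponding sketch of the heap's replace-top step with integer keys (again
illustrative, not the actual tuplesort code), showing the up-to-two comparisons per level and the
early return:

```
/*
 * Replace the top of a binary min-heap of n keys and sift it down.
 * Each level costs up to two key comparisons: one to pick the smaller
 * child, one to compare it with the new key.  The loop can return
 * early as soon as the new key is no larger than the smaller child.
 */
static void
heap_replace_top(int *heap, int n, int newkey)
{
	int		i = 0;

	for (;;)
	{
		int		child = 2 * i + 1;

		if (child >= n)
			break;
		if (child + 1 < n && heap[child + 1] < heap[child])
			child++;				/* comparison 1: pick the smaller child */
		if (newkey <= heap[child])
			break;					/* comparison 2: early return */
		heap[i] = heap[child];		/* pull the child up and keep sifting */
		i = child;
	}
	heap[i] = newkey;
}
```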

--
Regards,
ChangAo Chen

#4 cca5507
cca5507@qq.com
In reply to: cca5507 (#3)
Re: Support loser tree for k-way merge

So on specific data, the heap may be better than the loser tree. But I think the possibility
is very small.

For example, if a column contains all the same values, the heap can always return early
when replacing the top.

--
Regards,
ChangAo Chen

#5 Sami Imseih
samimseih@gmail.com
In reply to: cca5507 (#4)
Re: Support loser tree for k-way merge

Hi,

Thanks for raising this.

Currently we use a 'heap' during the k-way merge, which is O(n log k). The 'loser tree' is also O(n log k), but
it usually needs fewer comparisons than the 'heap'. When the tuple comparator is complex, the
'loser tree' can significantly speed up the k-way merge.

I was playing around with this and I also see the performance benefits that you
mention with the loser tree.

```
DROP TABLE IF EXISTS t;
CREATE TABLE t (
    col1 INT,
    col2 INT
);

INSERT INTO t (col1, col2)
SELECT
    (i % 10000000),
    (i % 100)
FROM generate_series(1, 10000000) AS s(i);
```

Using something like the above to test an "external merge": on the high-cardinality column
there is a bit of an improvement with the loser tree, but there is a bit of a loss on the
low-cardinality one.

In general though, I see a bit of an issue with allowing a GUC to
change strategies. What happens if there are multiple "external merge"
nodes and they can each benefit from different strategies? Maybe that's
not an issue since all the other enable_ GUCs have that problem, but
it is something to consider.

Can we drive the decision for what to do based on optimizer
stats, i.e. n_distinct and row counts? Not sure what the calculation would
be specifically, but something else to consider.

We can still provide the GUC to override the optimizer decisions,
but at least the optimizer, given up-to-date stats, may get it right most
of the time.

--
Sami Imseih
Amazon Web Services (AWS)

#6 cca5507
cca5507@qq.com
In reply to: Sami Imseih (#5)
Re: Support loser tree for k-way merge

Hi,

Thank you for your reply.

Can we drive the decision for what to do based on optimizer
stats, i.e. n_distinct and row counts? Not sure what the calculation would
be specifically, but something else to consider.

We can still provide the GUC to  override the optimizer decisions,
but at least the optimizer, given up-to-date stats, may get it right most
of the time.

That makes sense to me.

TODO
====
1) Consider optimizer statistics when deciding whether to use the heap or the loser tree.
2) Do we need a USEMEM() call for the array of losers?
3) The losers array currently has length MAXORDER * 2, but MAXORDER is actually enough; this needs some code refactoring. (Is it worth doing?)
4) Add more code comments and docs.

Help is welcome!

--
Regards,
ChangAo Chen

#7 John Naylor
johncnaylorls@gmail.com
In reply to: Sami Imseih (#5)
Re: Support loser tree for k-way merge

On Thu, Dec 4, 2025 at 1:14 AM Sami Imseih <samimseih@gmail.com> wrote:

Can we drive the decision for what to do based on optimizer
stats, i.e. n_distinct and row counts? Not sure what the calculation would
be specifically, but something else to consider.

It's happened multiple times before that someone proposes a change
that makes sorting faster on some inputs, but turns out to regress on
low cardinality (I've done it myself). It seems to be pretty hard not
to regress that case. Occasionally the author proposes to take
optimizer stats into account, and that has been rejected because
cardinality stats are often wildly wrong.

Further, underestimation is far more common than overestimation, in
which case IIUC the planner would just continue to choose the existing
heap method.

We can still provide the GUC to override the optimizer decisions,
but at least the optimizer, given up-to-date stats, may get it right most
of the time.

I don't have much faith that people will properly set a GUC whose
effects depend on the input characteristics and memory settings.

The new method might be a better overall trade-off, but we'd need some
more comprehensive measurements to know what we're dealing with.

--
John Naylor
Amazon Web Services

#8 cca5507
cca5507@qq.com
In reply to: John Naylor (#7)
Re: Support loser tree for k-way merge

Hi,

I summarized the number of comparisons needed for different 'k':

k = 2, heap: 1, loser tree: 1
k = 3, heap: 2, loser tree: [1, 2]
k = 4, heap: [2, 3], loser tree: 2
k = 5, heap: [2, 4], loser tree: [2, 3]

So if k < 5, the loser tree is never worse than the heap for any input data.

Thoughts?

--
Regards,
ChangAo Chen

#9 cca5507
cca5507@qq.com
In reply to: cca5507 (#8)
2 attachment(s)
Re: Support loser tree for k-way merge

Hi,

Can we support the loser tree first and set the default value of enable_loser_tree to off?

Anyway, I attach 2 patches (they pass check-world).

v2-0001: add tuplesort_heap_build(), which can build a valid heap in O(n), better
than the current O(n log n). This also makes beginmerge() clearer and simpler when
supporting the loser tree.

v2-0002: support the loser tree.

--
Regards,
ChangAo Chen

Attachments:

v2-0001-Support-tuplesort_heap_build-in-tuplesort.c.patch (application/octet-stream)
From bef514a687bcdc6f13b84ded17d62ab3cf1ff848 Mon Sep 17 00:00:00 2001
From: ChangAo Chen <cca5507@qq.com>
Date: Fri, 5 Dec 2025 14:37:21 +0800
Subject: [PATCH v2 1/2] Support tuplesort_heap_build() in tuplesort.c.

Build a valid heap with n tuples by using tuplesort_heap_insert() is
O(n log n), we can use tuplesort_heap_build() to optimize it, which
is O(n).
---
 src/backend/utils/sort/tuplesort.c | 103 +++++++++++++++--------------
 1 file changed, 53 insertions(+), 50 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 5d4411dc33f..e28b8cfa868 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -465,7 +465,7 @@ static void dumptuples(Tuplesortstate *state, bool alltuples);
 static void make_bounded_heap(Tuplesortstate *state);
 static void sort_bounded_heap(Tuplesortstate *state);
 static void tuplesort_sort_memtuples(Tuplesortstate *state);
-static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple);
+static void tuplesort_heap_build(Tuplesortstate *state);
 static void tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple);
 static void tuplesort_heap_delete_top(Tuplesortstate *state);
 static void reversedirection(Tuplesortstate *state);
@@ -2269,10 +2269,14 @@ beginmerge(Tuplesortstate *state)
 
 		if (mergereadnext(state, state->inputTapes[srcTapeIndex], &tup))
 		{
+			Assert(state->memtupcount < state->memtupsize);
+			CHECK_FOR_INTERRUPTS();
 			tup.srctape = srcTapeIndex;
-			tuplesort_heap_insert(state, &tup);
+			state->memtuples[state->memtupcount++] = tup;
 		}
 	}
+
+	tuplesort_heap_build(state);
 }
 
 /*
@@ -2593,32 +2597,25 @@ make_bounded_heap(Tuplesortstate *state)
 	/* Reverse sort direction so largest entry will be at root */
 	reversedirection(state);
 
-	state->memtupcount = 0;		/* make the heap empty */
-	for (i = 0; i < tupcount; i++)
+	/* Build a valid heap from the first "state->bound" tuples */
+	state->memtupcount = state->bound;
+	tuplesort_heap_build(state);
+
+	/* Process the remaining tuples */
+	for (i = state->bound; i < tupcount; i++)
 	{
-		if (state->memtupcount < state->bound)
+		/*
+		 * The heap is full.  Replace the largest entry with the new
+		 * tuple, or just discard it, if it's larger than anything already
+		 * in the heap.
+		 */
+		if (COMPARETUP(state, &state->memtuples[i], &state->memtuples[0]) <= 0)
 		{
-			/* Insert next tuple into heap */
-			/* Must copy source tuple to avoid possible overwrite */
-			SortTuple	stup = state->memtuples[i];
-
-			tuplesort_heap_insert(state, &stup);
+			free_sort_tuple(state, &state->memtuples[i]);
+			CHECK_FOR_INTERRUPTS();
 		}
 		else
-		{
-			/*
-			 * The heap is full.  Replace the largest entry with the new
-			 * tuple, or just discard it, if it's larger than anything already
-			 * in the heap.
-			 */
-			if (COMPARETUP(state, &state->memtuples[i], &state->memtuples[0]) <= 0)
-			{
-				free_sort_tuple(state, &state->memtuples[i]);
-				CHECK_FOR_INTERRUPTS();
-			}
-			else
-				tuplesort_heap_replace_top(state, &state->memtuples[i]);
-		}
+			tuplesort_heap_replace_top(state, &state->memtuples[i]);
 	}
 
 	Assert(state->memtupcount == state->bound);
@@ -2721,40 +2718,46 @@ tuplesort_sort_memtuples(Tuplesortstate *state)
 }
 
 /*
- * Insert a new tuple into an empty or existing heap, maintaining the
- * heap invariant.  Caller is responsible for ensuring there's room.
- *
- * Note: For some callers, tuple points to a memtuples[] entry above the
- * end of the heap.  This is safe as long as it's not immediately adjacent
- * to the end of the heap (ie, in the [memtupcount] array entry) --- if it
- * is, it might get overwritten before being moved into the heap!
+ * Build a valid heap from memtuples[].
  */
 static void
-tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple)
+tuplesort_heap_build(Tuplesortstate *state)
 {
-	SortTuple  *memtuples;
-	int			j;
-
-	memtuples = state->memtuples;
-	Assert(state->memtupcount < state->memtupsize);
-
-	CHECK_FOR_INTERRUPTS();
+	SortTuple   *memtuples = state->memtuples;
+	SortTuple	 stup;
+	int          hole;
+	unsigned int i, j, n;
 
 	/*
-	 * Sift-up the new entry, per Knuth 5.2.3 exercise 16. Note that Knuth is
-	 * using 1-based array indexes, not 0-based.
+	 * state->memtupcount is "int", but we use "unsigned int" for i, j, n.
+	 * This prevents overflow in the "2 * i + 1" calculation, since at the top
+	 * of the loop we must have i < n <= INT_MAX <= UINT_MAX/2.
 	 */
-	j = state->memtupcount++;
-	while (j > 0)
+	n = state->memtupcount;
+	for (hole = (state->memtupcount - 2) / 2; hole >= 0; hole--)
 	{
-		int			i = (j - 1) >> 1;
-
-		if (COMPARETUP(state, tuple, &memtuples[i]) >= 0)
-			break;
-		memtuples[j] = memtuples[i];
-		j = i;
+		CHECK_FOR_INTERRUPTS();
+		/* i is where the "hole" is */
+		i = hole;
+		/* copy memtuples[i] because it will get overwritten below */
+		stup = memtuples[i];
+		/* sift down */
+		for (;;)
+		{
+			j = 2 * i + 1;
+
+			if (j >= n)
+				break;
+			if (j + 1 < n &&
+				COMPARETUP(state, &memtuples[j], &memtuples[j + 1]) > 0)
+				j++;
+			if (COMPARETUP(state, &stup, &memtuples[j]) <= 0)
+				break;
+			memtuples[i] = memtuples[j];
+			i = j;
+		}
+		memtuples[i] = stup;
 	}
-	memtuples[j] = *tuple;
 }
 
 /*
-- 
2.52.0

v2-0002-Support-loser-tree-for-k-way-merge.patch (application/octet-stream)
From 9dbf07f28d257bd6aef8bacb38076f0da9fc8a45 Mon Sep 17 00:00:00 2001
From: ChangAo Chen <cca5507@qq.com>
Date: Fri, 5 Dec 2025 23:48:05 +0800
Subject: [PATCH v2 2/2] Support loser tree for k-way merge.

The loser tree usually has fewer comparisons than the heap during
k-way merge. But the heap maybe better if the cardinality of the
input data is very low. Add a GUC 'enable_loser_tree' to control
the use of loser tree, default to off.
---
 doc/src/sgml/config.sgml                      |  14 ++
 src/backend/utils/misc/guc_parameters.dat     |   7 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/backend/utils/sort/tuplesort.c            | 218 ++++++++++++++----
 src/include/utils/guc.h                       |   1 +
 src/test/regress/expected/sysviews.out        |   3 +-
 6 files changed, 202 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 405c9689bd0..031591bb820 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5617,6 +5617,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-loser-tree" xreflabel="enable_loser_tree">
+      <term><varname>enable_loser_tree</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_loser_tree</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the external k-way merge's use of loser tree.
+        The default is <literal>off</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-material" xreflabel="enable_material">
       <term><varname>enable_material</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 3b9d8349078..cbece911495 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -882,6 +882,13 @@
   boot_val => 'true',
 },
 
+{ name => 'enable_loser_tree', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
+  short_desc => 'Enables loser tree for external k-way merge.',
+  flags => 'GUC_EXPLAIN',
+  variable => 'enable_loser_tree',
+  boot_val => 'false',
+},
+
 { name => 'enable_material', type => 'bool', context => 'PGC_USERSET', group => 'QUERY_TUNING_METHOD',
   short_desc => 'Enables the planner\'s use of materialization.',
   flags => 'GUC_EXPLAIN',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index dc9e2255f8a..2ffba918a43 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -412,6 +412,7 @@
 #enable_incremental_sort = on
 #enable_indexscan = on
 #enable_indexonlyscan = on
+#enable_loser_tree = off
 #enable_material = on
 #enable_memoize = on
 #enable_mergejoin = on
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e28b8cfa868..57c3d0266dc 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -42,16 +42,16 @@
  * end of the input is reached, we dump out remaining tuples in memory into
  * a final run, then merge the runs.
  *
- * When merging runs, we use a heap containing just the frontmost tuple from
- * each source run; we repeatedly output the smallest tuple and replace it
- * with the next tuple from its source tape (if any).  When the heap empties,
- * the merge is complete.  The basic merge algorithm thus needs very little
- * memory --- only M tuples for an M-way merge, and M is constrained to a
- * small number.  However, we can still make good use of our full workMem
- * allocation by pre-reading additional blocks from each source tape.  Without
- * prereading, our access pattern to the temporary file would be very erratic;
- * on average we'd read one block from each of M source tapes during the same
- * time that we're writing M blocks to the output tape, so there is no
+ * When merging runs, we use a heap or loser tree containing just the frontmost
+ * tuple from each source run; we repeatedly output the smallest tuple and
+ * replace it with the next tuple from its source tape (if any).  When the heap
+ * or loser tree empties, the merge is complete.  The basic merge algorithm thus
+ * needs very little memory --- only M tuples for an M-way merge, and M is
+ * constrained to a small number.  However, we can still make good use of our
+ * full workMem allocation by pre-reading additional blocks from each source tape.
+ * Without prereading, our access pattern to the temporary file would be very
+ * erratic; on average we'd read one block from each of M source tapes during
+ * the same time that we're writing M blocks to the output tape, so there is no
  * sequentiality of access at all, defeating the read-ahead methods used by
  * most Unix kernels.  Worse, the output tape gets written into a very random
  * sequence of blocks of the temp file, ensuring that things will be even
@@ -120,8 +120,12 @@
 #define INITIAL_MEMTUPSIZE Max(1024, \
 	ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
 
+/* LOSER_TREE_EOF is always a loser in comparison */
+#define LOSER_TREE_EOF -1
+
 /* GUC variables */
 bool		trace_sort = false;
+bool        enable_loser_tree = false;
 
 #ifdef DEBUG_BOUNDED_SORT
 bool		optimize_bounded_sort = true;
@@ -206,6 +210,8 @@ struct Tuplesortstate
 								 * space */
 	TupSortStatus maxSpaceStatus;	/* sort status when maxSpace was reached */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
+	bool        useLoserTree;   /* use loser tree for k-way merge if true */
+	int16      *losers;         /* array of losers, losers[0] is the winner */
 
 	/*
 	 * This array holds the tuples now in sort memory.  If we are in state
@@ -468,6 +474,8 @@ static void tuplesort_sort_memtuples(Tuplesortstate *state);
 static void tuplesort_heap_build(Tuplesortstate *state);
 static void tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple);
 static void tuplesort_heap_delete_top(Tuplesortstate *state);
+static void tuplesort_loser_tree_build(Tuplesortstate *state);
+static void tuplesort_loser_tree_adjust(Tuplesortstate *state);
 static void reversedirection(Tuplesortstate *state);
 static unsigned int getlen(LogicalTape *tape, bool eofOK);
 static void markrunend(LogicalTape *tape);
@@ -682,6 +690,10 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate, int sortopt)
 	state->base.tuples = true;
 	state->abbrevNext = 10;
 
+	/* Use loser tree for k-way merge if enable_loser_tree is true */
+	state->useLoserTree = enable_loser_tree;
+	state->losers = NULL;
+
 	/*
 	 * workMem is forced to be at least 64KB, the current minimum valid value
 	 * for the work_mem GUC.  This is a defense against parallel sort callers
@@ -1652,11 +1664,12 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 			 */
 			if (state->memtupcount > 0)
 			{
-				int			srcTapeIndex = state->memtuples[0].srctape;
+				int         i = (state->useLoserTree ? state->losers[0] : 0);
+				int			srcTapeIndex = state->memtuples[i].srctape;
 				LogicalTape *srcTape = state->inputTapes[srcTapeIndex];
 				SortTuple	newtup;
 
-				*stup = state->memtuples[0];
+				*stup = state->memtuples[i];
 
 				/*
 				 * Remember the tuple we return, so that we can recycle its
@@ -1665,16 +1678,17 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 				state->lastReturnedTuple = stup->tuple;
 
 				/*
-				 * Pull next tuple from tape, and replace the returned tuple
-				 * at top of the heap with it.
+				 * Pull next tuple from tape, and adjust the heap or loser tree.
 				 */
 				if (!mergereadnext(state, srcTape, &newtup))
 				{
 					/*
 					 * If no more data, we've reached end of run on this tape.
-					 * Remove the top node from the heap.
 					 */
-					tuplesort_heap_delete_top(state);
+					if (state->useLoserTree)
+						state->memtuples[i].srctape = LOSER_TREE_EOF;
+					else
+						tuplesort_heap_delete_top(state);
 					state->nInputRuns--;
 
 					/*
@@ -1682,10 +1696,22 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 					 * anyway, but better to release the memory early.
 					 */
 					LogicalTapeClose(srcTape);
-					return true;
 				}
-				newtup.srctape = srcTapeIndex;
-				tuplesort_heap_replace_top(state, &newtup);
+				else
+				{
+					newtup.srctape = srcTapeIndex;
+					if (state->useLoserTree)
+						state->memtuples[i] = newtup;
+					else
+						tuplesort_heap_replace_top(state, &newtup);
+				}
+
+				if (state->useLoserTree)
+				{
+					tuplesort_loser_tree_adjust(state);
+					if (state->losers[0] == LOSER_TREE_EOF)
+						state->memtupcount = 0;
+				}
 				return true;
 			}
 			return false;
@@ -2066,8 +2092,8 @@ mergeruns(Tuplesortstate *state)
 		init_slab_allocator(state, 0);
 
 	/*
-	 * Allocate a new 'memtuples' array, for the heap.  It will hold one tuple
-	 * from each input tape.
+	 * Allocate a new 'memtuples' array.  It will hold one tuple from each
+	 * input tape.
 	 *
 	 * We could shrink this, too, between passes in a multi-pass merge, but we
 	 * don't bother.  (The initial input tapes are still in outputTapes.  The
@@ -2078,6 +2104,14 @@ mergeruns(Tuplesortstate *state)
 														state->nOutputTapes * sizeof(SortTuple));
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
+	/* Allocate a 'losers' array if we will use loser tree for merge */
+	if (state->useLoserTree && state->losers == NULL)
+	{
+		state->losers = (int16 *) MemoryContextAlloc(state->base.maincontext,
+													 state->nOutputTapes * sizeof(int16));
+		USEMEM(state, GetMemoryChunkSpace(state->losers));
+	}
+
 	/*
 	 * Use all the remaining memory we have available for tape buffers among
 	 * all the input tapes.  At the beginning of each merge pass, we will
@@ -2195,62 +2229,72 @@ mergeruns(Tuplesortstate *state)
 static void
 mergeonerun(Tuplesortstate *state)
 {
+	int         i;
 	int			srcTapeIndex;
 	LogicalTape *srcTape;
 
 	/*
 	 * Start the merge by loading one tuple from each active source tape into
-	 * the heap.
+	 * the heap or loser tree.
 	 */
 	beginmerge(state);
 
 	Assert(state->slabAllocatorUsed);
 
 	/*
-	 * Execute merge by repeatedly extracting lowest tuple in heap, writing it
-	 * out, and replacing it with next tuple from same tape (if there is
-	 * another one).
+	 * Execute merge by repeatedly extracting the tuple in the heap or loser
+	 * tree, writing it out, pulling next tuple from same tape (if there is
+	 * another one), and adjusting the heap or loser tree.
 	 */
 	while (state->memtupcount > 0)
 	{
 		SortTuple	stup;
 
 		/* write the tuple to destTape */
-		srcTapeIndex = state->memtuples[0].srctape;
+		i = (state->useLoserTree ? state->losers[0] : 0);
+		srcTapeIndex = state->memtuples[i].srctape;
 		srcTape = state->inputTapes[srcTapeIndex];
-		WRITETUP(state, state->destTape, &state->memtuples[0]);
+		WRITETUP(state, state->destTape, &state->memtuples[i]);
 
 		/* recycle the slot of the tuple we just wrote out, for the next read */
-		if (state->memtuples[0].tuple)
-			RELEASE_SLAB_SLOT(state, state->memtuples[0].tuple);
+		if (state->memtuples[i].tuple)
+			RELEASE_SLAB_SLOT(state, state->memtuples[i].tuple);
 
-		/*
-		 * pull next tuple from the tape, and replace the written-out tuple in
-		 * the heap with it.
-		 */
+		/* pull next tuple from the tape, and adjust the heap or loser tree */
 		if (mergereadnext(state, srcTape, &stup))
 		{
 			stup.srctape = srcTapeIndex;
-			tuplesort_heap_replace_top(state, &stup);
+			if (state->useLoserTree)
+				state->memtuples[i] = stup;
+			else
+				tuplesort_heap_replace_top(state, &stup);
 		}
 		else
 		{
-			tuplesort_heap_delete_top(state);
+			if (state->useLoserTree)
+				state->memtuples[i].srctape = LOSER_TREE_EOF;
+			else
+				tuplesort_heap_delete_top(state);
 			state->nInputRuns--;
 		}
+
+		if (state->useLoserTree)
+		{
+			tuplesort_loser_tree_adjust(state);
+			if (state->losers[0] == LOSER_TREE_EOF)
+				state->memtupcount = 0;
+		}
 	}
 
-	/*
-	 * When the heap empties, we're done.  Write an end-of-run marker on the
-	 * output tape.
-	 */
+	/* Done. Write an end-of-run marker on the output tape. */
 	markrunend(state->destTape);
 }
 
 /*
  * beginmerge - initialize for a merge pass
  *
- * Fill the merge heap with the first tuple from each input tape.
+ * Pull the first tuple from each input tape, and build a heap
+ * or loser tree for the merge pass.
  */
 static void
 beginmerge(Tuplesortstate *state)
@@ -2276,7 +2320,10 @@ beginmerge(Tuplesortstate *state)
 		}
 	}
 
-	tuplesort_heap_build(state);
+	if (state->useLoserTree)
+		tuplesort_loser_tree_build(state);
+	else
+		tuplesort_heap_build(state);
 }
 
 /*
@@ -2826,6 +2873,95 @@ tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple)
 	memtuples[i] = *tuple;
 }
 
+/*
+ * Build a valid loser tree from memtuples[].
+ */
+static void
+tuplesort_loser_tree_build(Tuplesortstate *state)
+{
+	SortTuple  *memtuples = state->memtuples;
+	int         k = state->memtupcount;
+	int16       winners[MAXORDER];
+	int         i;
+
+	Assert(state->losers != NULL);
+	Assert(k <= MAXORDER);
+
+	/* this can happen? */
+	if (unlikely(k <= 0))
+		return;
+
+	/* 1-way merge, losers[0] is always 0 */
+	if (k == 1)
+	{
+		state->losers[0] = 0;
+		return;
+	}
+
+	/* fill the losers array */
+	for (i = k - 1; i > 0; i--)
+	{
+		int l = i * 2;
+		int r = l + 1;
+		int ll = (l >= k ? l - k : winners[l]);
+		int rr = (r >= k ? r - k : winners[r]);
+
+		if (COMPARETUP(state, &memtuples[ll], &memtuples[rr]) < 0)
+		{
+			winners[i] = ll;
+			state->losers[i] = rr;
+		}
+		else
+		{
+			winners[i] = rr;
+			state->losers[i] = ll;
+		}
+	}
+	state->losers[0] = winners[1];
+}
+
+/*
+ * Adjust the loser tree to get the new winner.  Caller must have already
+ * pulled the next tuple from the last winner's input tape, and placed it
+ * to memtuples[losers[0]].
+ */
+static void
+tuplesort_loser_tree_adjust(Tuplesortstate *state)
+{
+	SortTuple  *memtuples = state->memtuples;
+	int         k = state->memtupcount;
+	int         winner = state->losers[0];
+	int         i = (winner + k) / 2;
+
+	Assert(state->losers != NULL);
+	Assert(k > 0 && k <= MAXORDER);
+	Assert(winner >= 0 && winner < k);
+
+	/*
+	 * Reach end of run on the tape, set winner to LOSER_TREE_EOF so that
+	 * it's always a loser in comparison.
+	 */
+	if (memtuples[winner].srctape == LOSER_TREE_EOF)
+		winner = LOSER_TREE_EOF;
+
+	for (; i > 0; i /= 2)
+	{
+		int loser = state->losers[i];
+
+		/* LOSER_TREE_EOF is always a loser */
+		if (loser == LOSER_TREE_EOF)
+			continue;
+
+		if (winner == LOSER_TREE_EOF ||
+			COMPARETUP(state, &memtuples[winner], &memtuples[loser]) > 0)
+		{
+			state->losers[i] = winner;
+			winner = loser;
+		}
+	}
+	state->losers[0] = winner;
+}
+
 /*
  * Function to reverse the sort direction from its current state
  *
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index f21ec37da89..8c22eed716b 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -324,6 +324,7 @@ extern PGDLLIMPORT int tcp_user_timeout;
 extern PGDLLIMPORT char *role_string;
 extern PGDLLIMPORT bool in_hot_standby_guc;
 extern PGDLLIMPORT bool trace_sort;
+extern PGDLLIMPORT bool enable_loser_tree;
 
 #ifdef DEBUG_BOUNDED_SORT
 extern PGDLLIMPORT bool optimize_bounded_sort;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 0411db832f1..b492e9de6fd 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -159,6 +159,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_incremental_sort        | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
+ enable_loser_tree              | off
  enable_material                | on
  enable_memoize                 | on
  enable_mergejoin               | on
@@ -173,7 +174,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(25 rows)
+(26 rows)
 
 -- There are always wait event descriptions for various types.  InjectionPoint
 -- may be present or absent, depending on history since last postmaster start.
-- 
2.52.0

#10 John Naylor
johncnaylorls@gmail.com
In reply to: cca5507 (#9)
Re: Support loser tree for k-way merge

On Thu, Dec 4, 2025 at 1:11 PM cca5507 <cca5507@qq.com> wrote:

I summarized the number of comparisons needed for different 'k':

k = 2, heap: 1, loser tree: 1
k = 3, heap: 2, loser tree: [1, 2]
k = 4, heap: [2, 3], loser tree: 2
k = 5, heap: [2, 4], loser tree: [2, 3]

So if k < 5, the loser tree is never worse than the heap for any input data.

Please explain your notation. For starters, does "comparison" refer to
sortkey comparison or does it include checking for sentinel? If loser
tree can't early return, why is the number not always a constant?

If "k" is very small, I'm guessing the merge step is small compared to
sorting the individual runs, in which case it matters less which one
to use. That's just a guess, though -- we need structured testing.

On Fri, Dec 5, 2025 at 11:11 PM cca5507 <cca5507@qq.com> wrote:

Can we support the loser tree first and set the default value of enable_loser_tree to off?

If you, the patch author, cannot demonstrate how to choose this
setting, what makes you think our users can? (That said, a temporary
GUC is useful for testing)

Here's a half-baked idea: If the regressions are mostly in
low-cardinality inputs, is it possible to add a fast path that just
checks if the current key is the same as the last one?
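
A rough sketch of what that fast path could look like, with integer keys (purely illustrative;
the names and the loser_tree_replay() helper are made up, and this is not part of either patch):

```
/* assumed helper: the usual loser-tree replay, one comparison per level */
void		loser_tree_replay(int *losers, const int *keys, int k);

/*
 * Hypothetical fast path: compare the key that just refilled the winner's
 * slot against the key we last wrote out.  If they are equal, it is still
 * <= every other run's front key (those were all >= the last output), so
 * it can win again without replaying the tree.  Otherwise we pay one
 * extra comparison and fall back to the normal replay.
 */
static int
next_winner(int *losers, const int *keys, int k, int last_output_key)
{
	if (keys[losers[0]] == last_output_key)
		return losers[0];		/* fast path: skip the replay */

	loser_tree_replay(losers, keys, k);
	return losers[0];
}
```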

--
John Naylor
Amazon Web Services

#11 cca5507
cca5507@qq.com
In reply to: John Naylor (#10)
Re: Support loser tree for k-way merge

 I summarized the number of comparisons needed for different 'k':

 k = 2, heap: 1, loser tree: 1
 k = 3, heap: 2, loser tree: [1, 2]
 k = 4, heap: [2, 3], loser tree: 2
 k = 5, heap: [2, 4], loser tree: [2, 3]

 So if k < 5, the loser tree is never worse than the heap for any input data.

Please explain your notation. For starters, does "comparison" refer to
sortkey comparison or does it include checking for sentinel?

The "k" is the number of runs in the k-way merge, and the "comparisons" are the number of tuple
comparisons during one heap sift-down or one loser tree adjustment.

If loser tree can't early return, why is the number not always a constant?

Because the leaf nodes are in the bottom two levels of the loser tree, and different levels
require different numbers of comparisons.

If "k" is very small, I'm guessing the merge step is small compared to
sorting the individual runs, in which case it matters less which one
to use. That's just a guess, though -- we need structured testing.

Yeah, the loser tree saves at most one tuple comparison per adjustment if k < 5.

If you, the patch author, cannot demonstrate how to choose this
setting, what makes you think our users can? (That said, a temporary
GUC is useful for testing)

Yeah, users may have no idea about the characteristics of their data, so it's hard for them
to choose the setting.

Here's a half-baked idea: If the regressions are mostly in
low-cardinality inputs, is it possible to add a fast path that just
checks if the current key is the same as the last one?

For the heap, it saves one tuple comparison if the keys are the same and adds one if not.
For the loser tree, it saves many tuple comparisons (maybe the tree's height - 1?) if the keys
are the same and adds one if not. The bad case is when all keys are different, so we still need
to decide when to use the fast path, which I think is hard.

--
Regards,
ChangAo Chen

#12 Andreas Karlsson
andreas@proxel.se
In reply to: cca5507 (#11)
Re: Support loser tree for k-way merge

On 12/8/25 7:46 AM, cca5507 wrote:

For the heap, it saves one tuple comparison if the keys are the same and adds one if not.
For the loser tree, it saves many tuple comparisons (maybe the tree's height - 1?) if the keys
are the same and adds one if not. The bad case is when all keys are different, so we still need
to decide when to use the fast path, which I think is hard.

My suggestion is that you start by trying to find some cases where we
get regressions, measure how big these regressions are, and check whether
there are any clear cutoffs where we can use a simple heuristic to select the
algorithm. One thought I have is that pre-sorted input could be slower
with the loser tree than with the heap, but since I am unfamiliar with loser
trees I could be wrong.

Andreas