Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

Started by Peter Geoghegan, over 10 years ago. 27 messages.
#1 Peter Geoghegan
pg@heroku.com
2 attachment(s)

The behavior of external sorts that do not require any merge step due
to only having one run (what EXPLAIN ANALYZE output shows as an
"external sort", and not a "merge sort") seems like an area that can
be significantly improved upon. As noted in code comments, this
optimization did not appear in The Art of Computer Programming, Volume
III. It's not an unreasonable idea, but it doesn't work well on modern
machines due to using heapsort, which is known to use the cache
ineffectively. It also often implies significant additional I/O for
little or no benefit. I suspect that all the reports we've heard of
smaller work_mem sizes improving sort performance are actually down to
this one-run optimization *hurting* performance.

The existing optimization for this case tries to benefit from avoiding
a final N-way merge step; that's what it's all about (this is not to
be confused with the other reason why a sort can end in
TSS_SORTEDONTAPE -- because random tape access is required, and an
*on-the-fly* TSS_FINALMERGE merge step cannot happen. Currently,
TSS_SORTEDONTAPE is sometimes the "fast" case and sometimes the slow
case taken only because the caller specified randomAccess, and I'm
improving on only the "fast" case [1], where the caller may or may not
have requested randomAccess. I require that they specify !randomAccess
to use my improved optimization, though, and I'm not trying to avoid a
merge step.)

The existing optimization just dumps all tuples in the memtuples array
(which is a heap at that point) to tape, from the top of the heap,
writing a tuple out at a time, maintaining the heap invariant
throughout. Then, with the memtuples array emptied, tuples are fetched
from tape/disk for client code, without any merge step occurring
on-the-fly (or at all).
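
For illustration, that path is just a heap drain. Here is a minimal
standalone sketch of the idea over an int array -- not the actual
tuplesort.c code, which works on SortTuples and tracks run numbers:

static void
sift_down(int *heap, int n, int i)
{
	for (;;)
	{
		int		smallest = i;
		int		l = 2 * i + 1;
		int		r = 2 * i + 2;

		if (l < n && heap[l] < heap[smallest])
			smallest = l;
		if (r < n && heap[r] < heap[smallest])
			smallest = r;
		if (smallest == i)
			break;

		{
			int		tmp = heap[i];

			heap[i] = heap[smallest];
			heap[smallest] = tmp;
		}
		i = smallest;
	}
}

static void
drain_heap(int *heap, int n, void (*write_to_tape) (int))
{
	while (n > 0)
	{
		/* heap top is the minimum remaining element; "dump" it */
		write_to_tape(heap[0]);
		/* move the last element to the root, restore the invariant */
		heap[0] = heap[--n];
		sift_down(heap, n, 0);
	}
}

Each pop is O(log n) and jumps around the array in the scattered
access pattern that makes heapsort so cache-unfriendly.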

Patch -- "quicksort with spillover"
=========================

With the attached patch, I propose to add an additional, better "one
run special case" optimization. This new special case "splits" the
single run into 2 "subruns". One of the runs is comprised of whatever
tuples were in memory when the caller finished passing tuples to
tuplesort. To sort that, we use quicksort, which in general has
various properties that make it much faster than heapsort -- it's a
cache oblivious algorithm, which is important these days. The other
"subrun" is whatever tuples were on-tape when tuplesort_performsort()
was called. This will often be a minority of the total, but certainly
never much more than half. This is already sorted when
tuplesort_performsort() is reached. This spillover is already inserted
at the front of the sorted-on-tape tuples, and so already has
reasonably good cache characteristics.

With the patch, we perform an on-the-fly merge that is somewhat
similar to the existing (unaffected) "merge sort" TSS_FINALMERGE case,
except that one of the runs is in memory, and is potentially much
larger than the on-tape/disk run (but never much smaller), and is
quicksorted. The existing "merge sort" case similarly is only used
when the caller specified !randomAccess.
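
Stripped of the SortTuple machinery, abbreviated keys, and incremental
tape reads, that final merge is an ordinary two-way merge. A minimal
sketch over int arrays (as in the patch, tape tuples are returned
before equal in-memory tuples):

static void
merge_tape_and_memory(const int *tape, int ntape,
					  const int *mem, int nmem,
					  void (*emit) (int))
{
	int		t = 0;
	int		m = 0;

	while (t < ntape && m < nmem)
	{
		/* on ties, prefer the tape tuple */
		if (tape[t] <= mem[m])
			emit(tape[t++]);
		else
			emit(mem[m++]);
	}
	while (t < ntape)
		emit(tape[t++]);
	while (m < nmem)
		emit(mem[m++]);
}

The real implementation additionally caches one tape tuple between
calls and exhausts the tape subrun first, but the control flow is this
simple.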

For subtle reasons, the new TSS_MEMTAPEMERGE case will happen
significantly more frequently than the existing, comparable
TSS_SORTEDONTAPE case currently happens (this applies to !randomAccess
callers only, of course). See comments added to
tuplesort_performsort(), and the commit message for the details. Note
that the existing, comparable case was relocated to
tuplesort_performsort(), to highlight that it is now a fallback for
the new TSS_MEMTAPEMERGE case (also in tuplesort_performsort()).

Performance
==========

This new approach can be much faster. For example:

select setseed(1);
-- 10 million tuple table with low cardinality integer column to sort:
create unlogged table low_cardinality_int4 as select (random() *
1000)::int4 s, 'abcdefghijlmn'::text junk from generate_series(1,
10000000);
set work_mem = '225MB';

-- test query:
select count(distinct(s)) from low_cardinality_int4;
count
-------
1001
(1 row)

On my laptop, a patched Postgres takes about 4.2 seconds to execute
this query. Master takes about 16 seconds. The patch sees this case
quicksort 9,830,398 tuples out of a total of 10 million with this
225MB work_mem setting. This is chosen to be a sympathetic case, but
it is still quite realistic.
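
With the patch applied, the new method is visible directly in EXPLAIN
ANALYZE output, since the patch adds the "quicksort with spillover"
method name and a "Rows Quicksorted" field to show_sort_info(). For
example (a count(distinct ...) sort happens inside the Aggregate node
and isn't reported by EXPLAIN, so an ORDER BY query is shown instead,
with the variable figures elided):

explain analyze select * from low_cardinality_int4 order by s;
-- the Sort node reports something like:
-- Sort Method: quicksort with spillover  Disk: ...kB  Rows Quicksorted: ...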

We should look at a much less sympathetic case, too. Even when the
in-memory "subrun" is about as small as it can be while still having
this new special case optimization occur at all, the patch still ends
up pretty far ahead of the master branch. With work_mem at 100MB,
4,369,064 tuples are quicksorted by the patch when the above query is
executed, which is less than half of the total (10 million). Execution
with the patch now takes about 10.2 seconds. Master is about 14.7
seconds with the same work_mem setting (yes, master gets faster as the
patch gets slower as work_mem is reduced...but they never come close
to meeting). That's still a fairly decent improvement, and it occurs
when we're on the verge of not using the optimization at all.

Most users that really care about performance will at least have
enough memory to benefit from this optimization when building an index
on a large table, because the patch roughly halves the amount of
memory you need to get at least some of the benefit of an internal
sort.

Performance regressions
----------------------------------

I have been unable to find any regressions in the performance of
queries with the patch. If you're looking for a query that might have
been regressed, I suggest coming up with a case involving a sort with
pass-by-reference types that are expensive. For example, sorting a
tuple on many low cardinality text attributes. Only the most extreme
such cases seem likely to be regressed, though, because heapsort also
has bad temporal locality. My strcoll() result caching patch [2] tries
to take advantage of temporal locality rather than spatial locality,
which works well with quicksort and mergesort.

Memory use
-----------------

The new TSS_MEMTAPEMERGE case uses no more memory than the existing
"TSS_SORTEDONTAPE due to one run" optimization (actually, it uses very
slightly less) if we only worry about the high-water mark. In both
cases the high-water mark comes as the work_mem limit is reached.
Typically, most memory isn't released until the sort is shut down.

Because we're not dumping all tuples here (only as many as we need to
dump to stay within our memory budget), we're also not freeing memory
for each and every "tuple proper" as each and every SortTuple is
written out (because we usually avoid writing out most tuples, which
is an important goal of the patch). Although the memtuples array
itself is often tuplesort's dominant consumer of work_mem, it's still
possible that the aggregate effect of this patch on some workload's
memory consumption is that more memory is used. I doubt it, though;
the overall effect on memory usage will probably always be to reduce
it. Finishing a sort sooner allows us to make memory available for
other operations sooner. Besides, we have broken no promise to the
caller wrt memory use.

Predictability
==========

The patch alters the performance characteristics of tuplesort, but
very much for the better. With the patch, as less memory is gradually
made available for sorting, performance tends to also degrade
gradually, because we more or less have a hybrid internal/external
sort algorithm that, roughly speaking, blends from the former to the
latter based on the availability of memory.

It's particularly useful that performance doesn't fall off a cliff
when we can no longer fit everything in memory because work_mem is
slightly too small. The "almost internal" query from my example above
takes about 4.2 seconds. An equivalent internal sort (quicksort) takes
about 3.5 seconds, which is pretty close (~98% of tuples are
quicksorted for the "almost internal" case, but heapification is a
price we must pay to spill even one tuple). Furthermore, as work_mem
is decreased to the point that even the optimization is no longer used
-- when a traditional "merge sort"/TSS_FINALMERGE is used, instead --
there is also no big drop.

*Predictable* performance characteristics are a big asset. Decreasing
work_mem by 10% (or a moderate increase in the size of a table) should
not make the performance of reporting queries tank. Concerns about
worst case performance (e.g. particular queries suddenly taking much
longer to execute) have certainly prevented me from decreasing
work_mem from within postgresql.conf in the past, even though I was
fairly confident that it made sense for the average case.

Open Items
=========

There are a few smaller open items indicated by "XXX" comments. There
is a little overlap with this patch, and a couple of others that are
in the queue that also affect sorting. For example, the patch
anticipates cases that don't exist yet.

Future work
=========

In the future, we should think about optimization of the "merge
sort"/TSS_FINALMERGE case, which should be made to sort *every* run
using quicksort (the devil is in the details there, but in general I
think that runs should be quicksorted wherever possible). For now,
what I came up with seems like a relatively simple approach that
offers much of the benefit of that more comprehensive project, since
the need to do a "merge sort" is increasingly rare due to the enormous
main memory sizes that are affordable these days. Of course, Noah's
excellent work on huge repalloc() also contributes to our being able
to put large amounts of memory to good use when sorting.

Since heapification is now a big fraction of the total cost of a sort
sometimes, even where the heap invariant need not be maintained for
any length of time afterwards, it might be worth revisiting the patch
to make that an O(n) rather than an O(n log n) operation [3]. Not sure
about that, but someone should probably look into it. Jeremy Harris is
CC'd here; perhaps he will consider picking that work up once more in
light of this development. It would be nice to get a sort that
quicksorts 99% of all tuples even closer to a conventional internal
sort.
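
For reference, the O(n) approach there is bottom-up ("Floyd") heap
construction: sift down every internal node, starting from the last
one. A minimal sketch, reusing the sift_down() helper from the earlier
illustration:

static void
heapify(int *a, int n)
{
	int		i;

	/* total work across all sift-downs is O(n) */
	for (i = n / 2 - 1; i >= 0; i--)
		sift_down(a, n, i);
}

That contrasts with heapifying by inserting tuples one at a time,
which costs O(log n) per insertion, O(n log n) in total.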

[1]: Seems bogus to me that EXPLAIN ANALYZE shows state
TSS_SORTEDONTAPE as "external sort" rather than a "merge sort",
regardless of whether or not a merge step was actually required (that
is, regardless of whether or not state ended up TSS_SORTEDONTAPE due
to using the existing "single run" optimization, or because caller
required randomAccess)
[2]: https://commitfest.postgresql.org/6/294/
[3]: /messages/by-id/52F16843.8080001@wizmail.org
--
Peter Geoghegan

Attachments:

0002-Add-cursory-regression-tests-for-sorting.patch (text/x-patch)
From 35842f788f43712f33533a65154aafc700ff6d6f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Wed, 29 Jul 2015 15:38:12 -0700
Subject: [PATCH 2/2] Add cursory regression tests for sorting

This is not intended to be a formal patch submission.  Tests are added
that happened to be useful during the development of the "quicksort with
spillover" patch, as the regression tests currently have precisely zero
coverage for any variety of external sort operation.  The tests are
provided as a convenience to reviewers of that patch only.

In the long run, there should be comprehensive smoke-testing of these
cases (probably not in the standard regression tests), but this patch
does not pretend to be any kind of basis for that.

The tests added have a number of obvious problems:

* They take far too long to run to be in the standard regression test
suite, and yet they're run as part of that suite.  With a little effort,
they could probably be made to run quicker with no appreciable loss of
coverage, but that didn't happen.

* They're far from comprehensive.

* The tests require about 1.5GB of disk space to run.

* They're not portable.  They might even be extremely non-portable due
to implementation differences across platform pseudo-random number
generators.

The important point is that each query tested gives consistent results.
There is no reason to think that varying work_mem settings will affect
which basic approach to sorting each query takes as compared to during
my original testing (at least assuming a 64-bit platform), which is also
important.
---
 src/test/regress/expected/sorting.out | 210 ++++++++++++++++++++++++++++++++++
 src/test/regress/parallel_schedule    |   1 +
 src/test/regress/serial_schedule      |   1 +
 src/test/regress/sql/sorting.sql      |  82 +++++++++++++
 4 files changed, 294 insertions(+)
 create mode 100644 src/test/regress/expected/sorting.out
 create mode 100644 src/test/regress/sql/sorting.sql

diff --git a/src/test/regress/expected/sorting.out b/src/test/regress/expected/sorting.out
new file mode 100644
index 0000000..3aa30f4
--- /dev/null
+++ b/src/test/regress/expected/sorting.out
@@ -0,0 +1,210 @@
+--
+-- sorting tests
+--
+--
+-- int4 test (10 million tuples, medium cardinality)
+--
+select setseed(1);
+ setseed 
+---------
+ 
+(1 row)
+
+create unlogged table int4_sort_tbl as
+  select (random() * 1000000)::int4 s, 'abcdefghijlmn'::text junk
+  from generate_series(1, 10000000);
+--
+-- int4 test (10 million tuples, high cardinality)
+--
+create unlogged table highcard_int4_sort_tbl as
+  select (random() * 100000000)::int4 s, 'abcdefghijlmn'::text junk
+  from generate_series(1, 10000000);
+--
+-- int4 test (10 million tuples, low cardinality)
+--
+create unlogged table lowcard_int4_sort_tbl as
+  select (random() * 100)::int4 s, 'abcdefghijlmn'::text junk
+  from generate_series(1, 10000000);
+-- Results should be consistent:
+set work_mem = '64MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count  
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+  count  
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count 
+-------
+   101
+(1 row)
+
+set work_mem = '100MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count  
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+  count  
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count 
+-------
+   101
+(1 row)
+
+set work_mem = '110MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count  
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+  count  
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count 
+-------
+   101
+(1 row)
+
+set work_mem = '128MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count  
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+  count  
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count 
+-------
+   101
+(1 row)
+
+set work_mem = '140MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count  
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+  count  
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count 
+-------
+   101
+(1 row)
+
+set work_mem = '150MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count  
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+  count  
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count 
+-------
+   101
+(1 row)
+
+-- should be in-memory:
+set work_mem = '512MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count  
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+  count  
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count 
+-------
+   101
+(1 row)
+
+--
+-- text test (uses abbreviated keys, 10 million tuples)
+--
+select setseed(1);
+ setseed 
+---------
+ 
+(1 row)
+
+create unlogged table text_sort_tbl as
+  select (random() * 100000)::int4::text s
+  from generate_series(1, 10000000);
+-- Start with sort that results in 3-way final merge:
+set work_mem = '190MB';
+select count(distinct(s)) from text_sort_tbl;
+ count  
+--------
+ 100001
+(1 row)
+
+-- Uses optimization where it's marginal:
+set work_mem = '200MB';
+select count(distinct(s)) from text_sort_tbl;
+ count  
+--------
+ 100001
+(1 row)
+
+-- Uses optimization where it's favorable:
+set work_mem = '450MB';
+select count(distinct(s)) from text_sort_tbl;
+ count  
+--------
+ 100001
+(1 row)
+
+-- internal sort:
+set work_mem = '500MB';
+select count(distinct(s)) from text_sort_tbl;
+ count  
+--------
+ 100001
+(1 row)
+
+drop table int4_sort_tbl;
+drop table highcard_int4_sort_tbl;
+drop table lowcard_int4_sort_tbl;
+drop table text_sort_tbl;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 4df15de..7ff656a 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -37,6 +37,7 @@ test: geometry horology regex oidjoins type_sanity opr_sanity
 # ----------
 test: insert
 test: insert_conflict
+test: sorting
 test: create_function_1
 test: create_type
 test: create_table
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 15d74d4..ebe7de0 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -51,6 +51,7 @@ test: type_sanity
 test: opr_sanity
 test: insert
 test: insert_conflict
+test: sorting
 test: create_function_1
 test: create_type
 test: create_table
diff --git a/src/test/regress/sql/sorting.sql b/src/test/regress/sql/sorting.sql
new file mode 100644
index 0000000..75453fc
--- /dev/null
+++ b/src/test/regress/sql/sorting.sql
@@ -0,0 +1,82 @@
+--
+-- sorting tests
+--
+
+--
+-- int4 test (10 million tuples, medium cardinality)
+--
+select setseed(1);
+create unlogged table int4_sort_tbl as
+  select (random() * 1000000)::int4 s, 'abcdefghijlmn'::text junk
+  from generate_series(1, 10000000);
+
+--
+-- int4 test (10 million tuples, high cardinality)
+--
+create unlogged table highcard_int4_sort_tbl as
+  select (random() * 100000000)::int4 s, 'abcdefghijlmn'::text junk
+  from generate_series(1, 10000000);
+
+--
+-- int4 test (10 million tuples, low cardinality)
+--
+create unlogged table lowcard_int4_sort_tbl as
+  select (random() * 100)::int4 s, 'abcdefghijlmn'::text junk
+  from generate_series(1, 10000000);
+
+-- Results should be consistent:
+set work_mem = '64MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+set work_mem = '100MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+set work_mem = '110MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+set work_mem = '128MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+set work_mem = '140MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+set work_mem = '150MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+-- should be in-memory:
+set work_mem = '512MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+
+--
+-- text test (uses abbreviated keys, 10 million tuples)
+--
+select setseed(1);
+create unlogged table text_sort_tbl as
+  select (random() * 100000)::int4::text s
+  from generate_series(1, 10000000);
+
+-- Start with sort that results in 3-way final merge:
+set work_mem = '190MB';
+select count(distinct(s)) from text_sort_tbl;
+-- Uses optimization where it's marginal:
+set work_mem = '200MB';
+select count(distinct(s)) from text_sort_tbl;
+-- Uses optimization where it's favorable:
+set work_mem = '450MB';
+select count(distinct(s)) from text_sort_tbl;
+-- internal sort:
+set work_mem = '500MB';
+select count(distinct(s)) from text_sort_tbl;
+
+drop table int4_sort_tbl;
+drop table highcard_int4_sort_tbl;
+drop table lowcard_int4_sort_tbl;
+drop table text_sort_tbl;
-- 
1.9.1

0001-Quicksort-when-only-one-initial-run-is-produced.patch (text/x-patch)
From e0f8a3c4bd6863c90d4169b5cd6a222c7045d6ad Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Wed, 29 Jul 2015 15:38:12 -0700
Subject: [PATCH 1/2] Quicksort when only one initial run is produced

Problem
=======

Previously, in the event of only producing a single run, by the time
tuplesort_performsort() is called tuplesort had only incrementally
spilled tuples from the memtuples array (which is a heap at that stage)
onto tape.  The incremental spilling occurs by consuming from the heap,
meaning that on-tape tuples are sorted.  Following dumping all tuples at
once (in a similar fashion to how tuples were incrementally spilled
before that point) and observing that there was only ever one run,
tuplesort realized its "one run" optimization was applicable (In
general, tuplesort imagined that once a sort cannot fit entirely in
memory, all tuples must be dumped within tuplesort_performsort()).

All tuples end up sorted and on tape (tuple comparisons occur as the
heap invariant is maintained as the memtuples heap is spilled).  This
approach is useful because no final merge step is required, unlike the
"merge sort" case where more than one run is used and multiple tapes
must be merged.  However, it had a number of significant disadvantages,
including:

* The tuples often don't actually need to be written out to tape/disk,
but that happens anyway.

* The sort step is a degenerate heapsort.  Heapsort is an
algorithm that performs very poorly on modern CPUs with fast caches.

* Since we are not quicksorting, that means we're unable to use the
"onlyKey" optimization.  Single-attribute heaptuple and datumtuple sorts
that don't use abbreviated keys benefit from this optimization
considerably.

Solution
========

We can often (but not always) do better. In the event of:

1. Having only a single conventional run (or, more specifically, only
one run that made it to disk/tape -- a new, broader criteria for
applying a "one run" optimization)

2. The caller not requiring random access (a new, more restrictive
criteria for apply a "one run" optimization)

Then:

1. Split input tuples into two conceptual "subruns":  one on-disk, and
the other in-memory

2. Quicksort the memtuples array subrun (the other subrun is already
sorted when the decision to apply the new optimization is made)

3. Finally, merge the new subruns on-the-fly, much like the existing
"external merge" case

Testing shows that this can perform significantly better than the
existing "one run" optimization.
---
 src/backend/commands/explain.c     |  13 +-
 src/backend/utils/sort/tuplesort.c | 332 +++++++++++++++++++++++++++++++++----
 src/include/utils/tuplesort.h      |   3 +-
 3 files changed, 309 insertions(+), 39 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 5d06fa4..4f61ee2 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2178,20 +2178,27 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 		const char *sortMethod;
 		const char *spaceType;
 		long		spaceUsed;
+		int			rowsQuicksorted;
 
-		tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+		tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed,
+							&rowsQuicksorted);
 
 		if (es->format == EXPLAIN_FORMAT_TEXT)
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
-			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
-							 sortMethod, spaceType, spaceUsed);
+			appendStringInfo(es->str,
+							 "Sort Method: %s  %s: %ldkB  Rows Quicksorted: %d\n",
+							 sortMethod,
+							 spaceType,
+							 spaceUsed,
+							 rowsQuicksorted);
 		}
 		else
 		{
 			ExplainPropertyText("Sort Method", sortMethod, es);
 			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
 			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyInteger("Rows Quicksorted", rowsQuicksorted, es);
 		}
 	}
 }
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 435041a..5371f4f 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -72,6 +72,15 @@
  * to one run per logical tape.  The final merge is then performed
  * on-the-fly as the caller repeatedly calls tuplesort_getXXX; this
  * saves one cycle of writing all the data out to disk and reading it in.
+ * Also, if only one run is spilled to tape so far when
+ * tuplesort_performsort() is reached, and if the caller does not require
+ * random access, then the merge step can take place between still
+ * in-memory tuples, and tuples stored on tape (it does not matter that
+ * there may be a second run in that array -- only whether a second one has
+ * spilled).  This ensures that spilling to disk only occurs for a number of
+ * tuples approximately equal to the number of tuples read in after
+ * work_mem was reached and it became apparent that an external sort is
+ * required.
  *
  * Before Postgres 8.2, we always used a seven-tape polyphase merge, on the
  * grounds that 7 is the "sweet spot" on the tapes-to-passes curve according
@@ -186,6 +195,7 @@ typedef enum
 	TSS_BUILDRUNS,				/* Loading tuples; writing to tape */
 	TSS_SORTEDINMEM,			/* Sort completed entirely in memory */
 	TSS_SORTEDONTAPE,			/* Sort completed, final run is on tape */
+	TSS_MEMTAPEMERGE,			/* Performing memory/tape merge on-the-fly */
 	TSS_FINALMERGE				/* Performing final merge on-the-fly */
 } TupSortStatus;
 
@@ -327,12 +337,22 @@ struct Tuplesortstate
 	int			activeTapes;	/* # of active input tapes in merge pass */
 
 	/*
+	 * These variables are used after tuplesort_performsort() for the
+	 * TSS_MEMTAPEMERGE case.  This is a special, optimized final on-the-fly
+	 * merge pass involving merging the result tape with memtuples that were
+	 * quicksorted (but never made it out to a tape).
+	 */
+	SortTuple	tape_cache;		/* cached tape tuple from prior call */
+	bool		cached;			/* tape_cache holds pending tape tuple */
+	bool		just_memtuples;	/* merge only fetching from memtuples */
+
+	/*
 	 * These variables are used after completion of sorting to keep track of
 	 * the next tuple to return.  (In the tape case, the tape's current read
 	 * position is also critical state.)
 	 */
 	int			result_tape;	/* actual tape number of finished output */
-	int			current;		/* array index (only used if SORTEDINMEM) */
+	int			current;		/* array index (only used if quicksorted) */
 	bool		eof_reached;	/* reached EOF (needed for cursors) */
 
 	/* markpos_xxx holds marked position for mark and restore */
@@ -460,6 +480,7 @@ struct Tuplesortstate
 static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
+static void tuplesort_quicksort(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state);
 static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
@@ -1549,6 +1570,23 @@ consider_abort_common(Tuplesortstate *state)
 	return false;
 }
 
+static void
+tuplesort_quicksort(Tuplesortstate *state)
+{
+	if (state->memtupcount > 1)
+	{
+		/* Can we use the single-key sort function? */
+		if (state->onlyKey != NULL)
+			qsort_ssup(state->memtuples, state->memtupcount,
+					   state->onlyKey);
+		else
+			qsort_tuple(state->memtuples,
+						state->memtupcount,
+						state->comparetup,
+						state);
+	}
+}
+
 /*
  * All tuples have been provided; finish the sort.
  */
@@ -1569,20 +1607,9 @@ tuplesort_performsort(Tuplesortstate *state)
 
 			/*
 			 * We were able to accumulate all the tuples within the allowed
-			 * amount of memory.  Just qsort 'em and we're done.
+			 * amount of memory.  Just quicksort 'em and we're done.
 			 */
-			if (state->memtupcount > 1)
-			{
-				/* Can we use the single-key sort function? */
-				if (state->onlyKey != NULL)
-					qsort_ssup(state->memtuples, state->memtupcount,
-							   state->onlyKey);
-				else
-					qsort_tuple(state->memtuples,
-								state->memtupcount,
-								state->comparetup,
-								state);
-			}
+			tuplesort_quicksort(state);
 			state->current = 0;
 			state->eof_reached = false;
 			state->markpos_offset = 0;
@@ -1609,12 +1636,111 @@ tuplesort_performsort(Tuplesortstate *state)
 
 			/*
 			 * Finish tape-based sort.  First, flush all tuples remaining in
-			 * memory out to tape; then merge until we have a single remaining
-			 * run (or, if !randomAccess, one run per tape). Note that
-			 * mergeruns sets the correct state->status.
+			 * memory out to tape where that's required (it's required when
+			 * more than one run is represented on tape, or when the caller
+			 * required random access).
+			 *
+			 * We do this before considering which of the three cases are
+			 * taken, because dumping tuples can increase currentRun,
+			 * altering the outcome.
 			 */
-			dumptuples(state, true);
-			mergeruns(state);
+			if (state->currentRun > 0 || state->randomAccess)
+				dumptuples(state, true);
+
+			if (state->currentRun > 1)
+			{
+				/*
+				 * Merge until we have a single remaining run (or, if
+				 * !randomAccess, one run per tape).  Note that mergeruns sets
+				 * the correct state->status.
+				 */
+				mergeruns(state);
+			}
+			else if (state->currentRun == 0)
+			{
+				/*
+				 * Avoid dumping most tuples.
+				 *
+				 * Since we produced only one initial run and the caller
+				 * does not require random access, we can use our result
+				 * tape as part of the finished output.  We quicksort the
+				 * remaining unsorted (although heapified) tuples in the
+				 * memtuples array, and merge that with the sorted-on-tape
+				 * contents of this run.
+				 *
+				 * This is similar to TSS_FINALMERGE, which also has an
+				 * on-the-fly merge step;  the major difference is that we
+				 * merge memtuples with a tape (each is therefore a "subrun").
+				 * Merging is fairly fast when we merge from memory to a
+				 * single tape, and this avoids significant I/O.
+				 *
+				 * Note that there might actually be 2 runs, but only the
+				 * contents of one of them went to tape, and so we can
+				 * safely "pretend" that there is only 1 run (since we're
+				 * about to give up on the idea of the memtuples array being
+				 * a heap).  This means that if our sort happened to require
+				 * random access, the similar "single run" optimization
+				 * below (which sets TSS_SORTEDONTAPE) might not be used at
+				 * all.  This is because dumping all tuples out might have
+				 * forced an otherwise equivalent randomAccess case to
+				 * acknowledge a second run, which we can avoid.
+				 *
+				 * And so, this optimization is more effective than the
+				 * similar 1 run TSS_SORTEDONTAPE optimization below because
+				 * it will be used more frequently.  A further advantage is
+				 * that quicksort is very fast.  Modern CPUs tend to be
+				 * enormously bottlenecked on memory access, which heap sort
+				 * exacerbates, whereas quicksort is a cache-oblivious
+				 * algorithm, and so uses the cache optimally.  The fact
+				 * that the memtuples array has already been heapified is no
+				 * reason to commit to the path of unnecessarily dumping and
+				 * heap-sorting input tuples.
+				 *
+				 * Effectively, our "single" run is "split" into an on-disk
+				 * subrun and a memtuples subrun that stores a significant
+				 * proportion of all tuples to be sorted.  Often, this
+				 * subrun is much larger than the tape subrun, which is
+				 * where the optimization is most effective.
+				 */
+				Assert(!state->randomAccess);
+				markrunend(state, state->currentRun);
+
+				/*
+				 * Note that the quicksort may use abbreviated keys, which
+				 * are no longer available for the tuples that spilled to
+				 * tape.  This is something that the final on-the-fly merge
+				 * step accounts for.
+				 */
+				tuplesort_quicksort(state);
+				state->current = 0;
+
+#ifdef TRACE_SORT
+				if (trace_sort)
+					elog(LOG, "finished quicksort of %d tuples, a large portion of final run: %s",
+						 state->memtupcount,
+						 pg_rusage_show(&state->ru_start));
+#endif
+
+				state->result_tape = state->tp_tapenum[state->destTape];
+				/* Must freeze and rewind the finished output tape */
+				LogicalTapeFreeze(state->tapeset, state->result_tape);
+				state->status = TSS_MEMTAPEMERGE;
+			}
+			else
+			{
+				/*
+				 * Since we produced only one initial run (quite likely if
+				 * the total data volume is between 1X and 2X workMem), we
+				 * can just use that tape as the finished output, rather
+				 * than doing a useless merge.  (This obvious optimization
+				 * is not in Knuth's algorithm.)
+				 */
+				Assert(state->randomAccess);
+				state->result_tape = state->tp_tapenum[state->destTape];
+				/* Must freeze and rewind the finished output tape */
+				LogicalTapeFreeze(state->tapeset, state->result_tape);
+				state->status = TSS_SORTEDONTAPE;
+			}
 			state->eof_reached = false;
 			state->markpos_block = 0L;
 			state->markpos_offset = 0;
@@ -1633,6 +1759,9 @@ tuplesort_performsort(Tuplesortstate *state)
 			elog(LOG, "performsort done (except %d-way final merge): %s",
 				 state->activeTapes,
 				 pg_rusage_show(&state->ru_start));
+		else if (state->status == TSS_MEMTAPEMERGE)
+			elog(LOG, "performsort done (except memory/tape final merge): %s",
+				 pg_rusage_show(&state->ru_start));
 		else
 			elog(LOG, "performsort done: %s",
 				 pg_rusage_show(&state->ru_start));
@@ -1784,6 +1913,138 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 			READTUP(state, stup, state->result_tape, tuplen);
 			return true;
 
+		case TSS_MEMTAPEMERGE:
+			Assert(forward);
+			/* For now, assume tuple should not be freed by caller */
+			*should_free = false;
+
+			if (state->eof_reached)
+				return false;
+
+			if (state->just_memtuples)
+				goto just_memtuples;
+
+			/*
+			 * Should be at least one memtuple, since no "all" dump of
+			 * memtuples occurred (avoiding that is a big reason for the
+			 * existence of the TSS_MEMTAPEMERGE case)
+			 */
+			Assert(state->memtupcount > 0);
+
+			/*
+			 * Merge together quicksorted memtuples array, and sorted tape.
+			 *
+			 * When the decision was taken to apply this optimization, the
+			 * array was heapified.  Some number of tuples were spilled to
+			 * disk from the top of the heap, and are read from tape here in
+			 * fully sorted order.  That does not imply that there are no
+			 * tuples in the memtuples array that are less than any tuple
+			 * stored on tape;  that's why this merge step is necessary.
+			 *
+			 * "stup" is always the current tape tuple, which may be cached
+			 * from previous call, or read from tape in this call (and
+			 * cached for next).
+			 *
+			 * Exhaust the supply of tape tuples first.
+			 */
+			if (state->cached)
+			{
+				*stup = state->tape_cache;
+			}
+			else if ((tuplen = getlen(state, state->result_tape, true)) != 0)
+			{
+				READTUP(state, stup, state->result_tape, tuplen);
+				state->tape_cache = *stup;
+			}
+			else
+			{
+				/*
+				 * Initially, we mostly only returned tape tuples, with a
+				 * few interspersed memtuples.  From here on, we just fetch
+				 * from memtuples array.
+				 *
+				 * We may end up here rather quickly, particularly when
+				 * there are proportionally fewer tape tuples, which is
+				 * common.  This is because the tape tuples overall sorted
+				 * order skews very low (they're not guaranteed to be
+				 * strictly lower than any memtuples tuple, but since they
+				 * were written to tape from the top of a then-heapified
+				 * memtuples array, many will be).
+				 */
+				state->just_memtuples = true;
+just_memtuples:
+				*should_free = false;
+
+				/*
+				 * XXX:  If this routine ever supports memory prefetching,
+				 * it ought to happen here too.  Often, the majority of
+				 * tuples will be returned to caller from here, in a manner
+				 * very similar to the TSS_SORTEDINMEM case.
+				 */
+				if (state->current < state->memtupcount)
+				{
+					*stup = state->memtuples[state->current++];
+					return true;
+				}
+				state->eof_reached = true;
+				return false;
+			}
+
+			/*
+			 * Kludge:  Trigger abbreviated tie-breaker if in-memory tuples
+			 * use abbreviation (writing tuples to tape never preserves
+			 * abbreviated keys).  Do this by assigning in-memory
+			 * abbreviated tuple to tape tuple directly.
+			 *
+			 * It doesn't seem worth generating a new abbreviated key for
+			 * the tape tuple, and this approach is simpler than
+			 * "unabbreviating" the memtuple tuple from a "common" routine
+			 * like this.
+			 *
+			 * In the future, this routine could offer an API that allows
+			 * certain clients (like ordered set aggregate callers) to
+			 * cheaply test *inequality* across adjacent pairs of sorted
+			 * tuples on the basis of simple abbreviated key binary
+			 * inequality.  Another advantage of this approach is that that
+			 * can still work without reporting to clients that abbreviation
+			 * wasn't used.  The tape tuples might only be a small minority
+			 * of all tuples returned.
+			 */
+			if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
+				stup->datum1 = state->memtuples[state->current].datum1;
+
+			/*
+			 * Compare a tape tuple to a memtuple.
+			 *
+			 * Since we always start with at least one memtuple, and since tape
+			 * tuples are always returned before equal memtuples, it follows
+			 * that there must be at least one memtuple left to return here.
+			 */
+			Assert(state->current < state->memtupcount);
+
+			if (COMPARETUP(state, stup, &state->memtuples[state->current]) <= 0)
+			{
+				/*
+				 * Tape tuple less than or equal to memtuple array current
+				 * iterator position.  Return it.
+				 */
+				state->cached = false;
+				*should_free = true;
+			}
+			else
+			{
+				/*
+				 * Tape tuple greater than memtuple array's current tuple.
+				 *
+				 * Return current memtuple tuple, and cache tape tuple for
+				 * next call, where it may be returned.
+				 */
+				state->cached = true;
+				*should_free = false;
+				*stup = state->memtuples[state->current++];
+			}
+			return true;
+
 		case TSS_FINALMERGE:
 			Assert(forward);
 			*should_free = true;
@@ -2120,6 +2381,8 @@ inittapes(Tuplesortstate *state)
 	state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_tapenum = (int *) palloc0(maxTapes * sizeof(int));
+	state->cached = false;
+	state->just_memtuples = false;
 
 	/*
 	 * Convert the unsorted contents of memtuples[] into a heap. Each tuple is
@@ -2213,21 +2476,6 @@ mergeruns(Tuplesortstate *state)
 	Assert(state->status == TSS_BUILDRUNS);
 	Assert(state->memtupcount == 0);
 
-	/*
-	 * If we produced only one initial run (quite likely if the total data
-	 * volume is between 1X and 2X workMem), we can just use that tape as the
-	 * finished output, rather than doing a useless merge.  (This obvious
-	 * optimization is not in Knuth's algorithm.)
-	 */
-	if (state->currentRun == 1)
-	{
-		state->result_tape = state->tp_tapenum[state->destTape];
-		/* must freeze and rewind the finished output tape */
-		LogicalTapeFreeze(state->tapeset, state->result_tape);
-		state->status = TSS_SORTEDONTAPE;
-		return;
-	}
-
 	if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
 	{
 		/*
@@ -2770,7 +3018,8 @@ void
 tuplesort_get_stats(Tuplesortstate *state,
 					const char **sortMethod,
 					const char **spaceType,
-					long *spaceUsed)
+					long *spaceUsed,
+					int *rowsQuicksorted)
 {
 	/*
 	 * Note: it might seem we should provide both memory and disk usage for a
@@ -2780,6 +3029,10 @@ tuplesort_get_stats(Tuplesortstate *state,
 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
 	 * to fix.  Is it worth creating an API for the memory context code to
 	 * tell us how much is actually used in sortcontext?
+	 *
+	 * XXX:  Revisit this restriction.  TSS_MEMTAPEMERGE may require both
+	 * memory and disk space usage (although since memory used should always
+	 * be work_mem, perhaps not).
 	 */
 	if (state->tapeset)
 	{
@@ -2792,17 +3045,26 @@ tuplesort_get_stats(Tuplesortstate *state,
 		*spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
 	}
 
+	*rowsQuicksorted = 0;
+
 	switch (state->status)
 	{
 		case TSS_SORTEDINMEM:
 			if (state->boundUsed)
 				*sortMethod = "top-N heapsort";
 			else
+			{
 				*sortMethod = "quicksort";
+				*rowsQuicksorted = state->memtupcount;
+			}
 			break;
 		case TSS_SORTEDONTAPE:
 			*sortMethod = "external sort";
 			break;
+		case TSS_MEMTAPEMERGE:
+			*sortMethod = "quicksort with spillover";
+			*rowsQuicksorted = state->memtupcount;
+			break;
 		case TSS_FINALMERGE:
 			*sortMethod = "external merge";
 			break;
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index de6fc56..e5f5501 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -109,7 +109,8 @@ extern void tuplesort_end(Tuplesortstate *state);
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					const char **sortMethod,
 					const char **spaceType,
-					long *spaceUsed);
+					long *spaceUsed,
+					int *rowsQuicksorted);
 
 extern int	tuplesort_merge_order(int64 allowedMem);
 
-- 
1.9.1

#2 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#1)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On 07/30/2015 04:05 AM, Peter Geoghegan wrote:

Patch -- "quicksort with spillover"
=========================

With the attached patch, I propose to add an additional, better "one
run special case" optimization. This new special case "splits" the
single run into 2 "subruns". One of the runs is comprised of whatever
tuples were in memory when the caller finished passing tuples to
tuplesort. To sort that, we use quicksort, which in general has
various properties that make it much faster than heapsort -- it's a
cache oblivious algorithm, which is important these days. The other
"subrun" is whatever tuples were on-tape when tuplesort_performsort()
was called. This will often be a minority of the total, but certainly
never much more than half. This is already sorted when
tuplesort_performsort() is reached. This spillover is already inserted
at the front of the sorted-on-tape tuples, and so already has
reasonably good cache characteristics.

With the patch, we perform an on-the-fly merge that is somewhat
similar to the existing (unaffected) "merge sort" TSS_FINALMERGE case,
except that one of the runs is in memory, and is potentially much
larger than the on-tape/disk run (but never much smaller), and is
quicksorted. The existing "merge sort" case similarly is only used
when the caller specified !randomAccess.

Hmm. You don't really need to merge the in-memory array into the tape,
as you know that all the tuples in the in-memory array must come after the
tuples already on the tape. You can just return all the tuples from the
tape first, and then all the tuples from the array.

So here's a shorter/different explanation of this optimization: When
it's time to perform the sort, instead of draining the in-memory heap
one tuple at a time to the last tape, you sort the heap with quicksort,
and pretend that the sorted heap belongs to the last tape, after all the
other tuples in the tape.

Some questions/thoughts on that:

Isn't that optimization applicable even when you have multiple runs?
Quicksorting the heap and keeping it as an array in memory is surely
always faster than heapsorting and pushing it to the tape.

I think it'd make sense to structure the code differently, to match the
way I described this optimization above. Instead of adding a new
tuplesort state for this, abstract this in the logical tape code. Add a
function to attach an in-memory "tail" to a tape, and have
LogicalTapeRead() read from the tail after reading the on-disk tape. The
rest of the code wouldn't need to care that sometimes part of the tape
is actually in memory.

It should be pretty easy to support randomAccess too. If you think of
the in-memory heap as a tail of the last tape, you can easily move
backwards from the in-memory heap back to the on-disk tape, too.
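
To sketch the API shape I have in mind (every name here other than
LogicalTapeRead() is invented purely for illustration):

/*
 * Hypothetical: attach a sorted in-memory tail to a tape.  Reads fall
 * through to the tail once the on-disk portion is exhausted.
 */
extern void LogicalTapeAttachTail(LogicalTapeSet *lts, int tapenum,
								  void *tail, size_t tailsize);

static size_t
logical_tape_read_with_tail(LogicalTapeSet *lts, int tapenum,
							void *ptr, size_t size)
{
	size_t		nread = LogicalTapeRead(lts, tapenum, ptr, size);

	/* on-disk portion exhausted: continue from the in-memory tail */
	if (nread < size)
		nread += tail_read(lts, tapenum, (char *) ptr + nread,
						   size - nread);

	return nread;
}

(tail_read() standing in for a simple memcpy from the attached array.)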

+	 * Note that there might actually be 2 runs, but only the
+	 * contents of one of them went to tape, and so we can
+	 * safely "pretend" that there is only 1 run (since we're
+	 * about to give up on the idea of the memtuples array being
+	 * a heap).  This means that if our sort happened to require
+	 * random access, the similar "single run" optimization
+	 * below (which sets TSS_SORTEDONTAPE) might not be used at
+	 * all.  This is because dumping all tuples out might have
+	 * forced an otherwise equivalent randomAccess case to
+	 * acknowledge a second run, which we can avoid.

Is that really true? We don't start a second run until we have to, i.e.
when it's time to dump the first tuple of the second run to tape. So I
don't think the case you describe above, where you have two runs but
only one of them has tuples on disk, can actually happen.

Performance
==========

Impressive!

Predictability
==========

Even more impressive!

Future work
=========

As an extra optimization, you could delay quicksorting the in-memory
array until it's time to read the first tuple from it. If the caller
reads only the top-N tuples from the sort for some reason (other than
LIMIT, which we already optimize for), that could avoid a lot of work.
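
Roughly like this -- mem_sorted is an invented flag, and
tuplesort_quicksort() is the helper your patch already adds:

	/* in tuplesort_gettuple_common(), before the first array fetch */
	if (!state->mem_sorted)
	{
		tuplesort_quicksort(state);
		state->mem_sorted = true;
	}

That way the quicksort is only paid for if and when the in-memory
tuples are actually fetched.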

- Heikki


#3 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#2)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On 30 July 2015 at 08:00, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Hmm. You don't really need to merge the in-memory array into the tape, as
you know that all the tuples in the in-memory array must come after the tuples
already on the tape. You can just return all the tuples from the tape
first, and then all the tuples from the array.

Agreed

This is a good optimization for the common case where tuples are mostly
already in order. We could increase the usefulness of this by making UPDATE
pick blocks that are close to a tuple's original block, rather than putting
them near the end of a relation.

So here's a shorter/different explanation of this optimization: When it's
time to perform the sort, instead of draining the in-memory heap one tuple
at a time to the last tape, you sort the heap with quicksort, and pretend
that the sorted heap belongs to the last tape, after all the other tuples
in the tape.

Some questions/thoughts on that:

Isn't that optimization applicable even when you have multiple runs?
Quicksorting the heap and keeping it as an array in memory is surely always
faster than heapsorting and pushing it to the tape.

It's about use of memory. If you have multiple runs on tape, then they will
need to be merged and you need memory to do that efficiently. If there are
tuples in the last batch still in memory then it can work, but it depends
upon how full memory is from the last batch and how many batches there are.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Simon Riggs (#3)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On 07/30/2015 01:47 PM, Simon Riggs wrote:

On 30 July 2015 at 08:00, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

So here's a shorter/different explanation of this optimization: When it's
time to perform the sort, instead of draining the in-memory heap one tuple
at a time to the last tape, you sort the heap with quicksort, and pretend
that the sorted heap belongs to the last tape, after all the other tuples
in the tape.

Some questions/thoughts on that:

Isn't that optimization applicable even when you have multiple runs?
Quicksorting the heap and keeping it as an array in memory is surely always
faster than heapsorting and pushing it to the tape.

It's about use of memory. If you have multiple runs on tape, then they will
need to be merged and you need memory to do that efficiently. If there are
tuples in the last batch still in memory then it can work, but it depends
upon how full memory is from the last batch and how many batches there are.

True, you need a heap to hold the next tuple from each tape in the merge
step. To avoid exceeding work_mem, you'd need to push some tuples from
the in-memory array to the tape to make room for that. In practice,
though, the memory needed for the merge step's heap is tiny. Even if you
merge 1000 tapes, you only need memory for 1000 tuples in the heap. But
yeah, you'll need some logic to share/divide the in-memory array between
the heap and the "in-memory tail" of the last tape.

- Heikki


#5 Greg Stark
stark@mit.edu
In reply to: Heikki Linnakangas (#4)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Thu, Jul 30, 2015 at 12:09 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

True, you need a heap to hold the next tuple from each tape in the merge
step. To avoid exceeding work_mem, you'd need to push some tuples from the
in-memory array to the tape to make room for that. In practice, though, the
memory needed for the merge step's heap is tiny. Even if you merge 1000
tapes, you only need memory for 1000 tuples in the heap. But yeah, you'll
need some logic to share/divide the in-memory array between the heap and
the "in-memory tail" of the last tape.

It's a bit worse than that because we buffer up a significant chunk of the
tape to avoid randomly seeking between tapes after every tuple read. But I
think in today's era of large memory we don't need anywhere near the entire
work_mem just to buffer to avoid random access. Something simple like a
fixed buffer size per tape, probably much less than 1MB/tape, would
suffice.

I'm a bit confused where the big win comes from though. Is what's going on
that the external sort only exceeded memory by a small amount so nearly
all the tuples are still in memory?

--
greg

#6 Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Stark (#5)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On 30 July 2015 at 12:26, Greg Stark <stark@mit.edu> wrote:

On Thu, Jul 30, 2015 at 12:09 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

True, you need a heap to hold the next tuple from each tape in the merge
step. To avoid exceeding work_mem, you'd need to push some tuples from the
in-memory array to the tape to make room for that. In practice, though, the
memory needed for the merge step's heap is tiny. Even if you merge 1000
tapes, you only need memory for 1000 tuples in the heap. But yeah, you'll
need some logic to share/divide the in-memory array between the heap and
the "in-memory tail" of the last tape.

It's a bit worse than that because we buffer up a significant chunk of the
tape to avoid randomly seeking between tapes after every tuple read. But I
think in today's era of large memory we don't need anywhere near the entire
work_mem just to buffer to avoid random access. Something simple like a
fixed buffer size per tape probably much less than 1MB/tape.

MERGE_BUFFER_SIZE is currently 0.25 MB, but benefit was seen above that
level. I'd say we should scale that up to 1 MB if memory allows.

Yes, that could be tiny for a small number of runs. I mention it because
Heikki's comment that we could use this in the general case would not be
true for larger numbers of runs. Number of runs decreases quickly with more
memory anyway, so the technique is mostly good for larger memory but
certainly not for small memory allocations.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#1)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Wed, Jul 29, 2015 at 9:05 PM, Peter Geoghegan <pg@heroku.com> wrote:

The behavior of external sorts that do not require any merge step due
to only having one run (what EXPLAIN ANALYZE output shows as an
"external sort", and not a "merge sort") seems like an area that can
be significantly improved upon. As noted in code comments, this
optimization did not appear in The Art of Computer Programming, Volume
III. It's not an unreasonable idea, but it doesn't work well on modern
machines due to using heapsort, which is known to use the cache
ineffectively. It also often implies significant additional I/O for
little or no benefit. I suspect that all the reports we've heard of
smaller work_mem sizes improving sort performance are actually down to
this one-run optimization *hurting* performance.

Very interesting. And great performance numbers. Thanks for taking
the time to investigate this - really cool.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#8 Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#2)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Thu, Jul 30, 2015 at 12:00 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Hmm. You don't really need to merge the in-memory array into the tape, as
you know that all the tuples in the in-memory array must come after the tuples
already on the tape. You can just return all the tuples from the tape first,
and then all the tuples from the array.

It's more complicated than it appears, I think. Tuples may be variable
sized. WRITETUP() performs a pfree(), and gives us back a variable
amount of availMem. What if we dumped a single, massive, outlier tuple
out when a caller passes it and it goes to the root of the heap? We'd
dump that massive tuple in one go (this would be an incremental
dumptuples() call, which we still do in the patch), making things
!LACKMEM() again, but by an usually comfortable margin. We read in a
few more regular tuples, but we're done consuming tuples before things
ever get LACKMEM() again (no more dumping needed, at least with this
patch applied).

What prevents the tuple at the top of the in-memory heap at the point
of tuplesort_performsort() (say, one of the ones added to the heap as
our glut of memory was *partially* consumed) being less than the
last/greatest tuple on tape? If the answer is "nothing", a merge step
is clearly required. For example, if the last tuple dumped to tape
sorts as 100 and a tuple sorting as 50 arrives afterwards, that tuple
is assigned to the next run but stays in memory, so the quicksorted
array ends up holding a tuple less than the tape's maximum.

This is not a problem when every single tuple is dumped, but that
doesn't happen anymore. I probably should have shown more tests that
exercised HeapTuple sorts (not just datum tuple sorts). I agree that
things at least usually happen as you describe, FWIW.

I think it'd make sense to structure the code differently, to match the way
I described this optimization above. Instead of adding a new tuplesort state
for this, abstract this in the logical tape code. Add a function to attach
an in-memory "tail" to a tape, and have LogicalTapeRead() read from the tail
after reading the on-disk tape. The rest of the code wouldn't need to care
that sometimes part of the tape is actually in memory.

I'll need to think about all of that. Certainly, quicksorting runs in
a more general way seems like a very promising idea, and I know that
this patch does not go far enough. But it also adds this idea of not
dumping out most tuples, which seems impossible to generalize further,
so maybe it's a useful special case to start from for that reason (and
also because it's where the pain is currently felt most often).

+        * Note that there might actually be 2 runs, but only the
+        * contents of one of them went to tape, and so we can
+        * safely "pretend" that there is only 1 run (since we're
+        * about to give up on the idea of the memtuples array being
+        * a heap).  This means that if our sort happened to require
+        * random access, the similar "single run" optimization
+        * below (which sets TSS_SORTEDONTAPE) might not be used at
+        * all.  This is because dumping all tuples out might have
+        * forced an otherwise equivalent randomAccess case to
+        * acknowledge a second run, which we can avoid.

Is that really true? We don't start a second run until we have to, i.e. when
it's time to dump the first tuple of the second run to tape. So I don't
think the case you describe above, where you have two runs but only one of
them has tuples on disk, can actually happen.

I think we're talking about two slightly different things. I agree
that I am avoiding "starting" a second run because I am avoiding
dumping tuples, just as you say (I describe this as avoiding
"acknowledging" a second run). But there could still be SortTuples
that have a tupindex that is > 0 (they could be 1, to be specific).
It's pretty clear from looking at the TSS_BUILDRUNS case within
puttuple_common() that this is true. So, if instead you define
"starting" a run as adding a SortTuple with a tupindex that is > 0,
then yes, this comment is true.

The important thing is that since we're not dumping every tuple, it
doesn't matter whether or not the TSS_BUILDRUNS case within
puttuple_common() ever took the "currentRun + 1" insertion path (which
can easily happen), provided things aren't so skewed that it ends up
on tape even without dumping all tuples (which seems much less
likely). As I've said, this optimization will occur a lot more often
than the existing one run optimization (assuming !randomAccess), as a
nice side benefit of not dumping every tuple. Quicksort does not use
HEAPCOMPARE(), so clearly having multiple runs in that "subrun" is a
non-issue.
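
(For reference, the heap comparator makes the run number the leading
sort key, so run 1 tuples sink below all run 0 tuples; paraphrased from
tuplesort.c:)

/* While building runs (checkIndex), order by run number first; a real
 * comparison is only needed between tuples of the same run. */
#define HEAPCOMPARE(tup1,tup2) \
	(checkIndex && ((tup1)->tupindex != (tup2)->tupindex) ? \
	 ((tup1)->tupindex) - ((tup2)->tupindex) : \
	 COMPARETUP(state, tup1, tup2))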

Whether or not we say that a second run "started", or that there was
merely the "unfulfilled intent to start a new, second run" is just
semantics. While I certainly care about semantics, my point is that we
agree that this useful "pretend there is only one run" thing happens
(I think). The existing one run optimization only really occurs when
the range of values in the set of tuples is well characterized by the
tuple values observed during initial heapification, which is bad. Or
would be bad, if the existing optimization was good. :-)

Performance
==========

Impressive!

Predictability
==========

Even more impressive!

Thanks!

As an extra optimization, you could delay quicksorting the in-memory array
until it's time to read the first tuple from it. If the caller reads only
the top-N tuples from the sort for some reason (other than LIMIT, which we
already optimize for), that could avoid a lot of work.

Won't comment on that yet, since it's predicated on the merge step
being unnecessary. I need to think about this some more.

--
Peter Geoghegan


#9Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#7)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Thu, Jul 30, 2015 at 11:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Very interesting. And great performance numbers. Thanks for taking
the time to investigate this - really cool.

Thanks.

--
Peter Geoghegan


#10Peter Geoghegan
pg@heroku.com
In reply to: Greg Stark (#5)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Thu, Jul 30, 2015 at 4:26 AM, Greg Stark <stark@mit.edu> wrote:

I'm a bit confused where the big win comes from though. Is what's going on
that the external sort only exceeded memory by a small amount so nearly all
the tuples are still in memory?

Yes, that's why this can be much faster just as the work_mem threshold
is crossed. You get an "almost internal" sort, which means you can
mostly quicksort, and you can avoid dumping most tuples. It's still a
pretty nice win when less than half of tuples fit in memory, though --
just not as nice. Below that, the optimization isn't used.

--
Peter Geoghegan


#11Peter Geoghegan
pg@heroku.com
In reply to: Simon Riggs (#3)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Thu, Jul 30, 2015 at 3:47 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

This is a good optimization for the common case where tuples are mostly
already in order. We could increase the usefulness of this by making UPDATE
pick blocks that are close to a tuple's original block, rather than putting
them near the end of a relation.

Not sure what you mean here.

So here's a shorter/different explanation of this optimization: When it's
time to perform the sort, instead of draining the in-memory heap one tuple
at a time to the last tape, you sort the heap with quicksort, and pretend
that the sorted heap belongs to the last tape, after all the other tuples in
the tape.

Some questions/thoughts on that:

Isn't that optimization applicable even when you have multiple runs?
Quicksorting the heap and keeping it as an array in memory is surely always
faster than heapsorting and pushing it to the tape.

It's about use of memory. If you have multiple runs on tape, then they will
need to be merged and you need memory to do that efficiently. If there are
tuples in the last batch still in memory then it can work, but it depends
upon how full memory is from the last batch and how many batches there are.

I agree that this optimization has a lot to do with exploiting the
fact that you don't need to free the memtuples array for future runs
because you've already received all tuples (or keep space free for
previous runs). I think that we should still use quicksort on runs
where this optimization doesn't work out, but I also still think that
that's a different idea. Doing what I've proposed here when there are
multiple runs seems less valuable, if only because you're not going to
avoid that much writing.

--
Peter Geoghegan


#12Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#8)
2 attachment(s)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Thu, Jul 30, 2015 at 4:01 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Jul 30, 2015 at 12:00 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Hmm. You don't really need to merge the in-memory array into the tape, as
you know that all the tuples in the in-memory must come after the tuples
already on the tape. You can just return all the tuples from the tape first,
and then all the tuples from the array.

It's more complicated than it appears, I think. Tuples may be variable
sized. WRITETUP() performs a pfree(), and gives us back a variable
amount of availMem. What if we dumped a single, massive, outlier tuple
out when a caller passes it and it goes to the root of the heap? We'd
dump that massive tuple in one go (this would be an incremental
dumptuples() call, which we still do in the patch), making things
!LACKMEM() again, but by an unusually comfortable margin. We read in a
few more regular tuples, but we're done consuming tuples before things
ever get LACKMEM() again (no more dumping needed, at least with this
patch applied).

What prevents the tuple at the top of the in-memory heap at the point
of tuplesort_performsort() (say, one of the ones added to the heap as
our glut of memory was *partially* consumed) being less than the
last/greatest tuple on tape? If the answer is "nothing", a merge step
is clearly required.

It's simple to prove this with the attached rough patch, intended to
be applied on top of Postgres with my patch. It hacks
tuplesort_gettuple_common() to always return tape tuples first, before
returning memtuples only when tape tuples have been totally exhausted.
If you run my cursory regression test suite with this, you'll see
serious regressions. I also attach a regression test diff file from my
development system, to save you the trouble of trying this yourself.
Note how the "count(distinct(s))" numbers get closer to being correct
(lower) as increases in work_mem make tuplesort approach an internal
sort.

It's possible that we can get away with something cheaper than a merge
step, but my impression right now is that it isn't terribly expensive.
OTOH, if we can make this work with the randomAccess case by being
more clever about merging, that could be worthwhile.
--
Peter Geoghegan

Attachments:

no_merge_step_problematic.patch (text/x-patch)
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 5371f4f..ce79347 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2022,7 +2022,6 @@ just_memtuples:
 			 */
 			Assert(state->current < state->memtupcount);
 
-			if (COMPARETUP(state, stup, &state->memtuples[state->current]) <= 0)
 			{
 				/*
 				 * Tape tuple less than or equal to memtuple array current
@@ -2031,18 +2030,6 @@ just_memtuples:
 				state->cached = false;
 				*should_free = true;
 			}
-			else
-			{
-				/*
-				 * Tape tuple greater than memtuple array's current tuple.
-				 *
-				 * Return current memtuple tuple, and cache tape tuple for
-				 * next call, where it may be returned.
-				 */
-				state->cached = true;
-				*should_free = false;
-				*stup = state->memtuples[state->current++];
-			}
 			return true;
 
 		case TSS_FINALMERGE:
regression.diffs (application/octet-stream)
*** /home/pg/postgresql/src/test/regress/expected/sorting.out	2015-07-30 23:31:59.232961039 -0700
--- /home/pg/postgresql/src/test/regress/results/sorting.out	2015-07-30 23:43:17.776936479 -0700
***************
*** 47,143 ****
  
  set work_mem = '100MB';
  select count(distinct(s)) from int4_sort_tbl;
!  count  
! --------
!  999949
  (1 row)
  
  select count(distinct(s)) from highcard_int4_sort_tbl;
    count  
  ---------
!  9515397
  (1 row)
  
  select count(distinct(s)) from lowcard_int4_sort_tbl;
   count 
  -------
!    101
  (1 row)
  
  set work_mem = '110MB';
  select count(distinct(s)) from int4_sort_tbl;
!  count  
! --------
!  999949
  (1 row)
  
  select count(distinct(s)) from highcard_int4_sort_tbl;
    count  
  ---------
!  9515397
  (1 row)
  
  select count(distinct(s)) from lowcard_int4_sort_tbl;
   count 
  -------
!    101
  (1 row)
  
  set work_mem = '128MB';
  select count(distinct(s)) from int4_sort_tbl;
!  count  
! --------
!  999949
  (1 row)
  
  select count(distinct(s)) from highcard_int4_sort_tbl;
    count  
  ---------
!  9515397
  (1 row)
  
  select count(distinct(s)) from lowcard_int4_sort_tbl;
   count 
  -------
!    101
  (1 row)
  
  set work_mem = '140MB';
  select count(distinct(s)) from int4_sort_tbl;
!  count  
! --------
!  999949
  (1 row)
  
  select count(distinct(s)) from highcard_int4_sort_tbl;
    count  
  ---------
!  9515397
  (1 row)
  
  select count(distinct(s)) from lowcard_int4_sort_tbl;
   count 
  -------
!    101
  (1 row)
  
  set work_mem = '150MB';
  select count(distinct(s)) from int4_sort_tbl;
!  count  
! --------
!  999949
  (1 row)
  
  select count(distinct(s)) from highcard_int4_sort_tbl;
    count  
  ---------
!  9515397
  (1 row)
  
  select count(distinct(s)) from lowcard_int4_sort_tbl;
   count 
  -------
!    101
  (1 row)
  
  -- should be in-memory:
--- 47,143 ----
  
  set work_mem = '100MB';
  select count(distinct(s)) from int4_sort_tbl;
!   count  
! ---------
!  1713686
  (1 row)
  
  select count(distinct(s)) from highcard_int4_sort_tbl;
    count  
  ---------
!  9666796
  (1 row)
  
  select count(distinct(s)) from lowcard_int4_sort_tbl;
   count 
  -------
!    184
  (1 row)
  
  set work_mem = '110MB';
  select count(distinct(s)) from int4_sort_tbl;
!   count  
! ---------
!  1619700
  (1 row)
  
  select count(distinct(s)) from highcard_int4_sort_tbl;
    count  
  ---------
!  9644304
  (1 row)
  
  select count(distinct(s)) from lowcard_int4_sort_tbl;
   count 
  -------
!    175
  (1 row)
  
  set work_mem = '128MB';
  select count(distinct(s)) from int4_sort_tbl;
!   count  
! ---------
!  1469829
  (1 row)
  
  select count(distinct(s)) from highcard_int4_sort_tbl;
    count  
  ---------
!  9608055
  (1 row)
  
  select count(distinct(s)) from lowcard_int4_sort_tbl;
   count 
  -------
!    160
  (1 row)
  
  set work_mem = '140MB';
  select count(distinct(s)) from int4_sort_tbl;
!   count  
! ---------
!  1382216
  (1 row)
  
  select count(distinct(s)) from highcard_int4_sort_tbl;
    count  
  ---------
!  9587591
  (1 row)
  
  select count(distinct(s)) from lowcard_int4_sort_tbl;
   count 
  -------
!    151
  (1 row)
  
  set work_mem = '150MB';
  select count(distinct(s)) from int4_sort_tbl;
!   count  
! ---------
!  1315037
  (1 row)
  
  select count(distinct(s)) from highcard_int4_sort_tbl;
    count  
  ---------
!  9572239
  (1 row)
  
  select count(distinct(s)) from lowcard_int4_sort_tbl;
   count 
  -------
!    144
  (1 row)
  
  -- should be in-memory:
***************
*** 185,191 ****
  select count(distinct(s)) from text_sort_tbl;
   count  
  --------
!  100001
  (1 row)
  
  -- Uses optimization where it's favorable:
--- 185,191 ----
  select count(distinct(s)) from text_sort_tbl;
   count  
  --------
!  198926
  (1 row)
  
  -- Uses optimization where it's favorable:
***************
*** 193,199 ****
  select count(distinct(s)) from text_sort_tbl;
   count  
  --------
!  100001
  (1 row)
  
  -- internal sort:
--- 193,199 ----
  select count(distinct(s)) from text_sort_tbl;
   count  
  --------
!  114913
  (1 row)
  
  -- internal sort:

======================================================================

#13Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#8)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On 07/31/2015 02:01 AM, Peter Geoghegan wrote:

What prevents the tuple at the top of the in-memory heap at the point
of tuplesort_performsort() (say, one of the ones added to the heap as
our glut of memory was*partially* consumed) being less than the
last/greatest tuple on tape? If the answer is "nothing", a merge step
is clearly required.

Oh, ok, I was confused on how the heap works. You could still abstract
this as "in-memory tails" of the tapes, but it's more complicated than I
thought at first:

When it's time to drain the heap, in performsort, divide the array into
two arrays, based on the run number of each tuple, and then quicksort
the arrays separately. The first array becomes the in-memory tail of the
current tape, and the second array becomes the in-memory tail of the
next tape.

You wouldn't want to actually allocate two arrays and copy SortTuples
around, but keep using the single large array, just logically divided
into two. So the bookkeeping isn't trivial, but seems doable.
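
(A minimal sketch of that bookkeeping, assuming SortTuple.tupindex
holds each tuple's run number, with qsort_tuples() standing in for the
real comparator-specific quicksort:)

/* Sketch only: partition memtuples in place by run number, then
 * quicksort each part separately. */
static void
split_and_sort(SortTuple *memtuples, int memtupcount, int currentRun)
{
	int			split = 0;
	int			i;

	for (i = 0; i < memtupcount; i++)
	{
		if (memtuples[i].tupindex == currentRun)
		{
			SortTuple	swap = memtuples[split];

			memtuples[split] = memtuples[i];
			memtuples[i] = swap;
			split++;
		}
	}

	/* order within each part is arbitrary here; both get quicksorted */
	qsort_tuples(memtuples, split);				/* tail of current tape */
	qsort_tuples(memtuples + split,
				 memtupcount - split);			/* tail of next tape */
}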

Hmm, I can see another possible optimization here, in the way the heap
is managed in TSS_BUILDRUNS state. Instead of keeping a single heap,
with tupindex as the leading key, it would be more cache efficient to
keep one heap for the currentRun, and an unsorted array of tuples
belonging to currentRun + 1. When the heap becomes empty, and currentRun
is incremented, quicksort the unsorted array to become the new heap.

That's a completely separate idea from your patch, although if you did
it that way, you wouldn't need the extra step to divide the large array
into two, as you'd maintain that division all the time.

- Heikki


#14Jeremy Harris
jgh@wizmail.org
In reply to: Peter Geoghegan (#1)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On 30/07/15 02:05, Peter Geoghegan wrote:

Since heapification is now a big fraction of the total cost of a sort
sometimes, even where the heap invariant need not be maintained for
any length of time afterwards, it might be worth revisiting the patch
to make that an O(n) rather than a O(log n) operation [3].

O(n log n) ?

Heapification is O(n) already, whether siftup (existing) or down.

It might be worthwhile comparing actual times with a quicksort, given
that a sorted array is trivially a well-formed heap (the reverse is not
true) and that quicksort seems to be cache-friendly. Presumably there
will be a crossover N where the constant-factor reduction from
cache-friendliness loses out to the log n penalty for doing a full
sort; below this it would be useful.

You could then declare the tape buffer to be the leading tranche of
work-mem (and dump it right away) and the heap to start with the
remainder.
--
Cheers,
Jeremy


#15Robert Haas
robertmhaas@gmail.com
In reply to: Jeremy Harris (#14)
Re: Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Fri, Jul 31, 2015 at 7:21 AM, Jeremy Harris <jgh@wizmail.org> wrote:

Heapification is O(n) already, whether siftup (existing) or down.

That's not my impression, or what Wikipedia says. Source?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#16Jeremy Harris
jgh@wizmail.org
In reply to: Robert Haas (#15)
Re: Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On 31/07/15 18:31, Robert Haas wrote:

On Fri, Jul 31, 2015 at 7:21 AM, Jeremy Harris <jgh@wizmail.org> wrote:

Heapification is O(n) already, whether siftup (existing) or down.

That's not my impression, or what Wikipedia says. Source?

Measurements done last year:

/messages/by-id/52F35462.3030306@wizmail.org
(spreadsheet attachment)

/messages/by-id/52F40CE9.1070509@wizmail.org
(measurement procedure and spreadsheet explanation)

--
Cheers,
Jeremy


#17Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#15)
Re: Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On 2015-07-31 13:31:54 -0400, Robert Haas wrote:

On Fri, Jul 31, 2015 at 7:21 AM, Jeremy Harris <jgh@wizmail.org> wrote:

Heapification is O(n) already, whether siftup (existing) or down.

That's not my impression, or what Wikipedia says. Source?

Building a binary heap via successive insertions is O(n log n), but
building it directly is O(n). Looks like Wikipedia agrees too:
https://en.wikipedia.org/wiki/Binary_heap#Building_a_heap
I'm pretty sure that there's a bunch of places where we intentionally
build a heap at once instead of successively. At least reorderbuffer.c does
so, and it looks like nodeMergeAppend as well (that's why they use
binaryheap_add_unordered and then binaryheap_build).
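
(The pattern in question, roughly as nodeMergeAppend.c uses it -- a
sketch, not quoted code:)

#include "lib/binaryheap.h"

/* Insert everything unordered, then establish the heap invariant with
 * a single O(n) binaryheap_build() pass, instead of paying O(log n)
 * per element for successive ordered insertions. */
static binaryheap *
build_heap_at_once(Datum *items, int nitems,
				   binaryheap_comparator cmp, void *arg)
{
	binaryheap *heap = binaryheap_allocate(nitems, cmp, arg);
	int			i;

	for (i = 0; i < nitems; i++)
		binaryheap_add_unordered(heap, items[i]);
	binaryheap_build(heap);
	return heap;
}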

Greetings,

Andres Freund


#18Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#13)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Fri, Jul 31, 2015 at 12:59 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Oh, ok, I was confused on how the heap works. You could still abstract this
as "in-memory tails" of the tapes, but it's more complicated than I thought
at first:

When it's time to drain the heap, in performsort, divide the array into two
arrays, based on the run number of each tuple, and then quicksort the arrays
separately. The first array becomes the in-memory tail of the current tape,
and the second array becomes the in-memory tail of the next tape.

You wouldn't want to actually allocate two arrays and copy SortTuples
around, but keep using the single large array, just logically divided into
two. So the bookkeeping isn't trivial, but seems doable.

Since you're talking about the case where we must drain all tuples
within tuplesort_performsort(), I think you're talking about a
distinct idea here (since surely you're not suggesting giving up on my
idea of avoiding dumping most tuples, which is a key strength of the
patch). That's fine, but I just want to be clear on that. Importantly,
my optimization does not care about the fact that the array may have
two runs in it, because it quicksorts the array and forgets about it
being a heap. It only cares whether or not more than one run made it
out to tape, which makes it more widely applicable than it would
otherwise be. Also, the fact that much I/O can be avoided is clearly
something that can only happen when work_mem is at least ~50% of a
work_mem setting that would have resulted in an (entirely) internal
sort.

You're talking about a new thing here, that happens when it is
necessary to dump everything and do a conventional merging of on-tape
runs. IOW, we cannot fit a significant fraction of overall tuples in
memory, and we need much of the memtuples array for the next run
(usually this ends as a TSS_FINALMERGE). That being the case
(...right?), I'm confused that you're talking about doing something
clever within tuplesort_performsort(). In the case you're targeting,
won't the vast majority of tuple dumping (through calls to
dumptuples()) occur within puttuple_common()?

I think that my optimization should probably retain its own
state.status even if we do this (i.e. TSS_MEMTAPEMERGE should stay).

Hmm, I can see another possible optimization here, in the way the heap is
managed in TSS_BUILDRUNS state. Instead of keeping a single heap, with
tupindex as the leading key, it would be more cache efficient to keep one
heap for the currentRun, and an unsorted array of tuples belonging to
currentRun + 1. When the heap becomes empty, and currentRun is incemented,
quicksort the unsorted array to become the new heap.

That's a completely separate idea from your patch, although if you did it
that way, you wouldn't need the extra step to divide the large array into
two, as you'd maintain that division all the time.

This sounds to me like a refinement of your first idea (the idea that
I just wrote about).

I think the biggest problem with tuplesort after this patch of mine is
committed is that it is still too attached to the idea of
incrementally spilling and sifting. It makes sense to some degree
where it makes my patch possible...if we hang on to the idea of
incrementally spilling tuples on to tape in sorted order for a while,
then maybe we can hang on for long enough to quicksort most tuples,
*and* to avoid actually dumping most tuples (incremental spills make
the latter possible). But when work_mem is only, say, 10% of the
setting required for a fully internal sort, then the incremental
spilling and sifting starts to look dubious, at least to me, because
the TSS_MEMTAPEMERGE optimization added by my patch could not possibly
apply, and dumping and merging many times is inevitable.

What I think you're getting at here is that we still have a heap, but
we don't use the heap to distinguish between tuples within a run. In
other words, HEAPCOMPARE() often/always only cares about run number.
We quicksort after a deferred period of time, probably just before
dumping everything. Perhaps I've misunderstood, but I don't see much
point in quicksorting a run before being sure that you're sorting as
opposed to heapifying at that point (you're not clear on what we've
decided on once we quicksort). I think it could make sense to make
HEAPCOMPARE() not care about tuples within a run that is not
currentRun, though.

I think that anything that gives up on replacement selection's ability
to generate large runs, particularly for already sorted inputs, will be
too hard a sell (not that I think that's what you proposed). That's
way, way less of an advantage than it was in the past (back when
external sorts took place using actual magnetic tape, it was huge),
but the fact remains that it is an advantage. And so, I've been
prototyping an approach where we don't heapify once it is established
that this TSS_MEMTAPEMERGE optimization of mine cannot possibly apply.
We quicksort in batches rather than heapify.

With this approach, tuples are just added arbitrarily to the memtuples
array when status is TSS_BUILDRUNS and we decided not to heapify. When
this path is first taken (when we first decide not to heapify
memtuples anymore), we quicksort and dump the first half of memtuples
in a batch. Now, when puttuple_common() once again fills the first
half of memtuples (in the style of TSS_INITIAL), we quicksort again,
and dump the first half again. Repeat until done.

This is something that you could perhaps call "batch replacement
selection". This is not anticipated to affect the finished runs one
bit as compared to tuplesort today, because we still create a new run
(state->currentRun++) any time a value in a now-sorted batch is less
than the last (greatest) on-tape value in the currentRun. I am just
batching things -- it's the same basic algorithm with minimal use of a
heap.
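
(A hand-wavy sketch of one batch step; qsort_tuples(), write_to_run()
and tapeTail are made-up names, not the actual patch:)

qsort_tuples(batch, batchcount);

/* Find the first tuple that can still extend the current run, i.e.
 * the first one >= the last (greatest) tuple already on tape. */
for (split = 0; split < batchcount; split++)
	if (COMPARETUP(state, &batch[split], &state->tapeTail) >= 0)
		break;

/* Tuples from split onward extend currentRun in sorted order... */
for (i = split; i < batchcount; i++)
	write_to_run(state, state->currentRun, &batch[i]);

/* ...while anything smaller must begin the next run. */
if (split > 0)
{
	state->currentRun++;
	for (i = 0; i < split; i++)
		write_to_run(state, state->currentRun, &batch[i]);
}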

I think this will help appreciably because we quicksort, but I'm not
sure yet. I think this may also help with introducing asynchronous I/O
(maybe we can reuse memtuples for that, with clever staggering of
stages during dumping, but that can come later). This is all fairly
hand-wavey, but I think it could help appreciably for the case where
sorts must produce runs to merge and avoiding dumping all tuples is
impossible.
--
Peter Geoghegan


#19Robert Haas
robertmhaas@gmail.com
In reply to: Jeremy Harris (#16)
Re: Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Sat, Aug 1, 2015 at 9:49 AM, Jeremy Harris <jgh@wizmail.org> wrote:

On 31/07/15 18:31, Robert Haas wrote:

On Fri, Jul 31, 2015 at 7:21 AM, Jeremy Harris <jgh@wizmail.org> wrote:

Heapification is O(n) already, whether siftup (existing) or down.

That's not my impression, or what Wikipedia says. Source?

Measurements done last year:

/messages/by-id/52F35462.3030306@wizmail.org
(spreadsheet attachment)

/messages/by-id/52F40CE9.1070509@wizmail.org
(measurement procedure and spreadsheet explanation)

I don't think that running benchmarks is the right way to establish
the asymptotic runtime of an algorithm. I mean, if you test
quicksort, it will run in much less than O(n^2) time on almost any
input. But that does not mean that the worst-case run time is
anything other than O(n^2).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#20Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#17)
Re: Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Sat, Aug 1, 2015 at 9:56 AM, Andres Freund <andres@anarazel.de> wrote:

On 2015-07-31 13:31:54 -0400, Robert Haas wrote:

On Fri, Jul 31, 2015 at 7:21 AM, Jeremy Harris <jgh@wizmail.org> wrote:

Heapification is O(n) already, whether siftup (existing) or down.

That's not my impression, or what Wikipedia says. Source?

Building a binary heap via successive insertions is O(n log n), but
building it directly is O(n). Looks like wikipedia agrees too
https://en.wikipedia.org/wiki/Binary_heap#Building_a_heap

That doesn't really address the sift-up vs. sift-down question. Maybe
I'm just confused about the terminology. I think that Wikipedia
article is saying that if you iterate from the middle element of an
unsorted array towards the beginning, establishing the heap invariant
for every item as you reach it, you will take only O(n) time. But
that is not what inittapes() does. It instead starts at the beginning
of the array and inserts each element one after the other. If this is
any different from building the heap via successive insertions, I
don't understand how.
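
(To make the terminology concrete: the O(n) construction the Wikipedia
article describes iterates from the middle of the array toward the
front, sifting down, as in this self-contained sketch on a plain int
max-heap. Inserting elements one at a time and sifting up, which is
effectively what inittapes() does, is the O(n log n) variant.)

/* Floyd's bottom-up heapify: O(n) even in the worst case. */
static void
sift_down(int *a, int n, int i)
{
	for (;;)
	{
		int			largest = i;
		int			l = 2 * i + 1;
		int			r = 2 * i + 2;
		int			swap;

		if (l < n && a[l] > a[largest])
			largest = l;
		if (r < n && a[r] > a[largest])
			largest = r;
		if (largest == i)
			break;
		swap = a[i];
		a[i] = a[largest];
		a[largest] = swap;
		i = largest;
	}
}

static void
heapify(int *a, int n)
{
	int			i;

	/* leaves are already valid heaps; start at the last internal node */
	for (i = n / 2 - 1; i >= 0; i--)
		sift_down(a, n, i);
}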

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#21Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#18)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Mon, Aug 3, 2015 at 3:33 PM, Peter Geoghegan <pg@heroku.com> wrote:

When it's time to drain the heap, in performsort, divide the array into two
arrays, based on the run number of each tuple, and then quicksort the arrays
separately. The first array becomes the in-memory tail of the current tape,
and the second array becomes the in-memory tail of the next tape.

You wouldn't want to actually allocate two arrays and copy SortTuples
around, but keep using the single large array, just logically divided into
two. So the bookkeeping isn't trivial, but seems doable.

You're talking about a new thing here, that happens when it is
necessary to dump everything and do a conventional merging of on-tape
runs. IOW, we cannot fit a significant fraction of overall tuples in
memory, and we need much of the memtuples array for the next run
(usually this ends as a TSS_FINALMERGE). That being the case
(...right?),

I don't think that's what Heikki is talking about. He can correct me
if I'm wrong, but what I think he's saying is that we should try to
exploit the fact that we've already determined which in-memory tuples
can be part of the current run and which in-memory tuples must become
part of the next run. Suppose half the tuples in memory can become
part of the current run and the other half must become part of the
next run. If we throw all of those tuples into a single bucket and
quicksort it, we're losing the benefit of the comparisons we did to
figure out which tuples go in which runs. Instead, we could quicksort
the current-run tuples and, separately, quick-sort the next-run
tuples. Ignore the current-run tuples completely until the tape is
empty, and then merge them with any next-run tuples that remain.

I'm not sure if there's any reason to believe that would be faster
than your approach. In general, sorting is O(n lg n) so sorting two
arrays that are each half as large figures to be slightly faster than
sorting one big array. But the difference may not amount to much.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#22Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#21)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Mon, Aug 3, 2015 at 1:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't think that's what Heikki is talking about. He can correct me
if I'm wrong, but what I think he's saying is that we should try to
exploit the fact that we've already determined which in-memory tuples
can be part of the current run and which in-memory tuples must become
part of the next run. Suppose half the tuples in memory can become
part of the current run and the other half must become part of the
next run. If we throw all of those tuples into a single bucket and
quicksort it, we're losing the benefit of the comparisons we did to
figure out which tuples go in which runs. Instead, we could quicksort
the current-run tuples and, separately, quick-sort the next-run
tuples. Ignore the current-run tuples completely until the tape is
empty, and then merge them with any next-run tuples that remain.

Oh. Well, the benefit of "the comparisons we did to figure out which
tuples go in which runs" is that we can determine the applicability of
this optimization. Also, by keeping run 1 (if any) usually in memory,
and run 0 partially on disk we avoid having to worry about run 1 as a
thing that spoils the optimization (in the current "single run
optimization", dumping all tuples can make us "acknowledge" run 1
(i.e. currentRun++), preventing single run optimization, which we
handily avoid in the patch). Finally, it saves us a bunch of real
COMPARETUP() comparisons as HEAPCOMPARE() is called as tuples are
inserted into the still-heapified memtuples array.

I'm not sure if there's any reason to believe that would be faster
than your approach. In general, sorting is O(n lg n) so sorting two
arrays that are each half as large figures to be slightly faster than
sorting one big array. But the difference may not amount to much.

IMV, the smart way of avoiding wasting "the comparisons we did to
figure out which tuples go in which runs" is to rig HEAPCOMPARE() to
only do a COMPARETUP() for the currentRun, and make sure that we don't
mess up and forget that if we don't end up quicksorting.

The second run that is in memory can only consist of whatever tuples
were added after heapification that were less than any of those
previously observed tuples (a big majority, usually). So like you, I
can't see any of these techniques helping much, even my "smart"
technique. Maybe I should look at a case involving text or something
to be sure.

Thinking about it some more, I don't think it would be easy to
maintain a clear separation between run 0 and run 1 in the memtuples
array in terms of a cutoff point. It's still a heap at that stage, of
course. You'd have to rig each tuple comparator so that COMPARETUP()
cared about tupindex before comparing datum1 just for this, which
seems rather unappealing.

--
Peter Geoghegan


#23Jeff Janes
jeff.janes@gmail.com
In reply to: Jeremy Harris (#14)
Re: Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Jul 31, 2015 4:22 AM, "Jeremy Harris" <jgh@wizmail.org> wrote:

On 30/07/15 02:05, Peter Geoghegan wrote:

Since heapification is now a big fraction of the total cost of a sort
sometimes, even where the heap invariant need not be maintained for
any length of time afterwards, it might be worth revisiting the patch
to make that an O(n) rather than a O(log n) operation [3].

O(n log n) ?

Heapification is O(n) already, whether siftup (existing) or down.

They are both linear on average, but the way we currently do it has an
O(n log n) worst case, while the other way is linear even in the worst case.

Cheers, Jeff

#24Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Robert Haas (#21)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On 08/03/2015 11:36 PM, Robert Haas wrote:

On Mon, Aug 3, 2015 at 3:33 PM, Peter Geoghegan <pg@heroku.com> wrote:

When it's time to drain the heap, in performsort, divide the array into two
arrays, based on the run number of each tuple, and then quicksort the arrays
separately. The first array becomes the in-memory tail of the current tape,
and the second array becomes the in-memory tail of the next tape.

You wouldn't want to actually allocate two arrays and copy SortTuples
around, but keep using the single large array, just logically divided into
two. So the bookkeeping isn't trivial, but seems doable.

You're talking about a new thing here, that happens when it is
necessary to dump everything and do a conventional merging of on-tape
runs. IOW, we cannot fit a significant fraction of overall tuples in
memory, and we need much of the memtuples array for the next run
(usually this ends as a TSS_FINALMERGE). That being the case
(...right?),

I don't think that's what Heikki is talking about. He can correct me
if I'm wrong, but what I think he's saying is that we should try to
exploit the fact that we've already determined which in-memory tuples
can be part of the current run and which in-memory tuples must become
part of the next run. Suppose half the tuples in memory can become
part of the current run and the other half must become part of the
next run. If we throw all of those tuples into a single bucket and
quicksort it, we're losing the benefit of the comparisons we did to
figure out which tuples go in which runs. Instead, we could quicksort
the current-run tuples and, separately, quick-sort the next-run
tuples. Ignore the current-run tuples completely until the tape is
empty, and then merge them with any next-run tuples that remain.

Yeah, something like that. To paraphrase, if I'm now understanding it
correctly, Peter's idea is:

When all the tuples have been fed to tuplesort, and it's time to perform
the sort, quicksort all the tuples currently in the heap, ignoring the
run numbers, and turn the resulting array into another tape. That tape
is special: it's actually stored completely in memory. It is merged with
the "real" tapes when tuples are returned from the tuplesort, just like
regular tapes in TSS_FINALMERGE.

And my idea is:

When all the tuples have been fed to tuplesort, and it's time to perform
the sort, take all the tuples in the heap belonging to currentRun,
quicksort them, and make them part of the current tape. They're not just
pushed to the tape as usual, however, but attached as in-memory tail of
the current tape. The logical tape abstraction will return them after
all the tuples already in the tape, as if they were pushed to the tape
as usual. Then take all the remaining tuples in the heap (if any),
belonging to next tape, and do the same for them. They become an
in-memory tail of the next tape.

I'm not sure if there's any reason to believe that would be faster
than your approach. In general, sorting is O(n lg n) so sorting two
arrays that are each half as large figures to be slightly faster than
sorting one big array. But the difference may not amount to much.

Yeah, I don't think there's a big performance difference between the two
approaches. I'm not wedded to either approach. Whichever approach we
use, my main point was that it would be better to handle this in the
logical tape abstraction. In my approach, you would have those
"in-memory tails" attached to the last two tapes. In Peter's approach,
you would have one tape that's completely in memory, backed by the
array. In either case, the tapes would look like normal tapes to most of
tuplesort.c. There would be no extra TSS state, it would be
TSS_SORTEDONTAPE or TSS_FINALMERGE as usual.

The logical tape abstraction is currently too low-level for that. It's
just a tape of bytes, and tuplesort.c determines where a tuple begins
and ends. That needs to be changed so that the logical tape abstraction
works tuple-at-a-time instead. For example, instead of
LogicalTapeRead(N) which reads N bytes, you would have
LogicalTapeReadTuple(), which reads next tuple, and returns its length
and the tuple itself. But that would be quite sensible anyway.
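
(Purely illustrative signatures for what that might look like; neither
of the tuple-aware functions exists in logtape.c:)

/* today's byte-oriented interface (actual signature): */
size_t		LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
							void *ptr, size_t size);

/* proposed (hypothetical): the tape tracks tuple boundaries, and can
 * transparently serve tuples from an attached in-memory tail once the
 * on-disk portion is exhausted */
bool		LogicalTapeReadTuple(LogicalTapeSet *lts, int tapenum,
								 SortTuple *stup);
void		LogicalTapeAttachTail(LogicalTapeSet *lts, int tapenum,
								  SortTuple *tail, int ntuples);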

- Heikki


#25Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#24)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Tue, Aug 4, 2015 at 1:24 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Yeah, something like that. To paraphrase, if I'm now understanding it
correctly, Peter's idea is:

When all the tuples have been fed to tuplesort, and it's time to perform the
sort, quicksort all the tuples currently in the heap, ignoring the run
numbers, and turn the resulting array into another tape. That tape is
special: it's actually stored completely in memory. It is merged with the
"real" tapes when tuples are returned from the tuplesort, just like regular
tapes in TSS_FINALMERGE.

Yeah. I imagine that we'll want to add memory prefetch hints for the
new case, since I've independently shown that that works well for the
in-memory case, which this can be very close to.
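
(Something along these lines, say with the GCC/Clang builtin -- a
sketch; PREFETCH_DISTANCE is made up, and the actual patch's mechanism
may differ:)

#define PREFETCH_DISTANCE	4

/* While returning memtuples[current], hint that a tuple a few
 * positions ahead will be needed soon. */
if (state->current + PREFETCH_DISTANCE < state->memtupcount)
	__builtin_prefetch(state->memtuples[state->current +
										PREFETCH_DISTANCE].tuple,
					   0, 3);	/* read access, high temporal locality */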

My next patch will also include quicksorting of runs after we give up
on heapification (after there is more than one run and it is
established that we cannot use my "quicksort with spillover"
optimization, so there are two or more "real" runs on tape). Once
there is clearly not going to be one huge run (which can happen due to
everything largely being in order, even when work_mem is small), and
once incremental spilling does not end in time to do a "quicksort
with spillover", then the replacement selection thing isn't too
valuable. Especially with large memory sizes but memory bandwidth +
latency as a bottleneck, which is the norm these days.

This seems simpler than my earlier idea of reusing half the memtuples
array only, and resorting the entire array each time, to have
something that consistently approximates replacement selection but
with quicksorting + batching, which I discussed before.

I have this working, and it takes a good chunk of the runtime
off a sort that merges 3 runs, in one reasonable case tested where
work_mem was 300MB. It went from about 56.6 seconds with master to
35.8 seconds with this new approach when tested just now (this
approach saves no writing of tuples, so it's not as effective as the
original "quicksort with spillover" patch can be, but covers a
fundamentally different case). I just need to clean up the patch, and
see if I missed any further optimizations, but this feels like the way
forward multi-run wise. I think it's worth giving up on replacement
selection style runs after the first run is produced, because that's
where the benefits are, if anywhere.

--
Peter Geoghegan


#26Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#24)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Tue, Aug 4, 2015 at 1:24 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Yeah, I don't think there's a big performance difference between the two
approaches. I'm not wedded to either approach. Whichever approach we use, my
main point was that it would be better to handle this in the logical tape
abstraction. In my approach, you would have those "in-memory tails" attached
to the last two tapes. In Peter's approach, you would have one tape that's
completely in memory, backed by the array. In either case, the tapes would
look like normal tapes to most of tuplesort.c. There would be no extra TSS
state, it would be TSS_SORTEDONTAPE or TSS_FINALMERGE as usual.

TBH, it's not clear to me why making the logical tape abstraction
manage some fraction of memtuples has any advantage at all. The only
case that I can imagine where this could be useful is where a logical
tape does asynchronous I/O, and has the benefit of some portion of
memtuples (probably half) as scratch space (most I/O probably occurs
while tuplesort quicksorts the other half of memtuples). But that has
nothing to do with building a better abstraction.

The more expansive patch that I'm currently working on, that covers
every external sorting case -- the patch to *also* quicksort runs past
the first run, regardless of whether or not we can get away with a
"quicksort with spillover" -- is going very well. I haven't had a
solid block of time to work on it this week due to other commitments,
but it isn't very complicated. Performance benefits are considerable
even without saving any I/O. It can be as much as twice as fast for
"merge sort" sorts in some cases. So not quite as nice as "quicksort
with spillover", but still a significant improvement considering
writing everything out is inevitable for the cases helped.

As I said before, the upcoming patch has tuplesort give up on
memtuples *ever* being a heap after the first run, whatever happens. I
just quicksort and dump in batches past the first run. Since I give up
on replacement selection sort only after the first run, I still have
most of the upside of replacement selection, but little of the big
downside of heap maintenance. This will result in smaller runs on
average past the first run. I can give you at least 3 very strong
arguments for why this is totally worth it in every case, but I'll
wait till I'm asked for them. :-)

One useful saving made in this upcoming multi-tape-run patch is that
it never treats any non-current run as part of the heap beyond its run
number, even when currentRun is the first (run 0). So no comparisons
occur beyond the first run to maintain the heap invariant *even when
the first run is current* -- tuples are simply appended that belong
to the second run (we only do an extra comparison to determine that
that's the run they belong in). So the second run (run 1) is not
trusted to be heapified by dumptuples(), and is quicksorted (either by
"quicksort with spillover", along with much of the first run, or on
its own, when there are multiple conventional on-tape runs; it doesn't
matter which way it is quicksorted). From there on, every run is
quicksorted when memtuples fills, and written out entirely in memtuple
sized batches.

The logical tape abstraction is currently too low-level for that. It's just
a tape of bytes, and tuplesort.c determines where a tuple begins and ends.
That needs to be changed so that the logical tape abstraction works
tuple-at-a-time instead. For example, instead of LogicalTapeRead(N) which
reads N bytes, you would have LogicalTapeReadTuple(), which reads next
tuple, and returns its length and the tuple itself. But that would be quite
sensible anyway.

Why would it be sensible? I honestly wonder why you want to do things
that way. What is the advantage of not having what I call the
in-memory "subrun" managed by a logical tape? It's already nothing
like any other type of run in several ways. Aside from being all
in-memory, it is often much larger. It's special in that it kind of
rejects the preliminary determination that some tuples within
memtuples need to be in a second, traditional, on-tape run (because we
can just quicksort everything and merge with the existing single
on-tape run). Also, we now need tuplesort_gettuple_common() to ask
READTUP() what to tell its caller about freeing memory that is
allocated within tuplesort.c directly.

The memtuples array is already treated as an array, a heap, the head
of each tape that is merged, and maybe one other thing that I forgot
about offhand. The field SortTuple.tupindex has a total of 3
context-dependent meanings. Playing these kind of games with the
memtuples array is very much something that happens already.

More than anything else, I think that the new TSS_MEMTAPEMERGE state
is justified as a special case because "quicksort with spillover" is
legitimately a special case. Users will want to know how close they
were to an internal sort when looking at EXPLAIN ANALYZE and so on.
When cost_sort() is fixed to be a continuous function (which I think
will be pretty nice for certain other problems), the reader of that
code will want to know more about this "quicksort with spillover"
special case that can save 99% of I/O for what is still classified as
an external sort -- they will look in tuplesort.c for it, not the
logical tape code.

--
Peter Geoghegan


#27Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#13)
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

On Fri, Jul 31, 2015 at 12:59 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 07/31/2015 02:01 AM, Peter Geoghegan wrote:

What prevents the tuple at the top of the in-memory heap at the point
of tuplesort_performsort() (say, one of the ones added to the heap as
our glut of memory was*partially* consumed) being less than the
last/greatest tuple on tape? If the answer is "nothing", a merge step
is clearly required.

Oh, ok, I was confused on how the heap works.

I think I explained this badly, by referencing a secondary reason why
we must do a merge. I will now do a better job of explaining why a
merge of in-memory and on disk tuples is necessary, for the benefit of
other people (I think you get it).

The main reason why a merge step is required is that the memtuples
array will contain some tuples that were classified as belonging to a
second run. Therefore, those tuples may well be lower than the highest
on-tape tuples in terms of sort order (in fact, they may be lower than
any on-tape tuple). I cannot simply return all tape tuples followed by
all in-memory tuples to the caller, and so I must merge, and so only
!randomAccess callers may get a "quicksort with spillover". I can only
get away with **avoiding dumping all tuples** and just merging instead
because I "reject" this determination that a second *traditional/tape*
run is needed. I am therefore free of any obligation to merge this
would-be traditional second run separately.

Another way of explaining it is that I do an all-in-memory merge of
some part of the first run, and all the second run (by quicksorting).
I then merge this with the original chunk of the first run that is
sorted on tape (that was sorted by incremental spilling from the
heap).
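
(Structurally, that final merge is just the comparison that the hacked
patch upthread removed; simplified from the TSS_MEMTAPEMERGE case:)

/* Return whichever of the next tape tuple and the current memtuple
 * sorts lower; ties favor the tape tuple, which came first. */
if (COMPARETUP(state, &tape_tuple, &state->memtuples[state->current]) <= 0)
	*stup = tape_tuple;		/* fetch another tape tuple on next call */
else
	*stup = state->memtuples[state->current++];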

The next version of the patch (the patch may be split in two --
"quicksort with spillover", and "merge sort" optimization) will make
sure that any comparisons that go into maintaining the heap invariant
are not wasted on the second run, since it will always be quicksorted.
We only need to compare the second run tuples pre-quicksort in order
to determine that they belong to that run and not the current (first)
run.

Does that make sense? Have I explained that well?

--
Peter Geoghegan
