The case for removing replacement selection sort
There were a number of improvements to tuplesort.c external sort
merging made for Postgres 10. One in particular, the changes to merge
heap maintenance that went in with commit 24598337c8d, really helped
with presorted cases -- cases where there is an (inverse)
physical/logical correlation.
Replacement selection sort for run generation was not killed as part
of the improvements to external sorting in Postgres 9.6, although a
reasonably good case for doing so could have been made at the time
those enhancements went in. It survived because it was still possible
for its best case to come out ahead: the case where it manages to
produce a single run, all in one go, by exploiting presortedness. The
examples where this was possible were fairly narrow, but they existed.
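(For anyone who wants to see the mechanism in isolation, here is a toy,
standalone sketch of heap-based replacement selection run generation --
plain C with integer keys, nothing like tuplesort.c's actual code. With
an (almost) sorted stream, every incoming key is >= the key that was
just written out, so it joins the current run and the whole input comes
out as one long run.)

/*
 * Toy replacement selection run generation.  A tiny min-heap stands in
 * for workMem; each slot carries a run number, and (run, key) is the
 * heap ordering, so "next run" entries sink below "current run" entries.
 */
#include <stdio.h>

#define HEAP_SLOTS 4    /* stand-in for the workMem limit */

typedef struct
{
    int     run;
    int     key;
} Slot;

static Slot heap[HEAP_SLOTS];
static int  nslots = 0;

static int
slot_cmp(const Slot *a, const Slot *b)
{
    return (a->run != b->run) ? (a->run - b->run) : (a->key - b->key);
}

static void
sift_down(int i)
{
    for (;;)
    {
        int     l = 2 * i + 1, r = l + 1, s = i;
        Slot    tmp;

        if (l < nslots && slot_cmp(&heap[l], &heap[s]) < 0)
            s = l;
        if (r < nslots && slot_cmp(&heap[r], &heap[s]) < 0)
            s = r;
        if (s == i)
            return;
        tmp = heap[i];
        heap[i] = heap[s];
        heap[s] = tmp;
        i = s;
    }
}

int
main(void)
{
    /* an "almost sorted" example stream */
    int     input[] = {2, 1, 4, 3, 6, 5, 8, 7, 10, 9};
    int     ninput = sizeof(input) / sizeof(input[0]);
    int     i;

    /* load the first HEAP_SLOTS keys and heapify them (all run 0) */
    for (nslots = 0; nslots < HEAP_SLOTS && nslots < ninput; nslots++)
    {
        heap[nslots].run = 0;
        heap[nslots].key = input[nslots];
    }
    for (i = nslots / 2 - 1; i >= 0; i--)
        sift_down(i);

    /* steady state: write out the heap top, replace it with the next key */
    for (i = nslots; i < ninput; i++)
    {
        Slot    top = heap[0];

        printf("emit %d (run %d)\n", top.key, top.run);
        heap[0].key = input[i];
        /* a key joins the current run only if it keeps the output sorted */
        heap[0].run = (input[i] >= top.key) ? top.run : top.run + 1;
        sift_down(0);
    }

    /* end of input: drain whatever is left in the heap */
    while (nslots > 0)
    {
        printf("emit %d (run %d)\n", heap[0].key, heap[0].run);
        heap[0] = heap[--nslots];
        sift_down(0);
    }
    return 0;
}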
With the additional enhancements made to Postgres 10, I doubt that
there are any remaining cases where it wins. In addition to the point
about heap management and presortedness, you also have to consider
that TSS_SORTEDONTAPE processing is optimized for random access. This
means that the one-big-replacement-selection-run case cannot take
advantage of available memory, of Heikki's Postgres 10 improvements to
tape preloading for merging, of OS readahead, and so on. Merging is
often quite I/O bound these days, especially when it naturally
requires few comparisons, which makes TSS_SORTEDONTAPE processing
inappropriate here.
I think we should remove the replacement_sort_tuples GUC, and kill
replacement selection entirely. There is no need to do this for
Postgres 10. I don't feel very strongly about it. It just doesn't make
sense to continue to support replacement selection.
--
Peter Geoghegan
On Fri, Jul 14, 2017 at 6:20 PM, Peter Geoghegan <pg@bowt.ie> wrote:
With the additional enhancements made to Postgres 10, I doubt that
there are any remaining cases where it wins.
The thing to do about that would be to come up with some cases where
someone might plausibly think it would win and benchmark them to find
out what happens. I find it really hard to believe that sorting a
long presorted stream of tuples (or, say, 2-1-4-3-6-5-8-7-10-9 etc.)
is ever going to be as fast with any other algorithm as it is with
replacement selection.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 30, 2017 at 12:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jul 14, 2017 at 6:20 PM, Peter Geoghegan <pg@bowt.ie> wrote:
With the additional enhancements made to Postgres 10, I doubt that
there are any remaining cases where it wins.
The thing to do about that would be to come up with some cases where
someone might plausibly think it would win and benchmark them to find
out what happens. I find it really hard to believe that sorting a
long presorted stream of tuples (or, say, 2-1-4-3-6-5-8-7-10-9 etc.)
is ever going to be as fast with any other algorithm as it is with
replacement selection.
Replacement selection as implemented in Postgres is supposed to be
about the "single run, no merge" best case. This must use
TSS_SORTEDONTAPE processing, which is optimized for random access,
which is usually the wrong thing.
In general, sorting is only one cost that is involved here, and is not
the predominant cost with presorted input.
--
Peter Geoghegan
On Wed, Aug 30, 2017 at 4:18 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Aug 30, 2017 at 12:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jul 14, 2017 at 6:20 PM, Peter Geoghegan <pg@bowt.ie> wrote:
With the additional enhancements made to Postgres 10, I doubt that
there are any remaining cases where it wins.
The thing to do about that would be to come up with some cases where
someone might plausibly think it would win and benchmark them to find
out what happens. I find it really hard to believe that sorting a
long presorted stream of tuples (or, say, 2-1-4-3-6-5-8-7-10-9 etc.)
is ever going to be as fast with any other algorithm as it is with
replacement selection.
Replacement selection as implemented in Postgres is supposed to be
about the "single run, no merge" best case. This must use
TSS_SORTEDONTAPE processing, which is optimized for random access,
which is usually the wrong thing.
In general, sorting is only one cost that is involved here, and is not
the predominant cost with presorted input.
That may all be true, but my point is that if it wins in some cases,
we should keep it -- and proving it no longer wins in those cases will
require running tests.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 30, 2017 at 3:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
That may all be true, but my point is that if it wins in some cases,
we should keep it -- and proving it no longer wins in those cases will
require running tests.
That's not hard. On my laptop:
$ pgbench -i -s 100
...
postgres=# set work_mem = '2MB';
SET
postgres=# show replacement_sort_tuples ;
replacement_sort_tuples
─────────────────────────
150000
(1 row)
(30784) /postgres=# select count(distinct aid) from pgbench_accounts ;
count
────────────
10,000,000
(1 row)
Time: 4157.267 ms (00:04.157)
(30784) /postgres=# set replacement_sort_tuples = 0;
SET
(30784) /postgres=# select count(distinct aid) from pgbench_accounts ;
count
────────────
10,000,000
(1 row)
Time: 3343.789 ms (00:03.344)
This is significantly faster, in a way that's clearly reproducible and
consistent, despite the fact that we need about 10 merge passes
without replacement selection, and only have enough memory for 7
tapes. I think that I could find a case that makes replacement
selection look much worse, if I tried.
--
Peter Geoghegan
On Wed, Aug 30, 2017 at 3:14 PM, Peter Geoghegan <pg@bowt.ie> wrote:
This is significantly faster, in a way that's clearly reproducible and
consistent, despite the fact that we need about 10 merge passes
without replacement selection, and only have enough memory for 7
tapes. I think that I could find a case that makes replacement
selection look much worse, if I tried.
Just to see where we end up, I quickly wrote a patch to remove
replacement selection + replacement_sort_tuples. The LOC impact worked
out like this:
$ git diff master --stat
doc/src/sgml/config.sgml | 39 ----
src/backend/utils/init/globals.c | 1 -
src/backend/utils/misc/guc.c | 10 -
src/backend/utils/misc/postgresql.conf.sample | 1 -
src/backend/utils/sort/tuplesort.c | 403 +++++------------------------------
src/include/miscadmin.h | 1 -
src/test/regress/expected/cluster.out | 17 +-
src/test/regress/sql/cluster.sql | 14 +-
8 files changed, 52 insertions(+), 434 deletions(-)
It was nice to be able to remove comments that make certain
distinctions that replacement selection cares about. These were
tedious to read. I managed to remove several paragraphs within
tuplesort.c's header comments.
Another advantage of the changes made that I noticed in passing is
that SortTuple.tupindex is now only used while merging. It would be
quite possible to remove SortTuple.tupindex entirely, as an additional
piece of work, by providing some other indirection to get that
information when merging. From there, we could replace SortTuple.tuple
with a bitfield that has one bit for NULLness, and treats other bits
as a 63-bit or 31-bit integer. This integer would be used as an offset
into a buffer that we repalloc(), thus removing all retail palloc()s,
even during run generation. Using one big buffer for tuples would make
tuplesort.c quite a bit more memory efficient (SortTuple will only be
16 bytes, we won't waste memory on palloc() headers, and sequential
access is made more cache efficient in presorted pass-by-reference
cases).
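(To make that concrete, here is a hypothetical, standalone sketch of
what a 16-byte SortTuple plus one grow-as-needed tuple buffer could
look like. The names, field widths, and arena code below are
illustrative assumptions only, not actual tuplesort.c or PostgreSQL
definitions.)

/*
 * One NULL bit plus a 63-bit offset into a single large buffer replaces
 * the palloc()'d tuple pointer.  (uint64_t bit-fields are a common
 * compiler extension; the point is the layout, not the syntax.)
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct SortTupleSketch
{
    uint64_t    isnull1:1;      /* NULL flag for the leading sort key */
    uint64_t    tupoffset:63;   /* offset of the tuple body in the arena */
    uint64_t    datum1;         /* leading key, as in today's SortTuple */
} SortTupleSketch;              /* 16 bytes; no per-tuple palloc header */

typedef struct TupleArena
{
    char       *buf;            /* one big allocation for all tuple bodies */
    size_t      used;
    size_t      capacity;
} TupleArena;

/*
 * Copy a tuple body into the arena, growing the single buffer the way
 * repalloc() would, and return its offset.  This stands in for having
 * tuple-copying routines write into a caller-supplied buffer instead of
 * doing their own retail palloc().
 */
static uint64_t
arena_store(TupleArena *a, const void *body, size_t len)
{
    uint64_t    off = (uint64_t) a->used;

    if (a->used + len > a->capacity)
    {
        size_t      newcap = a->capacity ? a->capacity : 1024;

        while (a->used + len > newcap)
            newcap *= 2;
        a->buf = realloc(a->buf, newcap);
        a->capacity = newcap;
    }
    memcpy(a->buf + a->used, body, len);
    a->used += len;
    return off;
}

int
main(void)
{
    TupleArena  arena = {NULL, 0, 0};
    const char *body = "pretend this is a MinimalTuple";
    SortTupleSketch st;

    st.isnull1 = 0;
    st.datum1 = 42;             /* leading key value */
    st.tupoffset = arena_store(&arena, body, strlen(body) + 1);

    printf("sizeof(SortTupleSketch) = %zu\n", sizeof(SortTupleSketch));
    printf("tuple body at arena offset %llu: \"%s\"\n",
           (unsigned long long) st.tupoffset, arena.buf + st.tupoffset);

    free(arena.buf);
    return 0;
}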
I'm not planning to work on this myself. It is similar to something
that Heikki described last year, but goes a bit further by eliminating
all palloc() header overhead, while also not playing tricks with
reclaiming bits from pointers (I recall that Tom didn't like that
aspect of Heikki's proposal at all). There would be new infrastructure
required to make this work -- we'd have to be able to ask routines
like ExecCopySlotMinimalTuple() and heap_copytuple() to copy into our
own large tuple buffer, rather than doing their own palloc() on
tuplesort's behalf. But that seems like a good idea anyway.
I may submit the simple patch to remove replacement selection, if
other contributors are receptive. Apart from everything else, the
"incrementalism" of replacement selection works against cleverer batch
memory management of the type I just outlined, which seems like a
useful direction to take tuplesort in.
--
Peter Geoghegan
On Wed, Aug 30, 2017 at 6:14 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Aug 30, 2017 at 3:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
That may all be true, but my point is that if it wins in some cases,
we should keep it -- and proving it no longer wins in those cases will
require running tests.
That's not hard. On my laptop:
$ pgbench -i -s 100
...
postgres=# set work_mem = '2MB';
SET
postgres=# show replacement_sort_tuples ;
replacement_sort_tuples
─────────────────────────
150000
(1 row)
(30784) /postgres=# select count(distinct aid) from pgbench_accounts ;
count
────────────
10,000,000
(1 row)
Time: 4157.267 ms (00:04.157)
(30784) /postgres=# set replacement_sort_tuples = 0;
SET
(30784) /postgres=# select count(distinct aid) from pgbench_accounts ;
count
────────────
10,000,000
(1 row)
Time: 3343.789 ms (00:03.344)
This is significantly faster, in a way that's clearly reproducible and
consistent, despite the fact that we need about 10 merge passes
without replacement selection, and only have enough memory for 7
tapes. I think that I could find a case that makes replacement
selection look much worse, if I tried.
Wow. Just to be clear, I am looking for the BEST case for replacement
selection, not the worst case. But I would have expected that case to
be a win for replacement selection, and it clearly isn't. I can
reproduce your results here.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 30, 2017 at 5:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Wow. Just to be clear, I am looking for the BEST case for replacement
selection, not the worst case. But I would have expected that case to
be a win for replacement selection, and it clearly isn't. I can
reproduce your results here.
But I *was* trying to get a best case. That's why it isn't even worse.
That's what the docs say the best case is, after all.
This is the kind of thing that replacement selection actually did do
better with on 9.6. I clearly remember Tomas Vondra doing lots of
benchmarking, showing some benefit with RS with a work_mem of 8MB or
less. As I said in my introduction on this thread, we weren't wrong to
add replacement_sort_tuples to 9.6, given where things were with
merging at the time. But, it does very much appear to create less than
zero benefit these days. The picture changed.
--
Peter Geoghegan
On Wed, Aug 30, 2017 at 4:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
I may submit the simple patch to remove replacement selection, if
other contributors are receptive. Apart from everything else, the
"incrementalism" of replacement selection works against cleverer batch
memory management of the type I just outlined, which seems like a
useful direction to take tuplesort in.
I attach a patch to remove replacement selection, which I'll submit to CF 1.
--
Peter Geoghegan
Attachments:
0001-Remove-replacement-selection-sort.patch (text/x-patch; charset=US-ASCII)
From 6ccad74113d3a13264a19653e13ef897be5c902d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 30 Aug 2017 16:14:36 -0700
Subject: [PATCH] Remove replacement selection sort.
This simplifies tuplesort.c, since heap maintenance routines need no
longer concern themselves with cases where two runs are represented
within a heap.
Replacement selection sort's traditional advantages no longer allow it
to outperform a simple sort-merge strategy, even in sympathetic cases.
This is due to long term trends in hardware performance characteristics,
and enhancements to the tuplesort.c merge code in the past couple of
years.
---
doc/src/sgml/config.sgml | 39 ---
src/backend/utils/init/globals.c | 1 -
src/backend/utils/misc/guc.c | 10 -
src/backend/utils/misc/postgresql.conf.sample | 1 -
src/backend/utils/sort/tuplesort.c | 415 +++-----------------------
src/include/miscadmin.h | 1 -
src/test/regress/expected/cluster.out | 17 +-
src/test/regress/sql/cluster.sql | 14 +-
8 files changed, 51 insertions(+), 447 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2b6255e..7c625e2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1513,45 +1513,6 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
- <varlistentry id="guc-replacement-sort-tuples" xreflabel="replacement_sort_tuples">
- <term><varname>replacement_sort_tuples</varname> (<type>integer</type>)
- <indexterm>
- <primary><varname>replacement_sort_tuples</> configuration parameter</primary>
- </indexterm>
- </term>
- <listitem>
- <para>
- When the number of tuples to be sorted is smaller than this number,
- a sort will produce its first output run using replacement selection
- rather than quicksort. This may be useful in memory-constrained
- environments where tuples that are input into larger sort operations
- have a strong physical-to-logical correlation. Note that this does
- not include input tuples with an <emphasis>inverse</emphasis>
- correlation. It is possible for the replacement selection algorithm
- to generate one long run that requires no merging, where use of the
- default strategy would result in many runs that must be merged
- to produce a final sorted output. This may allow sort
- operations to complete sooner.
- </para>
- <para>
- The default is 150,000 tuples. Note that higher values are typically
- not much more effective, and may be counter-productive, since the
- priority queue is sensitive to the size of available CPU cache, whereas
- the default strategy sorts runs using a <firstterm>cache
- oblivious</firstterm> algorithm. This property allows the default sort
- strategy to automatically and transparently make effective use
- of available CPU cache.
- </para>
- <para>
- Setting <varname>maintenance_work_mem</varname> to its default
- value usually prevents utility command external sorts (e.g.,
- sorts used by <command>CREATE INDEX</> to build B-Tree
- indexes) from ever using replacement selection sort, unless the
- input tuples are quite wide.
- </para>
- </listitem>
- </varlistentry>
-
<varlistentry id="guc-autovacuum-work-mem" xreflabel="autovacuum_work_mem">
<term><varname>autovacuum_work_mem</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 7c09498..9680a4b 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -112,7 +112,6 @@ bool enableFsync = true;
bool allowSystemTableMods = false;
int work_mem = 1024;
int maintenance_work_mem = 16384;
-int replacement_sort_tuples = 150000;
/*
* Primary determinants of sizes of shared-memory structures.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 246fea8..a459672 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1932,16 +1932,6 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
- {
- {"replacement_sort_tuples", PGC_USERSET, RESOURCES_MEM,
- gettext_noop("Sets the maximum number of tuples to be sorted using replacement selection."),
- gettext_noop("When more tuples than this are present, quicksort will be used.")
- },
- &replacement_sort_tuples,
- 150000, 0, INT_MAX,
- NULL, NULL, NULL
- },
-
/*
* We use the hopefully-safely-small value of 100kB as the compiled-in
* default for max_stack_depth. InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index df5d2f3..d0177d1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -121,7 +121,6 @@
# you actively intend to use prepared transactions.
#work_mem = 4MB # min 64kB
#maintenance_work_mem = 64MB # min 1MB
-#replacement_sort_tuples = 150000 # limits use of replacement selection sort
#autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem
#max_stack_depth = 2MB # min 100kB
#dynamic_shared_memory_type = posix # the default is the first option
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 17e1b68..60522cb 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -13,47 +13,11 @@
* See Knuth, volume 3, for more than you want to know about the external
* sorting algorithm. Historically, we divided the input into sorted runs
* using replacement selection, in the form of a priority tree implemented
- * as a heap (essentially his Algorithm 5.2.3H), but now we only do that
- * for the first run, and only if the run would otherwise end up being very
- * short. We merge the runs using polyphase merge, Knuth's Algorithm
- * 5.4.2D. The logical "tapes" used by Algorithm D are implemented by
- * logtape.c, which avoids space wastage by recycling disk space as soon
- * as each block is read from its "tape".
- *
- * We do not use Knuth's recommended data structure (Algorithm 5.4.1R) for
- * the replacement selection, because it uses a fixed number of records
- * in memory at all times. Since we are dealing with tuples that may vary
- * considerably in size, we want to be able to vary the number of records
- * kept in memory to ensure full utilization of the allowed sort memory
- * space. So, we keep the tuples in a variable-size heap, with the next
- * record to go out at the top of the heap. Like Algorithm 5.4.1R, each
- * record is stored with the run number that it must go into, and we use
- * (run number, key) as the ordering key for the heap. When the run number
- * at the top of the heap changes, we know that no more records of the prior
- * run are left in the heap. Note that there are in practice only ever two
- * distinct run numbers, because since PostgreSQL 9.6, we only use
- * replacement selection to form the first run.
- *
- * In PostgreSQL 9.6, a heap (based on Knuth's Algorithm H, with some small
- * customizations) is only used with the aim of producing just one run,
- * thereby avoiding all merging. Only the first run can use replacement
- * selection, which is why there are now only two possible valid run
- * numbers, and why heapification is customized to not distinguish between
- * tuples in the second run (those will be quicksorted). We generally
- * prefer a simple hybrid sort-merge strategy, where runs are sorted in much
- * the same way as the entire input of an internal sort is sorted (using
- * qsort()). The replacement_sort_tuples GUC controls the limited remaining
- * use of replacement selection for the first run.
- *
- * There are several reasons to favor a hybrid sort-merge strategy.
- * Maintaining a priority tree/heap has poor CPU cache characteristics.
- * Furthermore, the growth in main memory sizes has greatly diminished the
- * value of having runs that are larger than available memory, even in the
- * case where there is partially sorted input and runs can be made far
- * larger by using a heap. In most cases, a single-pass merge step is all
- * that is required even when runs are no larger than available memory.
- * Avoiding multiple merge passes was traditionally considered to be the
- * major advantage of using replacement selection.
+ * as a heap (essentially his Algorithm 5.2.3H), but now we always use
+ * quicksort for run generation. We merge the runs using polyphase merge,
+ * Knuth's Algorithm 5.4.2D. The logical "tapes" used by Algorithm D are
+ * implemented by logtape.c, which avoids space wastage by recycling disk
+ * space as soon as each block is read from its "tape".
*
* The approximate amount of memory allowed for any one sort operation
* is specified in kilobytes by the caller (most pass work_mem). Initially,
@@ -64,9 +28,8 @@
* workMem, we begin to emit tuples into sorted runs in temporary tapes.
* When tuples are dumped in batch after quicksorting, we begin a new run
* with a new output tape (selected per Algorithm D). After the end of the
- * input is reached, we dump out remaining tuples in memory into a final run
- * (or two, when replacement selection is still used), then merge the runs
- * using Algorithm D.
+ * input is reached, we dump out remaining tuples in memory into a final run,
+ * then merge the runs using Algorithm D.
*
* When merging runs, we use a heap containing just the frontmost tuple from
* each source run; we repeatedly output the smallest tuple and replace it
@@ -188,13 +151,8 @@ bool optimize_bounded_sort = true;
* described above. Accordingly, "tuple" is always used in preference to
* datum1 as the authoritative value for pass-by-reference cases.
*
- * While building initial runs, tupindex holds the tuple's run number.
- * Historically, the run number could meaningfully distinguish many runs, but
- * it now only distinguishes RUN_FIRST and HEAP_RUN_NEXT, since replacement
- * selection is always abandoned after the first run; no other run number
- * should be represented here. During merge passes, we re-use it to hold the
- * input tape number that each tuple in the heap was read from. tupindex goes
- * unused if the sort occurs entirely in memory.
+ * tupindex holds the input tape number that each tuple in the heap was read
+ * from during merge passes.
*/
typedef struct
{
@@ -253,15 +211,6 @@ typedef enum
#define TAPE_BUFFER_OVERHEAD BLCKSZ
#define MERGE_BUFFER_SIZE (BLCKSZ * 32)
- /*
- * Run numbers, used during external sort operations.
- *
- * HEAP_RUN_NEXT is only used for SortTuple.tupindex, never state.currentRun.
- */
-#define RUN_FIRST 0
-#define HEAP_RUN_NEXT INT_MAX
-#define RUN_SECOND 1
-
typedef int (*SortTupleComparator) (const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
@@ -381,16 +330,8 @@ struct Tuplesortstate
void *lastReturnedTuple;
/*
- * While building initial runs, this indicates if the replacement
- * selection strategy is in use. When it isn't, then a simple hybrid
- * sort-merge strategy is in use instead (runs are quicksorted).
- */
- bool replaceActive;
-
- /*
- * While building initial runs, this is the current output run number
- * (starting at RUN_FIRST). Afterwards, it is the number of initial runs
- * we made.
+ * While building initial runs, this is the current output run number.
+ * Afterwards, it is the number of initial runs we made.
*/
int currentRun;
@@ -583,7 +524,6 @@ struct Tuplesortstate
static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
static bool consider_abort_common(Tuplesortstate *state);
-static bool useselection(Tuplesortstate *state);
static void inittapes(Tuplesortstate *state);
static void selectnewtape(Tuplesortstate *state);
static void init_slab_allocator(Tuplesortstate *state, int numSlots);
@@ -592,15 +532,12 @@ static void mergeonerun(Tuplesortstate *state);
static void beginmerge(Tuplesortstate *state);
static bool mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup);
static void dumptuples(Tuplesortstate *state, bool alltuples);
-static void dumpbatch(Tuplesortstate *state, bool alltuples);
static void make_bounded_heap(Tuplesortstate *state);
static void sort_bounded_heap(Tuplesortstate *state);
static void tuplesort_sort_memtuples(Tuplesortstate *state);
-static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
- bool checkIndex);
-static void tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple,
- bool checkIndex);
-static void tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex);
+static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple);
+static void tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple);
+static void tuplesort_heap_delete_top(Tuplesortstate *state);
static void reversedirection(Tuplesortstate *state);
static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
static void markrunend(Tuplesortstate *state, int tapenum);
@@ -738,7 +675,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
if (LACKMEM(state))
elog(ERROR, "insufficient memory allowed for sort");
- state->currentRun = RUN_FIRST;
+ state->currentRun = 0;
/*
* maxTapes, tapeRange, and Algorithm D variables will be initialized by
@@ -1622,7 +1559,7 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
inittapes(state);
/*
- * Dump tuples until we are back under the limit.
+ * Dump all tuples.
*/
dumptuples(state, false);
break;
@@ -1647,74 +1584,20 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
{
/* discard top of heap, replacing it with the new tuple */
free_sort_tuple(state, &state->memtuples[0]);
- tuple->tupindex = 0; /* not used */
- tuplesort_heap_replace_top(state, tuple, false);
+ tuplesort_heap_replace_top(state, tuple);
}
break;
case TSS_BUILDRUNS:
/*
- * Insert the tuple into the heap, with run number currentRun if
- * it can go into the current run, else HEAP_RUN_NEXT. The tuple
- * can go into the current run if it is >= the first
- * not-yet-output tuple. (Actually, it could go into the current
- * run if it is >= the most recently output tuple ... but that
- * would require keeping around the tuple we last output, and it's
- * simplest to let writetup free each tuple as soon as it's
- * written.)
- *
- * Note that this only applies when:
- *
- * - currentRun is RUN_FIRST
- *
- * - Replacement selection is in use (typically it is never used).
- *
- * When these two conditions are not both true, all tuples are
- * appended indifferently, much like the TSS_INITIAL case.
- *
- * There should always be room to store the incoming tuple.
+ * Save the tuple into the unsorted array (there must be
+ * space)
*/
- Assert(!state->replaceActive || state->memtupcount > 0);
- if (state->replaceActive &&
- COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
- {
- Assert(state->currentRun == RUN_FIRST);
-
- /*
- * Insert tuple into first, fully heapified run.
- *
- * Unlike classic replacement selection, which this module was
- * previously based on, only RUN_FIRST tuples are fully
- * heapified. Any second/next run tuples are appended
- * indifferently. While HEAP_RUN_NEXT tuples may be sifted
- * out of the way of first run tuples, COMPARETUP() will never
- * be called for the run's tuples during sifting (only our
- * initial COMPARETUP() call is required for the tuple, to
- * determine that the tuple does not belong in RUN_FIRST).
- */
- tuple->tupindex = state->currentRun;
- tuplesort_heap_insert(state, tuple, true);
- }
- else
- {
- /*
- * Tuple was determined to not belong to heapified RUN_FIRST,
- * or replacement selection not in play. Append the tuple to
- * memtuples indifferently.
- *
- * dumptuples() does not trust that the next run's tuples are
- * heapified. Anything past the first run will always be
- * quicksorted even when replacement selection is initially
- * used. (When it's never used, every tuple still takes this
- * path.)
- */
- tuple->tupindex = HEAP_RUN_NEXT;
- state->memtuples[state->memtupcount++] = *tuple;
- }
+ state->memtuples[state->memtupcount++] = *tuple;
/*
- * If we are over the memory limit, dump tuples till we're under.
+ * If we are over the memory limit, dump all tuples.
*/
dumptuples(state, false);
break;
@@ -2068,7 +1951,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
* If no more data, we've reached end of run on this tape.
* Remove the top node from the heap.
*/
- tuplesort_heap_delete_top(state, false);
+ tuplesort_heap_delete_top(state);
/*
* Rewind to free the read buffer. It'd go away at the
@@ -2079,7 +1962,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
return true;
}
newtup.tupindex = srcTape;
- tuplesort_heap_replace_top(state, &newtup, false);
+ tuplesort_heap_replace_top(state, &newtup);
return true;
}
return false;
@@ -2337,28 +2220,6 @@ tuplesort_merge_order(int64 allowedMem)
}
/*
- * useselection - determine algorithm to use to sort first run.
- *
- * It can sometimes be useful to use the replacement selection algorithm if it
- * results in one large run, and there is little available workMem. See
- * remarks on RUN_SECOND optimization within dumptuples().
- */
-static bool
-useselection(Tuplesortstate *state)
-{
- /*
- * memtupsize might be noticeably higher than memtupcount here in atypical
- * cases. It seems slightly preferable to not allow recent outliers to
- * impact this determination. Note that caller's trace_sort output
- * reports memtupcount instead.
- */
- if (state->memtupsize <= replacement_sort_tuples)
- return true;
-
- return false;
-}
-
-/*
* inittapes - initialize for tape sorting.
*
* This is called only if we have found we don't have room to sort in memory.
@@ -2413,44 +2274,7 @@ inittapes(Tuplesortstate *state)
state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
state->tp_tapenum = (int *) palloc0(maxTapes * sizeof(int));
- /*
- * Give replacement selection a try based on user setting. There will be
- * a switch to a simple hybrid sort-merge strategy after the first run
- * (iff we could not output one long run).
- */
- state->replaceActive = useselection(state);
-
- if (state->replaceActive)
- {
- /*
- * Convert the unsorted contents of memtuples[] into a heap. Each
- * tuple is marked as belonging to run number zero.
- *
- * NOTE: we pass false for checkIndex since there's no point in
- * comparing indexes in this step, even though we do intend the
- * indexes to be part of the sort key...
- */
- int ntuples = state->memtupcount;
-
-#ifdef TRACE_SORT
- if (trace_sort)
- elog(LOG, "replacement selection will sort %d first run tuples",
- state->memtupcount);
-#endif
- state->memtupcount = 0; /* make the heap empty */
-
- for (j = 0; j < ntuples; j++)
- {
- /* Must copy source tuple to avoid possible overwrite */
- SortTuple stup = state->memtuples[j];
-
- stup.tupindex = RUN_FIRST;
- tuplesort_heap_insert(state, &stup, false);
- }
- Assert(state->memtupcount == ntuples);
- }
-
- state->currentRun = RUN_FIRST;
+ state->currentRun = 0;
/*
* Initialize variables of Algorithm D (step D1).
@@ -2625,22 +2449,6 @@ mergeruns(Tuplesortstate *state)
init_slab_allocator(state, 0);
/*
- * If we produced only one initial run (quite likely if the total data
- * volume is between 1X and 2X workMem when replacement selection is used,
- * but something we particularly count on when input is presorted), we can
- * just use that tape as the finished output, rather than doing a useless
- * merge. (This obvious optimization is not in Knuth's algorithm.)
- */
- if (state->currentRun == RUN_SECOND)
- {
- state->result_tape = state->tp_tapenum[state->destTape];
- /* must freeze and rewind the finished output tape */
- LogicalTapeFreeze(state->tapeset, state->result_tape);
- state->status = TSS_SORTEDONTAPE;
- return;
- }
-
- /*
* Allocate a new 'memtuples' array, for the heap. It will hold one tuple
* from each input tape.
*/
@@ -2826,11 +2634,11 @@ mergeonerun(Tuplesortstate *state)
if (mergereadnext(state, srcTape, &stup))
{
stup.tupindex = srcTape;
- tuplesort_heap_replace_top(state, &stup, false);
+ tuplesort_heap_replace_top(state, &stup);
}
else
- tuplesort_heap_delete_top(state, false);
+ tuplesort_heap_delete_top(state);
}
/*
@@ -2892,7 +2700,7 @@ beginmerge(Tuplesortstate *state)
if (mergereadnext(state, srcTape, &tup))
{
tup.tupindex = srcTape;
- tuplesort_heap_insert(state, &tup, false);
+ tuplesort_heap_insert(state, &tup);
}
}
}
@@ -2922,125 +2730,26 @@ mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup)
}
/*
- * dumptuples - remove tuples from memtuples and write to tape
+ * dumptuples - remove tuples from memtuples and write initial run to tape
*
- * This is used during initial-run building, but not during merging.
- *
- * When alltuples = false and replacement selection is still active, dump
- * only enough tuples to get under the availMem limit (and leave at least
- * one tuple in memtuples, since puttuple will then assume it is a heap that
- * has a tuple to compare to). We always insist there be at least one free
- * slot in the memtuples[] array.
- *
- * When alltuples = true, dump everything currently in memory. (This
- * case is only used at end of input data, although in practice only the
- * first run could fail to dump all tuples when we LACKMEM(), and only
- * when replacement selection is active.)
- *
- * If, when replacement selection is active, we see that the tuple run
- * number at the top of the heap has changed, start a new run. This must be
- * the first run, because replacement selection is always abandoned for all
- * further runs.
+ * When alltuples = true, dump everything currently in memory. (This case is
+ * only used at end of input data.)
*/
static void
dumptuples(Tuplesortstate *state, bool alltuples)
{
- while (alltuples ||
- (LACKMEM(state) && state->memtupcount > 1) ||
- state->memtupcount >= state->memtupsize)
- {
- if (state->replaceActive)
- {
- /*
- * Still holding out for a case favorable to replacement
- * selection. Still incrementally spilling using heap.
- *
- * Dump the heap's frontmost entry, and remove it from the heap.
- */
- Assert(state->memtupcount > 0);
- WRITETUP(state, state->tp_tapenum[state->destTape],
- &state->memtuples[0]);
- tuplesort_heap_delete_top(state, true);
- }
- else
- {
- /*
- * Once committed to quicksorting runs, never incrementally spill
- */
- dumpbatch(state, alltuples);
- break;
- }
-
- /*
- * If top run number has changed, we've finished the current run (this
- * can only be the first run), and will no longer spill incrementally.
- */
- if (state->memtupcount == 0 ||
- state->memtuples[0].tupindex == HEAP_RUN_NEXT)
- {
- markrunend(state, state->tp_tapenum[state->destTape]);
- Assert(state->currentRun == RUN_FIRST);
- state->currentRun++;
- state->tp_runs[state->destTape]++;
- state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
-
-#ifdef TRACE_SORT
- if (trace_sort)
- elog(LOG, "finished incrementally writing %s run %d to tape %d: %s",
- (state->memtupcount == 0) ? "only" : "first",
- state->currentRun, state->destTape,
- pg_rusage_show(&state->ru_start));
-#endif
-
- /*
- * Done if heap is empty, which is possible when there is only one
- * long run.
- */
- Assert(state->currentRun == RUN_SECOND);
- if (state->memtupcount == 0)
- {
- /*
- * Replacement selection best case; no final merge required,
- * because there was only one initial run (second run has no
- * tuples). See RUN_SECOND case in mergeruns().
- */
- break;
- }
-
- /*
- * Abandon replacement selection for second run (as well as any
- * subsequent runs).
- */
- state->replaceActive = false;
-
- /*
- * First tuple of next run should not be heapified, and so will
- * bear placeholder run number. In practice this must actually be
- * the second run, which just became the currentRun, so we're
- * clear to quicksort and dump the tuples in batch next time
- * memtuples becomes full.
- */
- Assert(state->memtuples[0].tupindex == HEAP_RUN_NEXT);
- selectnewtape(state);
- }
- }
-}
-
-/*
- * dumpbatch - sort and dump all memtuples, forming one run on tape
- *
- * Second or subsequent runs are never heapified by this module (although
- * heapification still respects run number differences between the first and
- * second runs), and a heap (replacement selection priority queue) is often
- * avoided in the first place.
- */
-static void
-dumpbatch(Tuplesortstate *state, bool alltuples)
-{
int memtupwrite;
int i;
/*
+ * Nothing to do if we still fit in available memory and have array
+ * slots, unless this is the final call during initial run generation.
+ */
+ if (state->memtupcount < state->memtupsize && !LACKMEM(state) &&
+ !alltuples)
+ return;
+
+ /*
* Final call might require no sorting, in rare cases where we just so
* happen to have previously LACKMEM()'d at the point where exactly all
* remaining tuples are loaded into memory, just before input was
@@ -3308,21 +3017,8 @@ tuplesort_space_type_name(TuplesortSpaceType t)
/*
* Heap manipulation routines, per Knuth's Algorithm 5.2.3H.
- *
- * Compare two SortTuples. If checkIndex is true, use the tuple index
- * as the front of the sort key; otherwise, no.
- *
- * Note that for checkIndex callers, the heap invariant is never
- * maintained beyond the first run, and so there are no COMPARETUP()
- * calls needed to distinguish tuples in HEAP_RUN_NEXT.
*/
-#define HEAPCOMPARE(tup1,tup2) \
- (checkIndex && ((tup1)->tupindex != (tup2)->tupindex || \
- (tup1)->tupindex == HEAP_RUN_NEXT) ? \
- ((tup1)->tupindex) - ((tup2)->tupindex) : \
- COMPARETUP(state, tup1, tup2))
-
/*
* Convert the existing unordered array of SortTuples to a bounded heap,
* discarding all but the smallest "state->bound" tuples.
@@ -3331,10 +3027,6 @@ tuplesort_space_type_name(TuplesortSpaceType t)
* at the root (array entry zero), instead of the smallest as in the normal
* sort case. This allows us to discard the largest entry cheaply.
* Therefore, we temporarily reverse the sort direction.
- *
- * We assume that all entries in a bounded heap will always have tupindex
- * zero; it therefore doesn't matter that HEAPCOMPARE() doesn't reverse
- * the direction of comparison for tupindexes.
*/
static void
make_bounded_heap(Tuplesortstate *state)
@@ -3358,8 +3050,7 @@ make_bounded_heap(Tuplesortstate *state)
/* Must copy source tuple to avoid possible overwrite */
SortTuple stup = state->memtuples[i];
- stup.tupindex = 0; /* not used */
- tuplesort_heap_insert(state, &stup, false);
+ tuplesort_heap_insert(state, &stup);
}
else
{
@@ -3374,7 +3065,7 @@ make_bounded_heap(Tuplesortstate *state)
CHECK_FOR_INTERRUPTS();
}
else
- tuplesort_heap_replace_top(state, &state->memtuples[i], false);
+ tuplesort_heap_replace_top(state, &state->memtuples[i]);
}
}
@@ -3404,7 +3095,7 @@ sort_bounded_heap(Tuplesortstate *state)
SortTuple stup = state->memtuples[0];
/* this sifts-up the next-largest entry and decreases memtupcount */
- tuplesort_heap_delete_top(state, false);
+ tuplesort_heap_delete_top(state);
state->memtuples[state->memtupcount] = stup;
}
state->memtupcount = tupcount;
@@ -3422,10 +3113,7 @@ sort_bounded_heap(Tuplesortstate *state)
/*
* Sort all memtuples using specialized qsort() routines.
*
- * Quicksort is used for small in-memory sorts. Quicksort is also generally
- * preferred to replacement selection for generating runs during external sort
- * operations, although replacement selection is sometimes used for the first
- * run.
+ * Quicksort is used for small in-memory sorts, and external sort runs.
*/
static void
tuplesort_sort_memtuples(Tuplesortstate *state)
@@ -3454,15 +3142,13 @@ tuplesort_sort_memtuples(Tuplesortstate *state)
* is, it might get overwritten before being moved into the heap!
*/
static void
-tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
- bool checkIndex)
+tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple)
{
SortTuple *memtuples;
int j;
memtuples = state->memtuples;
Assert(state->memtupcount < state->memtupsize);
- Assert(!checkIndex || tuple->tupindex == RUN_FIRST);
CHECK_FOR_INTERRUPTS();
@@ -3475,7 +3161,7 @@ tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
{
int i = (j - 1) >> 1;
- if (HEAPCOMPARE(tuple, &memtuples[i]) >= 0)
+ if (COMPARETUP(state, tuple, &memtuples[i]) >= 0)
break;
memtuples[j] = memtuples[i];
j = i;
@@ -3491,12 +3177,11 @@ tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
* if necessary.
*/
static void
-tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex)
+tuplesort_heap_delete_top(Tuplesortstate *state)
{
SortTuple *memtuples = state->memtuples;
SortTuple *tuple;
- Assert(!checkIndex || state->currentRun == RUN_FIRST);
if (--state->memtupcount <= 0)
return;
@@ -3505,7 +3190,7 @@ tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex)
* current top node with it.
*/
tuple = &memtuples[state->memtupcount];
- tuplesort_heap_replace_top(state, tuple, checkIndex);
+ tuplesort_heap_replace_top(state, tuple);
}
/*
@@ -3516,14 +3201,12 @@ tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex)
* Heapsort, steps H3-H8).
*/
static void
-tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple,
- bool checkIndex)
+tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple)
{
SortTuple *memtuples = state->memtuples;
unsigned int i,
n;
- Assert(!checkIndex || state->currentRun == RUN_FIRST);
Assert(state->memtupcount >= 1);
CHECK_FOR_INTERRUPTS();
@@ -3542,9 +3225,9 @@ tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple,
if (j >= n)
break;
if (j + 1 < n &&
- HEAPCOMPARE(&memtuples[j], &memtuples[j + 1]) > 0)
+ COMPARETUP(state, &memtuples[j], &memtuples[j + 1]) > 0)
j++;
- if (HEAPCOMPARE(tuple, &memtuples[j]) <= 0)
+ if (COMPARETUP(state, tuple, &memtuples[j]) <= 0)
break;
memtuples[i] = memtuples[j];
i = j;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index dad98de..3950054 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -241,7 +241,6 @@ extern bool enableFsync;
extern bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT int maintenance_work_mem;
-extern PGDLLIMPORT int replacement_sort_tuples;
extern int VacuumCostPageHit;
extern int VacuumCostPageMiss;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 097ac2b..82713bf 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -444,22 +444,8 @@ create table clstr_4 as select * from tenk1;
create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
--- Use external sort that only ever uses quicksort to sort runs:
+-- Use external sort:
set maintenance_work_mem = '1MB';
-set replacement_sort_tuples = 0;
-cluster clstr_4 using cluster_sort;
-select * from
-(select hundred, lag(hundred) over () as lhundred,
- thousand, lag(thousand) over () as lthousand,
- tenthous, lag(tenthous) over () as ltenthous from clstr_4) ss
-where row(hundred, thousand, tenthous) <= row(lhundred, lthousand, ltenthous);
- hundred | lhundred | thousand | lthousand | tenthous | ltenthous
----------+----------+----------+-----------+----------+-----------
-(0 rows)
-
--- Replacement selection will now be forced. It should only produce a single
--- run, due to the fact that input is found to be presorted:
-set replacement_sort_tuples = 150000;
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
@@ -472,7 +458,6 @@ where row(hundred, thousand, tenthous) <= row(lhundred, lthousand, ltenthous);
reset enable_indexscan;
reset maintenance_work_mem;
-reset replacement_sort_tuples;
-- clean up
DROP TABLE clustertest;
DROP TABLE clstr_1;
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index 8dd9459..a6c2757 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -203,19 +203,8 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
--- Use external sort that only ever uses quicksort to sort runs:
+-- Use external sort:
set maintenance_work_mem = '1MB';
-set replacement_sort_tuples = 0;
-cluster clstr_4 using cluster_sort;
-select * from
-(select hundred, lag(hundred) over () as lhundred,
- thousand, lag(thousand) over () as lthousand,
- tenthous, lag(tenthous) over () as ltenthous from clstr_4) ss
-where row(hundred, thousand, tenthous) <= row(lhundred, lthousand, ltenthous);
-
--- Replacement selection will now be forced. It should only produce a single
--- run, due to the fact that input is found to be presorted:
-set replacement_sort_tuples = 150000;
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
@@ -225,7 +214,6 @@ where row(hundred, thousand, tenthous) <= row(lhundred, lthousand, ltenthous);
reset enable_indexscan;
reset maintenance_work_mem;
-reset replacement_sort_tuples;
-- clean up
DROP TABLE clustertest;
--
2.7.4
On Fri, Sep 1, 2017 at 6:00 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Aug 30, 2017 at 4:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
I may submit the simple patch to remove replacement selection, if
other contributors are receptive. Apart from everything else, the
"incrementalism" of replacement selection works against cleverer batch
memory management of the type I just outlined, which seems like a
useful direction to take tuplesort in.
I attach a patch to remove replacement selection, which I'll submit to CF 1.
This breaks the documentation build, because
doc/src/sgml/release-9.6.sgml still contains <xref
linkend="guc-replacement-sort-tuples"> but you removed that id.
--
Thomas Munro
http://www.enterprisedb.com
On Wed, Sep 6, 2017 at 2:47 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I attach a patch to remove replacement selection, which I'll submit to CF 1.
This breaks the documentation build, because
doc/src/sgml/release-9.6.sgml still contains <xref
linkend="guc-replacement-sort-tuples"> but you removed that id.
Thanks for looking into it.
I suppose that the solution is to change the 9.6 release notes to no
longer have a replacement_sort_tuples link. Anyone else have an
opinion on that?
--
Peter Geoghegan
Hi,
On 08/31/2017 02:56 AM, Peter Geoghegan wrote:
On Wed, Aug 30, 2017 at 5:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Wow. Just to be clear, I am looking for the BEST case for replacement
selection, not the worst case. But I would have expected that case to
be a win for replacement selection, and it clearly isn't. I can
reproduce your results here.
But I *was* trying to get a best case. That's why it isn't even worse.
That's what the docs say the best case is, after all.
This is the kind of thing that replacement selection actually did do
better with on 9.6. I clearly remember Tomas Vondra doing lots of
benchmarking, showing some benefit with RS with a work_mem of 8MB or
less. As I said in my introduction on this thread, we weren't wrong to
add replacement_sort_tuples to 9.6, given where things were with
merging at the time. But, it does very much appear to create less than
zero benefit these days. The picture changed.
Do we need/want to repeat some of that benchmarking on these patches? I
don't recall how much this code changed since those benchmarks were done
in the 9.6 cycle.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Sep 7, 2017 at 5:38 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
Do we need/want to repeat some of that benchmarking on these patches? I
don't recall how much this code changed since those benchmarks were done
in the 9.6 cycle.
+1 for some new benchmarking. I'm all for removing this code if we
don't need it any more, but it'd be a lot better to have more numbers
before deciding that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 7, 2017 at 2:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 7, 2017 at 5:38 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
Do we need/want to repeat some of that benchmarking on these patches? I
don't recall how much this code changed since those benchmarks were done
in the 9.6 cycle.
+1 for some new benchmarking. I'm all for removing this code if we
don't need it any more, but it'd be a lot better to have more numbers
before deciding that.
It isn't always a win. For example, if I alter the pgbench_accounts
table so that the column "aid" is of type numeric, the picture changes
for my test case -- replacement selection is still somewhat faster for
the "select count(distinct aid) from pgbench_accounts" query with
work_mem=2MB. It takes about 5.4 seconds with replacement selection,
versus 7.4 seconds for quicksorting. But, I still think we should
proceed with my patch, because:
* Quicksort is still faster with int4/int8 comparisons on modern hardware, and
I think that most ordered inputs will be of those types in practice.
On balance, the trade-off between those two properties (faster for
int4/int8 when ordered, slower for numeric) still favors killing
replacement selection entirely. It's not a slam dunk, but it makes
sense on performance grounds, IMV.
* The comparator numeric_fast_cmp() is not well optimized -- it
doesn't avoid allocating memory with each call, for example, unlike
varstrfastcmp_locale(). This could and should be improved, and might
even change the picture for this second test case.
* With the default work_mem of 4MB, we never use replacement selection
anyway. Whatever its merits, it's pretty rare to use replacement
selection, simply because the default replacement_sort_tuples just
isn't that many tuples (150,000).
* The upside of my patch is not inconsiderable: we can remove a lot of
code, which will enable further, far more compelling improvements in
the future (cleaner memory management, and the use of memory batches
during run generation).
--
Peter Geoghegan
On 8 September 2017 at 18:06, Peter Geoghegan <pg@bowt.ie> wrote:
* Quicksort is still faster with int4/int8 comparisons on modern hardware, and
I think that most ordered inputs will be of those types in practice.
This may be a bit "how long is a piece of string" but how do those two
compare with string sorting in an interesting encoding/locale -- say
/usr/share/dict/polish in pl_PL for example. It's certainly true that
people do sort text as well as numbers. Also, people often sort on
keys of more than one column....
--
greg
On Sat, Sep 9, 2017 at 8:28 AM, Greg Stark <stark@mit.edu> wrote:
This may be a bit "how long is a piece of string" but how do those two
compare with string sorting in an interesting encoding/locale -- say
/usr/share/dict/polish in pl_PL for example. It's certainly true that
people do sort text as well as numbers.
I haven't looked at text, because I presume that it's very rare for
tuples within a table to be more or less ordered by a text attribute.
The traditional big advantage of replacement selection has always been
its ability to double initial run length on average -- and text
performance is quite interesting there, because localized clustering
still happens -- but that doesn't really seem relevant here. The remaining
use of replacement selection is expressly only about the "single run,
no merge" best case, and in particular about avoiding regressions when
upgrading from versions prior to 9.6 (9.6 is the version where we
began to generally prefer quicksort).
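(As an aside, the classic "runs about twice the size of memory" result
for random input is easy to reproduce with a toy, standalone simulation
-- random integer keys and a bare-bones heap, not tuplesort.c code. With
a heap of 1,000 slots and a million random keys it should report an
average run length of close to twice the heap size:)

/*
 * Simulate replacement selection over random keys and report the average
 * run length.  HEAP_SLOTS stands in for the number of tuples that fit in
 * workMem.
 */
#include <stdio.h>
#include <stdlib.h>

#define HEAP_SLOTS  1000
#define NKEYS       1000000

typedef struct
{
    int     run;
    long    key;
} Slot;

static Slot heap[HEAP_SLOTS];
static int  nslots = 0;

static int
slot_cmp(const Slot *a, const Slot *b)
{
    if (a->run != b->run)
        return a->run - b->run;
    return (a->key > b->key) - (a->key < b->key);
}

static void
sift_down(int i)
{
    for (;;)
    {
        int     l = 2 * i + 1, r = l + 1, s = i;
        Slot    tmp;

        if (l < nslots && slot_cmp(&heap[l], &heap[s]) < 0)
            s = l;
        if (r < nslots && slot_cmp(&heap[r], &heap[s]) < 0)
            s = r;
        if (s == i)
            return;
        tmp = heap[i];
        heap[i] = heap[s];
        heap[s] = tmp;
        i = s;
    }
}

int
main(void)
{
    long    nruns = 0;
    int     lastrun = -1;
    long    i;

    srandom(42);

    /* load and heapify the first HEAP_SLOTS keys (all run 0) */
    for (nslots = 0; nslots < HEAP_SLOTS; nslots++)
    {
        heap[nslots].run = 0;
        heap[nslots].key = random();
    }
    for (i = HEAP_SLOTS / 2 - 1; i >= 0; i--)
        sift_down((int) i);

    /* emit NKEYS keys in total: steady state first, then drain the heap */
    for (i = HEAP_SLOTS; i < NKEYS + HEAP_SLOTS; i++)
    {
        Slot    top = heap[0];

        if (top.run != lastrun)
        {
            nruns++;
            lastrun = top.run;
        }
        if (i < NKEYS)
        {
            /* replace the emitted key; out-of-order keys start a new run */
            heap[0].key = random();
            heap[0].run = (heap[0].key >= top.key) ? top.run : top.run + 1;
        }
        else
            heap[0] = heap[--nslots];
        sift_down(0);
    }

    printf("%d keys formed %ld runs (average %.0f per run; heap holds %d)\n",
           NKEYS, nruns, (double) NKEYS / nruns, HEAP_SLOTS);
    return 0;
}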
Also, people often sort on
keys of more than one column....
Very true. I should test this.
--
Peter Geoghegan
On 09/11/2017 01:03 AM, Peter Geoghegan wrote:
On Sat, Sep 9, 2017 at 8:28 AM, Greg Stark <stark@mit.edu> wrote:
This may be a bit "how long is a piece of string" but how do those two
compare with string sorting in an interesting encoding/locale -- say
/usr/share/dict/polish in pl_PL for example. It's certainly true that
people do sort text as well as numbers.
I haven't looked at text, because I presume that it's very rare for
tuples within a table to be more or less ordered by a text
attribute. The traditional big advantage of replacement selection has
always been its ability to double initial run length on average -- and
text performance is quite interesting there, because localized
clustering still happens -- but that doesn't really seem relevant
here. The remaining use of replacement selection is expressly only
about the "single run, no merge" best case, and in particular about
avoiding regressions when upgrading from versions prior to 9.6 (9.6
is the version where we began to generally prefer quicksort).
Also, people often sort on keys of more than one column....
Very true. I should test this.
I'm currently re-running the benchmarks we did in 2016 for 9.6, but
those are all sorts with a single column (see the attached script). But
it'd be good to add a few queries testing sorts with multiple keys. We
can either tweak some of the existing data sets + queries, or come up
with entirely new tests.
The current tests construct data sets with different features (unique or
low/high-cardinality, random/sorted/almost-sorted). How should that work
for multi-key sorts? Same features for all columns, or some mix?
For the existing queries, I should have some initial results tomorrow,
at least for the data sets with 100k and 1M rows. The tests with 10M
rows will take much more time (it takes 1-2 hours for a single work_mem
value, and we're testing 6 of them).
But perhaps we can reduce the number of tests somehow, only testing the
largest/smallest work_mem values? So that we could get some numbers now,
possibly running the complete test suite later.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Sep 10, 2017 at 5:07 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I'm currently re-running the benchmarks we did in 2016 for 9.6, but
those are all sorts with a single column (see the attached script). But
it'd be good to add a few queries testing sorts with multiple keys. We
can either tweak some of the existing data sets + queries, or come up
with entirely new tests.
I see that work_mem is set like this in the script:
"for wm in '1MB' '8MB' '32MB' '128MB' '512MB' '1GB'; do"
I suggest that we forget about values over 32MB, since the question of
how well quicksort does there was settled by your tests in 2016. I
would also add '4MB' to the list of wm values that you'll actually
test.
Any case with input that is initially in random order or DESC sort
order is not interesting, either. I suggest you remove those, too.
I think we're only interested in benchmarks where replacement
selection really does get its putative best case (no merge needed in
the end). Any (almost) sorted cases (the only cases that are
interesting to test now) will always manage that, once you set
replacement_sort_tuples high enough, and provided there isn't even a
single tuple that is completely out of order. The "before" cases here
should have a replacement_sort_tuples of 1 billion (so that we're sure
to not have the limit prevent the use of replacement selection in the
first place), versus the "after" cases, which should have a
replacement_sort_tuples of 0 to represent my proposal (to represent
performance in a world where replacement selection is totally
removed).
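In session terms, that amounts to something like the following (just a
sketch of the two settings being compared, using the GUC directly
rather than applying my patch):

    -- "before": make sure replacement selection is actually used for run 1
    SET replacement_sort_tuples = 1000000000;

    -- "after": disable replacement selection, approximating its removal
    SET replacement_sort_tuples = 0;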
For the existing queries, I should have some initial results tomorrow,
at least for the data sets with 100k and 1M rows. The tests with 10M
rows will take much more time (it takes 1-2 hours for a single work_mem
value, and we're testing 6 of them).
I myself don't see that much value in a 10M row test.
Thanks for volunteering to test!
--
Peter Geoghegan
On 09/11/2017 02:22 AM, Peter Geoghegan wrote:
On Sun, Sep 10, 2017 at 5:07 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I'm currently re-running the benchmarks we did in 2016 for 9.6, but
those are all sorts with a single column (see the attached script). But
it'd be good to add a few queries testing sorts with multiple keys. We
can either tweak some of the existing data sets + queries, or come up
with entirely new tests.
I see that work_mem is set like this in the script:
"for wm in '1MB' '8MB' '32MB' '128MB' '512MB' '1GB'; do"
I suggest that we forget about values over 32MB, since the question of
how well quicksort does there was settled by your tests in 2016. I
would also add '4MB' to the list of wm values that you'll actually
test.
OK, so 1MB, 4MB, 8MB, 32MB?
Any case with input that is initially in random order or DESC sort
order is not interesting, either. I suggest you remove those, too.
OK.
I think we're only interested in benchmarks where replacement
selection really does get its putative best case (no merge needed in
the end). Any (almost) sorted cases (the only cases that are
interesting to test now) will always manage that, once you set
replacement_sort_tuples high enough, and provided there isn't even a
single tuple that is completely out of order. The "before" cases here
should have a replacement_sort_tuples of 1 billion (so that we're sure
the limit doesn't prevent the use of replacement selection in the
first place), versus the "after" cases, which should have a
replacement_sort_tuples of 0, representing my proposal (performance in
a world where replacement selection is totally removed).
Ah, so you suggest doing all the tests on current master, by only
tweaking the replacement_sort_tuples value? I've been testing master vs.
your patch, but I guess setting replacement_sort_tuples=0 should have
the same effect.
I probably won't eliminate the random/DESC data sets, though. At least
not from the two smaller data sets - I want to do a bit of benchmarking
on Heikki's polyphase merge removal patch, and for that patch those data
sets are still relevant. Also, it's useful to have a subset of results
where we know we don't expect any change.
For the existing queries, I should have some initial results
tomorrow, at least for the data sets with 100k and 1M rows. The
tests with 10M rows will take much more time (it takes 1-2 hours for
a single work_mem value, and we're testing 6 of them).
I myself don't see that much value in a 10M row test.
Meh, more data is probably better. And with the reduced work_mem values
and skipping of random/DESC data sets it should complete much faster.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Sep 10, 2017 at 5:59 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
OK, so 1MB, 4MB, 8MB, 32MB?
Right.
Ah, so you suggest doing all the tests on current master, by only
tweaking the replacement_sort_tuples value? I've been testing master vs.
your patch, but I guess setting replacement_sort_tuples=0 should have
the same effect.
I think that there may be a very slight advantage to RS-free
performance with my patch actually applied, because it reduces the
number of instructions that the heap maintenance (merging) routines
consist of. This is due to my removing HEAPCOMPARE()/tupindex
handling. However, it's probably a low single digit percentage
difference -- not enough to be worth taking into account. I assume
that just setting replacement_sort_tuples to zero is more convenient
for you (that's what I did).
To be clear, you'll still need to set replacement_sort_tuples high
when testing RS, to make sure that we really use it for at least the
first run when we're expected to. (There is no easy way to have
testing mechanically verify that we really do only have one run in the
end with RS, but I assume that such paranoia is unneeded.)
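If you do want to eyeball it, something like this should work in a
session (a rough sketch, assuming a build with TRACE_SORT enabled,
which is the default; trace_sort's LOG output includes per-run
details):

    -- pull the trace_sort LOG output into the psql session, so the
    -- number of runs produced can be checked by eye
    SET trace_sort = on;
    SET client_min_messages = log;
    SET replacement_sort_tuples = 1000000000;  -- the "before" configuration
    -- ... run the test query and read the LOG lines it emits ...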
I probably won't eliminate the random/DESC data sets, though. At least
not from the two smaller data sets - I want to do a bit of benchmarking
on Heikki's polyphase merge removal patch, and for that patch those data
sets are still relevant. Also, it's useful to have a subset of results
where we know we don't expect any change.
Okay. The DESC dataset is going to make my patch look good, because it
won't change anything for merging (same number of runs in the end),
but sorting will be slower for the first run with RS.
Meh, more data is probably better. And with the reduced work_mem values
and skipping of random/DESC data sets it should complete much faster.
Note that my own test case had a much higher number of tuples than
even 10 million -- it had 100 million tuples. I think that if any of
your test cases bring about a new insight, it will not be due to the
number of distinct tuples. But, if the extra time it takes doesn't
matter to you, then it doesn't matter to me either.
--
Peter Geoghegan
Hi, attached are some initial numbers from the two machines I'm using
for testing (the usual ones - old i5-2500k, new e5-2620), for the two
smaller data sets.
Overall I think the results show a quite significant positive impact
from the patch. There are a few cases of regression, but ISTM those may
easily be noise, as it's usually 0.03 vs. 0.04 seconds, or something.
I'll switch to \timing (instead of /usr/bin/time) to get more accurate
results, and rerun those tests.
FWIW, I've been running the tests with trace_sort, so we can inspect the
server log if needed for individual cases. I've pushed the data to
https://bitbucket.org/tvondra/sort-benchmarks-2017/src
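Roughly, each measurement in the re-run would then look like this in
psql (just a sketch of the session setup, not the actual script):

    \timing on                    -- per-query timing instead of /usr/bin/time
    SET trace_sort = on;          -- per-sort details still go to the server log
    SET work_mem = '8MB';
    SET replacement_sort_tuples = 0;   -- or 1000000000 for the "before" runs
    -- ... one of the benchmark queries ...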
On 09/11/2017 03:39 AM, Peter Geoghegan wrote:
On Sun, Sep 10, 2017 at 5:59 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
[snip]
To be clear, you'll still need to set replacement_sort_tuples high
when testing RS, to make sure that we really use it for at least the
first run when we're expected to. (There is no easy way to have
testing mechanically verify that we really do only have one run in the
end with RS, but I assume that such paranoia is unneeded.)
OK. I'll probably try both, at least with the two small datasets.
Doesn't hurt, and perhaps the numbers will be interesting.
I probably won't eliminate the random/DESC data sets, though. At
least not from the two smaller data sets - I want to do a bit of
benchmarking on Heikki's polyphase merge removal patch, and for
that patch those data sets are still relevant. Also, it's useful to
have a subset of results where we know we don't expect any change.
Okay. The DESC dataset is going to make my patch look good, because
it won't change anything for merging (same number of runs in the
end), but sorting will be slower for the first run with RS.
Well, then I think it's a useful test and we should not exclude it. I
assume there will be a few cases where the patch causes regressions,
and to judge the overall impact of the patch it's useful to also
quantify the positive cases (even if we already expect improvements
there).
Meh, more data is probably better. And with the reduced work_mem
values and skipping of random/DESC data sets it should complete
much faster.
Note that my own test case had a much higher number of tuples than
even 10 million -- it had 100 million tuples. I think that if any of
your test cases bring about a new insight, it will not be due to the
number of distinct tuples. But, if the extra time it takes doesn't
matter to you, then it doesn't matter to me either.
I wouldn't say the extra time does not matter, but I think it would be
good to get some initial results quickly. So I'll focus on the two
smaller data sets for now, and perhaps run the larger tests later.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
e5-2620-v4.ods (application/vnd.oasis.opendocument.spreadsheet)