Using quicksort for every external sort run
I'll start a new thread for this, since my external sorting patch has
now evolved well past the original "quicksort with spillover"
idea...although not quite how I anticipated it would. It seems like
I've reached a good point to get some feedback.
I attach a patch series featuring a new, more comprehensive approach
to quicksorting runs during external sorts. What I have now still
includes "quicksort with spillover", but it's just a part of a larger
project. I am quite happy with the improvements in performance shown
by my testing, which I go into below.
Controversy
=========
A few weeks ago, I did not anticipate proposing that replacement
selection sort be used far less (only somewhat less, since at the time
I was only somewhat doubtful about the algorithm). I had originally
planned on continuing to *always* use it for the first run, both to
make "quicksort with spillover" possible (thereby sometimes avoiding
significant I/O by not spilling most tuples), and to keep the cases
traditionally considered sympathetic to replacement selection working
as before. I thought that second or subsequent runs could still be
quicksorted, but that I still had to care about that latter category,
the traditional sympathetic cases. That category mostly comes down to
one important property of replacement selection: even without a strong
logical/physical correlation, the algorithm tends to produce runs that
are about twice the size of work_mem. (It's also notable that
replacement selection produces only one run with mostly presorted
input, even where the input far exceeds work_mem, which is a neat
trick.)
I wanted to avoid controversy, but the case against replacement
selection is too strong for me to ignore: despite these upsides, it
is obsolete, and should usually be avoided.
Replacement selection sort still has a role to play in making
"quicksort with spillover" possible (when a sympathetic case is
*anticipated*), but other than that it seems generally inferior to a
simple hybrid sort-merge strategy on modern hardware. By modern
hardware, I mean anything manufactured in the last 20 years or so.
We've already seen that the algorithm's use of a heap works badly with
modern CPU caches, but that is just one factor contributing to its
obsolescence.
The big selling point of replacement selection sort in the 20th
century was that it sometimes avoided multi-pass sorts as compared to
a simple sort-merge strategy (remember when tuplesort.c always used 7
tapes? When you need to use 7 actual magnetic tapes, rewinding is
expensive and in general this matters a lot!). We all know that memory
capacity has grown enormously since then, but we must also consider
another factor: over the same period, the amount of data that a simple
hybrid sort-merge strategy can handle while still getting the
important detail here right -- avoiding a multi-pass sort -- has grown
quadratically relative to work_mem/memory capacity. As an example,
testing shows that for a datum tuplesort that requires about 2300MB of
work_mem to be completed as a simple internal sort, this patch needs
only 30MB to complete the sort with a single merge pass (see benchmark
query below). I have regressed that particular property of tuplesort
(it used to need less than 30MB to avoid multiple passes), but that's
clearly the wrong thing to worry about, for all kinds of reasons,
probably even for the unimportant cases now forced to do multiple
passes.
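
To put rough numbers on that quadratic growth, here is a
back-of-the-envelope sketch. This is not code from the patch, and the
256kB per-tape buffer is an assumed figure chosen only for
illustration, not a constant taken from tuplesort.c:

#include <stdio.h>

/*
 * Rough sketch only: with a simple hybrid sort-merge strategy, each
 * quicksorted run is roughly work_mem in size, and the number of runs
 * that can be merged in a single pass is roughly work_mem divided by
 * the per-tape read buffer.  One-pass capacity therefore grows
 * quadratically in work_mem.
 */
int
main(void)
{
	const double buf_per_tape = 256.0 * 1024;	/* assumption */
	double		work_mem_mb;

	for (work_mem_mb = 8; work_mem_mb <= 2048; work_mem_mb *= 4)
	{
		double		wm = work_mem_mb * 1024 * 1024;
		double		merge_order = wm / buf_per_tape;
		double		one_pass_gb = merge_order * wm / (1024.0 * 1024 * 1024);

		printf("work_mem %6.0fMB: ~%6.0f-way merge, ~%10.1fGB sortable in one pass\n",
			   work_mem_mb, merge_order, one_pass_gb);
	}
	return 0;
}

Under those assumptions, going from 8MB to 512MB of work_mem takes
one-pass capacity from a fraction of a gigabyte to about a terabyte.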
Multi-pass sorts
---------------------
I believe, in general, that we should consider a multi-pass sort to be
a kind of inherently suspect thing these days, in the same way that
checkpoints occurring 5 seconds apart are: not actually abnormal, but
something that we should regard suspiciously. Can you really not
afford enough work_mem to only do one pass? Does it really make sense
to add far more I/O and CPU costs to avoid that other tiny memory
capacity cost?
In theory, the answer could be "yes", but it seems highly unlikely.
Not only is very little memory required to avoid a multi-pass merge
step, but as described above the amount required grows very slowly
relative to linear growth in input. I propose to add a
checkpoint_warning-style warning (with a checkpoint_warning-style GUC
to control it). ISTM that these days, multi-pass merges are like
saving $2 on replacing a stairwell light bulb, at the expense of
regularly stumbling down the stairs in the dark. It shouldn't matter
if you have a 50 terabyte decision support database or if you're
paying Heroku a small monthly fee to run a database backing your web
app: simply avoiding multi-pass merges is probably always the most
economical solution, and by a wide margin.
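
To be concrete about the warning I have in mind, something along these
lines -- the "multipass_warning" GUC, the "nMergePasses" counter, and
the message wording below are placeholders of my own, not code from
the patch:

	/*
	 * Hypothetical sketch of a checkpoint_warning-style nudge; the GUC
	 * and counter names are placeholders, not identifiers from the
	 * patch.
	 */
	if (multipass_warning && nMergePasses > 1)
		ereport(WARNING,
				(errmsg("external sort required %d merge passes",
						nMergePasses),
				 errhint("Consider increasing \"work_mem\" so that sorts "
						 "complete with a single merge pass.")));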
Note that I am not skeptical of polyphase merging itself, even though
it is generally considered to be a complementary technique to
replacement selection (some less formal writing on external sorting
seemingly fails to draw a sharp distinction). Nothing has changed
there.
Patch, performance
===============
Let's focus on a multi-run sort that does not use "quicksort with
spillover", since that is all new and is probably the most compelling
case for very large databases with hundreds of gigabytes of data to
sort.
I think that this patch requires a machine with more I/O bandwidth
than my laptop to get a proper sense of the improvement made. I've
been using a tmpfs temp_tablespace for testing, to simulate this. That
may leave me slightly optimistic about I/O costs, but you can usually
get significantly more sequential I/O bandwidth by adding additional
disks, whereas you cannot really buy new hardware to improve the
situation with excessive CPU cache misses.
Benchmark
---------------
-- Setup, 100 million tuple table with high cardinality int4 column (2
billion possible int4 values)
create table big_high_cardinality_int4 as
select (random() * 2000000000)::int4 s,
'abcdefghijlmn'::text junk
from generate_series(1, 100000000);
-- Make cost model hinting accurate:
analyze big_high_cardinality_int4;
checkpoint;
Let's start by comparing an external sort by the patch, using just
over 1/3 of the memory needed for an internal sort, against an
internal sort on the master branch. That's completely unfair
on the patch, of course, but it is a useful indicator of how well
external sorts do overall. Although an external sort surely cannot be
as fast as an internal sort, it might be able to approach an internal
sort's speed when there is plenty of I/O bandwidth. That's a good
thing to aim for, I think.
-- Master (just enough memory for internal sort):
set work_mem = '2300MB';
select count(distinct(s)) from big_high_cardinality_int4;
***** Runtime after stabilization: ~33.6 seconds *****
-- Patch series, but with just over 1/3 the memory:
set work_mem = '800MB';
select count(distinct(s)) from big_high_cardinality_int4;
***** Runtime after stabilization: ~37.1 seconds *****
The patch only takes ~10% more time to execute this query, which seems
very good considering that ~1/3 the work_mem has been put to use.
trace_sort output for patch during execution of this case:
LOG: begin datum sort: workMem = 819200, randomAccess = f
LOG: switching to external sort with 2926 tapes: CPU 0.39s/2.66u sec
elapsed 3.06 sec
LOG: replacement selection avg tuple size 24.00 crossover: 0.85
LOG: hybrid sort-merge in use from row 34952532 with 100000000.00 total rows
LOG: finished quicksorting run 1: CPU 0.39s/8.84u sec elapsed 9.24 sec
LOG: finished writing quicksorted run 1 to tape 0: CPU 0.60s/9.61u
sec elapsed 10.22 sec
LOG: finished quicksorting run 2: CPU 0.87s/18.61u sec elapsed 19.50 sec
LOG: finished writing quicksorted run 2 to tape 1: CPU 1.07s/19.38u
sec elapsed 20.46 sec
LOG: performsort starting: CPU 1.27s/21.79u sec elapsed 23.07 sec
LOG: finished quicksorting run 3: CPU 1.27s/27.07u sec elapsed 28.35 sec
LOG: finished writing quicksorted run 3 to tape 2: CPU 1.47s/27.69u
sec elapsed 29.18 sec
LOG: performsort done (except 3-way final merge): CPU 1.51s/28.54u
sec elapsed 30.07 sec
LOG: external sort ended, 146625 disk blocks used: CPU 1.76s/35.32u
sec elapsed 37.10 sec
Note that the cost of writing out the on-tape runs is small relative
to the CPU costs, so this query is a bit sympathetic (consider the
time spent writing batches that trace_sort indicates here). CREATE
INDEX would not compare so well with an internal sort, for example,
especially if it were a
composite index or something. I've sized work_mem here in a deliberate
way, to make sure there are 3 runs of similar size by the time the
merge step is reached, which makes a small difference in the patch's
favor. All told, this seems like a very significant overall
improvement.
Now, consider master's performance with the same work_mem setting (a
fair test with comparable resource usage for master and patch):
-- Master
set work_mem = '800MB';
select count(distinct(s)) from big_high_cardinality_int4;
***** Runtime after stabilization: ~120.9 seconds *****
The patch is ~3.25x faster than master here, which also seems like a
significant improvement. That's pretty close to the improvement
previously seen for good "quicksort with spillover" cases, but
suitable for every external sort case that doesn't use "quicksort with
spillover". In other words, no variety of external sort is not
significantly improved by the patch.
I think it's safe to suppose that there are also big benefits when
multiple concurrent sort operations run on the same system. For
example, when pg_restore has multiple jobs.
Worst case
---------------
Even with a traditionally sympathetic case for replacement selection
sort, the patch beats replacement selection with multiple on-tape
runs. When experimenting here, I did not forget to account for our
qsort()'s behavior in the event of *perfectly* presorted input
("Bubble sort best case" behavior [1]Commit a3f0b3d6 -- Peter Geoghegan). Other than that, I have a hard
time thinking of an unsympathetic case for the patch, and could not
find any actual regressions with a fair amount of effort.
Abbreviated keys are not used when merging, but that doesn't seem to
be something that notably counts against the new approach (which will
have shorter runs on average). After all, the reason why abbreviated
keys aren't saved on disk for merging is that they're probably not
very useful when merging. They would resolve far fewer comparisons if
they were used during merging, and having somewhat smaller runs does
not result in significantly more non-abbreviated comparisons, even
when sorting random noise strings.
Avoiding replacement selection *altogether*
=================================
Assuming you agree with my conclusions on replacement selection sort
mostly not being worth it, we need to avoid replacement selection
except when it'll probably allow a "quicksort with spillover". In my
mind, that's now the *only* reason to use replacement selection.
Callers now pass a hint to tuplesort, before the sort is performed,
indicating how many tuples are estimated to ultimately be passed to
it. (Typically, this comes from a scan plan node's row estimate, or
more directly from the relcache for things like CREATE INDEX.)
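
For instance, a caller like nodeSort.c might pass its row estimate
along when the Tuplesortstate is created. The trailing argument here
is an illustration of the idea only; I'm not claiming it is the exact
signature the patch uses:

	/*
	 * Sketch of the hint-passing interface.  The existing arguments are
	 * those of today's tuplesort_begin_heap(); the trailing "rowNumHint"
	 * argument is a hypothetical addition shown for illustration.
	 */
	tuplesortstate = tuplesort_begin_heap(tupDesc,
										  plannode->numCols,
										  plannode->sortColIdx,
										  plannode->sortOperators,
										  plannode->collations,
										  plannode->nullsFirst,
										  work_mem,
										  node->randomAccess,
										  plannode->plan.plan_rows); /* rowNumHint */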
Cost model -- details
----------------------------
Second or subsequent runs *never* use replacement selection -- it is
only *considered* for the first run, right before the possible point
of initial heapification within inittapes(). The cost model is
contained within the new function useselection(); see the second patch
in the series, where it is added, for full details.
I have a fairly high bar for even using replacement selection for the
first run -- several factors can result in a simple hybrid sort-merge
strategy being used instead of a "quicksort with spillover", because
in general most of the patch's benefit seems to come from avoiding CPU
cache misses rather than from saving I/O. Consider my benchmark query
once more -- with replacement selection used for its first run (e.g.,
with just the first patch in the series applied, or with the
"optimize_avoid_selection" debug GUC set to "off"), I found that the
query took over twice as long to execute, even though the
second-or-subsequent (now smaller) runs were quicksorted just the
same, and were all merged just the same.
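
Schematically, the first-run decision looks something like the sketch
below. This is a paraphrase of the idea only -- the variable names and
the way the crossover is applied are my own stand-ins, not the patch's
actual useselection() code (the real crossover is adjusted by things
like average tuple size, as the trace_sort output above hints):

	/*
	 * Illustrative paraphrase of the useselection() idea, not the actual
	 * patch code.  Replacement selection is only worth trying for the
	 * first run when the caller's row estimate suggests that the great
	 * majority of the input fits in memtuples, making a "quicksort with
	 * spillover" likely to pay off.
	 */
	static bool
	useselection(Tuplesortstate *state, double rowNumHint)
	{
		double		crossover = 0.85;	/* adjusted elsewhere, e.g. by tuple width */

		/* No usable hint?  Assume a plain hybrid sort-merge is best. */
		if (rowNumHint <= 0)
			return false;

		/*
		 * Use replacement selection only when memtuples can hold at least
		 * a "crossover" fraction of the estimated input, so that little
		 * would need to spill to tape.
		 */
		return (double) state->memtupsize > crossover * rowNumHint;
	}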
The numbers should make it obvious why I gave in to the temptation of
adding an ad-hoc, tuplesort-private cost model. At this point, I'd
rather scrap "quicksort with spillover" (and the use of replacement
selection under all possible circumstances) than scrap the idea of a
cost model. That would make more sense, even though it would give up
on the idea of saving most I/O where the work_mem threshold is only
crossed by a small amount.
Future work
=========
The first patch in the series anticipates a number of other things,
some of which are already worked out to some degree.
Asynchronous I/O
-------------------------
This patch leaves open the possibility of using something like
libaio/librt for sorting. That would probably use half of memtuples as
scratch space, while the other half is quicksorted.
Memory prefetching
---------------------------
To test what role memory prefetching is likely to have here, I attach
a custom version of my tuplesort/tuplestore prefetch patch, with
prefetching added to the "quicksort with spillover" code and to the
WRITETUP()-calling code that dumps runs in batch. This seems to help
performance measurably. However, I guess it shouldn't really be
considered part of this patch. It can follow the initial commit of the
big, base patch (or will become part of the base patch if and when
prefetching is committed first).
cost_sort() changes
--------------------------
I had every intention of making cost_sort() a continuous cost function
as part of this work. This could be justified by "quicksort with
spillover" allowing tuplesort to "blend" from internal to external
sorting as input size is gradually increased. This seemed like
something that would have significant non-obvious benefits in several
other areas. However, I've put off dealing with making any change to
cost_sort() because of concerns about the complexity of overlaying
such changes on top of the tuplesort-private cost model.
I think that this will need to be discussed in a lot more detail. As a
further matter, materialization of sort nodes will probably also
require tweaks to the costing for "quicksort with spillover". Recall
that "quicksort with spillover" can only work for !randomAccess
tuplesort callers.
Run size
------------
This patch continues to have tuplesort determine run size based on the
availability of work_mem only. It does not entirely fix the problem of
having work_mem sizing impact performance in counter-intuitive ways.
In other words, smaller work_mem sizes can still be faster. It does
make that general situation much better, though, because quicksort is
a cache oblivious algorithm. Smaller work_mem sizes are sometimes a
bit faster, but never dramatically faster.
In general, the whole idea of making run size as big as possible is
bogus, unless that enables or is likely to enable a "quicksort with
spillover". The caller-supplied row count hint I've added may in the
future be extended to determine optimal run size ahead of time, when
it's perfectly clear (leaving aside misestimation) that a fully
internal sort (or "quicksort with spillover") will not occur. This
will result in faster external sorts where additional work_mem cannot
be put to good use. As a side benefit, external sorts will not be
effectively wasting a large amount of memory.
The cost model we eventually come up with to determine optimal run
size ought to balance certain things. Assuming a one-pass merge step,
we should balance the time lost waiting on the first run, and the time
spent quicksorting the last run, against the gradual increase in cost during
the merge step. Maybe the non-use of abbreviated keys during the merge
step should also be considered. Alternatively, the run size may be
determined by a GUC that is typically sized at drive controller cache
size (e.g. 1GB) when any kind of I/O avoidance for the sort appears
impossible.
[1]: Commit a3f0b3d6 -- Peter Geoghegan
--
Peter Geoghegan
Attachments:
0001-Quicksort-when-performing-external-sorts.patch (text/x-patch)
From 1f5dd12fa0bf632598ca1e7e890a7ee581af9a9b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Wed, 29 Jul 2015 15:38:12 -0700
Subject: [PATCH 1/5] Quicksort when performing external sorts
Add "quicksort with spillover". This allows an external sort that is
within about 2x of work_mem to avoid writing out most tuples, and to
quicksort perhaps almost all tuples rather than performing a degenerate
heapsort. Often, an external sort is now only marginally more expensive
than an internal sort, which is a significant improvement, and sort
performance is made much more predictable.
In addition, have tuplesort give up on replacement selection when it
appears to not be promising. Most of the benefits of replacement
selection are seen where only one (or at most two) runs are ultimately
produced, where that would not occur with a simple hybrid sort-merge
strategy, or where incremental spilling rather than spilling in batches
allows a "quicksort with spillover" to ultimately write almost no tuples
out. It can be helpful to only produce one run in the common case where
input is already in mostly sorted order.
These cases are important, so replacement selection is still relevant.
However, since, in general, maintaining runs as a heap has been shown to
interact very negatively with modern CPUs with fast caches, it doesn't
seem worth holding on when a non-sympathetic case for replacement
selection is encountered. Therefore, when a second or subsequent run is
necessary (rather than preliminarily appearing necessary, something a
"quicksort with spillover" is often able to safely disregard), the
second and subsequent runs are also quicksorted, but dumped in batch.
Testing has shown this to be much faster in many realistic cases,
although there is no saving in I/O.
The replacement selection run-building heap is maintained inexpensively.
There is no need to distinguish between tuples that belong to the second
run as the heap property is initially maintained (they are destined to
be quicksorted along with any first run tuples still in memory -- this
is effectively an all-in-memory merge), and so second-or-subsequent-run
tuples are appended to the end of memtuples indifferently during tuple
copying, which is cache friendly. After the last tuple from the first
(on-tape) run must be dumped incrementally to the first tape, memtuples
ceases to be a heap, and although I/O cannot be avoided, everything is
still quicksorted (and dumped in batch).
---
src/backend/commands/explain.c | 13 +-
src/backend/utils/sort/tuplesort.c | 544 +++++++++++++++++++++++++++++++++----
src/include/utils/tuplesort.h | 3 +-
3 files changed, 496 insertions(+), 64 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 5d06fa4..94b1f77 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2178,20 +2178,27 @@ show_sort_info(SortState *sortstate, ExplainState *es)
const char *sortMethod;
const char *spaceType;
long spaceUsed;
+ int rowsSortedMem;
- tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+ tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed,
+ &rowsSortedMem);
if (es->format == EXPLAIN_FORMAT_TEXT)
{
appendStringInfoSpaces(es->str, es->indent * 2);
- appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
- sortMethod, spaceType, spaceUsed);
+ appendStringInfo(es->str,
+ "Sort Method: %s %s: %ldkB Rows In Memory: %d\n",
+ sortMethod,
+ spaceType,
+ spaceUsed,
+ rowsSortedMem);
}
else
{
ExplainPropertyText("Sort Method", sortMethod, es);
ExplainPropertyLong("Sort Space Used", spaceUsed, es);
ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyInteger("Rows In Memory", rowsSortedMem, es);
}
}
}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d532e87..fc4ac90 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -13,10 +13,11 @@
* See Knuth, volume 3, for more than you want to know about the external
* sorting algorithm. We divide the input into sorted runs using replacement
* selection, in the form of a priority tree implemented as a heap
- * (essentially his Algorithm 5.2.3H), then merge the runs using polyphase
- * merge, Knuth's Algorithm 5.4.2D. The logical "tapes" used by Algorithm D
- * are implemented by logtape.c, which avoids space wastage by recycling
- * disk space as soon as each block is read from its "tape".
+ * (essentially his Algorithm 5.2.3H -- although that strategy can be
+ * abandoned where it does not appear to help), then merge the runs using
+ * polyphase merge, Knuth's Algorithm 5.4.2D. The logical "tapes" used by
+ * Algorithm D are implemented by logtape.c, which avoids space wastage by
+ * recycling disk space as soon as each block is read from its "tape".
*
* We do not form the initial runs using Knuth's recommended replacement
* selection data structure (Algorithm 5.4.1R), because it uses a fixed
@@ -72,6 +73,15 @@
* to one run per logical tape. The final merge is then performed
* on-the-fly as the caller repeatedly calls tuplesort_getXXX; this
* saves one cycle of writing all the data out to disk and reading it in.
+ * Also, if only one run is spilled to tape so far when
+ * tuplesort_performsort() is reached, and if the caller does not require
+ * random access, then the merge step can take place between still
+ * in-memory tuples, and tuples stored on tape (it does not matter that
+ * there may be a second run in that array -- only that a second one has
+ * spilled). This ensures that spilling to disk only occurs for a number of
+ * tuples approximately equal to the number of tuples read in after
+ * work_mem was reached and it became apparent that an external sort is
+ * required.
*
* Before Postgres 8.2, we always used a seven-tape polyphase merge, on the
* grounds that 7 is the "sweet spot" on the tapes-to-passes curve according
@@ -86,6 +96,23 @@
* we preread from a tape, so as to maintain the locality of access described
* above. Nonetheless, with large workMem we can have many tapes.
*
+ * Before Postgres 9.6, we always used a heap for replacement selection when
+ * building runs. However, Knuth does not consider the influence of memory
+ * access on overall performance, which is a crucial consideration on modern
+ * machines; replacement selection is only really of value where a single
+ * run or two runs can be produced, sometimes avoiding a merge step
+ * entirely. Replacement selection makes this likely when tuples are read
+ * in approximately logical order, even if work_mem is only a small fraction
+ * of the requirement for an internal sort, but large main memory sizes
+ * don't benefit from tiny, incremental spills, even with enormous datasets.
+ * If, having maintained a replacement selection priority queue (heap) for
+ * the first run it transpires that there will be multiple on-tape runs
+ * anyway, we abandon treating memtuples as a heap, and quicksort and write
+ * in memtuples-sized batches. This gives us most of the advantages of
+ * always quicksorting and batch dumping runs, which can perform much better
+ * than heap sorting and incrementally spilling tuples, without giving up on
+ * replacement selection in cases where it remains compelling.
+ *
*
* Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -160,7 +187,10 @@ bool optimize_bounded_sort = true;
* described above. Accordingly, "tuple" is always used in preference to
* datum1 as the authoritative value for pass-by-reference cases.
*
- * While building initial runs, tupindex holds the tuple's run number. During
+ * While building initial runs, tupindex holds the tuple's run number.
+ * Historically, the run number could meaningfully distinguish many runs, but
+ * currently it only meaningfully distinguishes the first run with any other
+ * run, since replacement selection is abandoned after the first run. During
* merge passes, we re-use it to hold the input tape number that each tuple in
* the heap was read from, or to hold the index of the next tuple pre-read
* from the same tape in the case of pre-read entries. tupindex goes unused
@@ -186,6 +216,7 @@ typedef enum
TSS_BUILDRUNS, /* Loading tuples; writing to tape */
TSS_SORTEDINMEM, /* Sort completed entirely in memory */
TSS_SORTEDONTAPE, /* Sort completed, final run is on tape */
+ TSS_MEMTAPEMERGE, /* Performing memory/tape merge on-the-fly */
TSS_FINALMERGE /* Performing final merge on-the-fly */
} TupSortStatus;
@@ -280,6 +311,13 @@ struct Tuplesortstate
bool growmemtuples; /* memtuples' growth still underway? */
/*
+ * While building initial runs, this indicates if the replacement
+ * selection strategy or simple hybrid sort-merge strategy is in use.
+ * Replacement selection is abandoned after first run.
+ */
+ bool replaceActive;
+
+ /*
* While building initial runs, this is the current output run number
* (starting at 0). Afterwards, it is the number of initial runs we made.
*/
@@ -327,12 +365,22 @@ struct Tuplesortstate
int activeTapes; /* # of active input tapes in merge pass */
/*
+ * These variables are used after tuplesort_performsort() for the
+ * TSS_MEMTAPEMERGE case. This is a special, optimized final on-the-fly
+ * merge pass involving merging the result tape with memtuples that were
+ * quicksorted (but never made it out to a tape).
+ */
+ SortTuple tape_cache; /* cached tape tuple from prior call */
+ bool cached; /* tape_cache holds pending tape tuple */
+ bool just_memtuples; /* merge only fetching from memtuples */
+
+ /*
* These variables are used after completion of sorting to keep track of
* the next tuple to return. (In the tape case, the tape's current read
* position is also critical state.)
*/
int result_tape; /* actual tape number of finished output */
- int current; /* array index (only used if SORTEDINMEM) */
+ int current; /* memtuples array index */
bool eof_reached; /* reached EOF (needed for cursors) */
/* markpos_xxx holds marked position for mark and restore */
@@ -464,12 +512,15 @@ static void inittapes(Tuplesortstate *state);
static void selectnewtape(Tuplesortstate *state);
static void mergeruns(Tuplesortstate *state);
static void mergeonerun(Tuplesortstate *state);
+static void mergememruns(Tuplesortstate *state);
static void beginmerge(Tuplesortstate *state);
static void mergepreread(Tuplesortstate *state);
static void mergeprereadone(Tuplesortstate *state, int srcTape);
static void dumptuples(Tuplesortstate *state, bool alltuples);
+static void dumpbatch(Tuplesortstate *state, bool alltuples);
static void make_bounded_heap(Tuplesortstate *state);
static void sort_bounded_heap(Tuplesortstate *state);
+static void tuplesort_quicksort(Tuplesortstate *state);
static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
int tupleindex, bool checkIndex);
static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
@@ -1486,22 +1537,60 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
/*
* Insert the tuple into the heap, with run number currentRun if
- * it can go into the current run, else run number currentRun+1.
- * The tuple can go into the current run if it is >= the first
- * not-yet-output tuple. (Actually, it could go into the current
- * run if it is >= the most recently output tuple ... but that
- * would require keeping around the tuple we last output, and it's
- * simplest to let writetup free each tuple as soon as it's
- * written.)
+ * it can go into the current run, else run number INT_MAX (some
+ * later run). The tuple can go into the current run if it is
+ * >= the first not-yet-output tuple. (Actually, it could go
+ * into the current run if it is >= the most recently output
+ * tuple ... but that would require keeping around the tuple we
+ * last output, and it's simplest to let writetup free each
+ * tuple as soon as it's written.)
*
- * Note there will always be at least one tuple in the heap at
- * this point; see dumptuples.
+ * Note that this only applies if the currentRun is 0 (prior to
+ * giving up on heapification). There is no meaningful
+ * distinction between any two runs in memory except the first
+ * and second run. When the currentRun is not 0, there is no
+ * guarantee that any tuples are already stored in memory here,
+ * and if there are any they're in no significant order.
*/
- Assert(state->memtupcount > 0);
- if (COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
+ if (state->replaceActive &&
+ COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
+ {
+ /*
+ * Unlike classic replacement selection, which this module was
+ * previously based on, only run 0 is treated as a priority
+ * queue through heapification. The second run (run 1) is
+ * appended indifferently below, and will never be trusted to
+ * maintain the heap invariant beyond simply not getting in
+ * the way of spilling run 0 incrementally. In other words,
+ * second run tuples may be sifted out of the way of first
+ * run tuples; COMPARETUP() will never be called for run
+ * 1 tuples. However, not even HEAPCOMPARE() will be
+ * called for a subsequent run's tuples.
+ */
tuplesort_heap_insert(state, tuple, state->currentRun, true);
+ }
else
- tuplesort_heap_insert(state, tuple, state->currentRun + 1, true);
+ {
+ /*
+ * Note that unlike Knuth, we do not care about the second
+ * run's tuples when loading runs. After the first run is
+ * complete, tuples will not be dumped incrementally at all,
+ * but as long as the first run (run 0) is current it will
+ * be maintained. dumptuples does not trust that the second
+ * or subsequent runs are heapified (beyond merely not
+ * getting in the way of the first, fully heapified run,
+ * which only matters for the second run, run 1). Anything
+ * past the first run will be quicksorted.
+ *
+ * Past the first run, there is no need to differentiate runs
+ * in memory (only the first and second runs will ever be
+ * usefully differentiated). Use a generic INT_MAX run
+ * number (just to be tidy). There should always be room to
+ * store the incoming tuple.
+ */
+ tuple->tupindex = INT_MAX;
+ state->memtuples[state->memtupcount++] = *tuple;
+ }
/*
* If we are over the memory limit, dump tuples till we're under.
@@ -1576,20 +1665,9 @@ tuplesort_performsort(Tuplesortstate *state)
/*
* We were able to accumulate all the tuples within the allowed
- * amount of memory. Just qsort 'em and we're done.
+ * amount of memory. Just quicksort 'em and we're done.
*/
- if (state->memtupcount > 1)
- {
- /* Can we use the single-key sort function? */
- if (state->onlyKey != NULL)
- qsort_ssup(state->memtuples, state->memtupcount,
- state->onlyKey);
- else
- qsort_tuple(state->memtuples,
- state->memtupcount,
- state->comparetup,
- state);
- }
+ tuplesort_quicksort(state);
state->current = 0;
state->eof_reached = false;
state->markpos_offset = 0;
@@ -1616,12 +1694,26 @@ tuplesort_performsort(Tuplesortstate *state)
/*
* Finish tape-based sort. First, flush all tuples remaining in
- * memory out to tape; then merge until we have a single remaining
- * run (or, if !randomAccess, one run per tape). Note that
- * mergeruns sets the correct state->status.
+ * memory out to tape where that's required (when more than one
+ * run's tuples made it to tape, or when the caller required
+ * random access). Then, either merge until we have a single
+ * remaining run on tape, or merge runs in memory by sorting
+ * them into one single in-memory run. Note that
+ * mergeruns/mergememruns sets the correct state->status.
*/
- dumptuples(state, true);
- mergeruns(state);
+ if (state->currentRun > 0 || state->randomAccess)
+ {
+ dumptuples(state, true);
+ mergeruns(state);
+ }
+ else
+ {
+ /*
+ * Only possible for !randomAccess callers, just as with
+ * tape based on-the-fly merge
+ */
+ mergememruns(state);
+ }
state->eof_reached = false;
state->markpos_block = 0L;
state->markpos_offset = 0;
@@ -1640,6 +1732,9 @@ tuplesort_performsort(Tuplesortstate *state)
elog(LOG, "performsort done (except %d-way final merge): %s",
state->activeTapes,
pg_rusage_show(&state->ru_start));
+ else if (state->status == TSS_MEMTAPEMERGE)
+ elog(LOG, "performsort done (except memory/tape final merge): %s",
+ pg_rusage_show(&state->ru_start));
else
elog(LOG, "performsort done: %s",
pg_rusage_show(&state->ru_start));
@@ -1791,6 +1886,118 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
READTUP(state, stup, state->result_tape, tuplen);
return true;
+ case TSS_MEMTAPEMERGE:
+ Assert(forward);
+ /* For now, assume tuple returned from memory */
+ *should_free = false;
+
+ /*
+ * Should be at least one memtuple (work_mem should be roughly
+ * fully consumed)
+ */
+ Assert(state->memtupcount > 0);
+
+ if (state->eof_reached)
+ return false;
+
+ if (state->just_memtuples)
+ goto just_memtuples;
+
+ /*
+ * Merge together quicksorted memtuples array, and sorted tape.
+ *
+ * When this optimization was initially applied, the array was
+ * heapified. Some number of tuples were spilled to disk from the
+ * top of the heap irregularly, and are read from tape here in
+ * fully sorted order. memtuples usually originally contains 2
+ * runs, though, so we merge it with the on-tape run.
+ * (Quicksorting effectively merged the 2 in-memory runs into one
+ * in-memory run already)
+ *
+ * Exhaust the supply of tape tuples first.
+ *
+ * "stup" is always initially set to the current tape tuple if
+ * any remain, which may be cached from previous call, or read
+ * from tape when nothing cached.
+ */
+ if (state->cached)
+ *stup = state->tape_cache;
+ else if ((tuplen = getlen(state, state->result_tape, true)) != 0)
+ READTUP(state, stup, state->result_tape, tuplen);
+ else
+ {
+ /* Supply of tape tuples was just exhausted */
+ state->just_memtuples = true;
+ goto just_memtuples;
+ }
+
+ /*
+ * Kludge: Trigger abbreviated tie-breaker if in-memory tuples
+ * use abbreviation (writing tuples to tape never preserves
+ * abbreviated keys). Do this by assigning in-memory
+ * abbreviated tuple to tape tuple directly.
+ *
+ * It doesn't seem worth generating a new abbreviated key for
+ * the tape tuple, and this approach is simpler than
+ * "unabbreviating" the memtuple tuple from a "common" routine
+ * like this.
+ *
+ * In the future, this routine could offer an API that allows
+ * certain clients (like ordered set aggregate callers) to
+ * cheaply test *inequality* across adjacent pairs of sorted
+ * tuples on the basis of simple abbreviated key binary
+ * inequality. Another advantage of this approach is that that
+ * can still work without reporting to clients that abbreviation
+ * wasn't used. The tape tuples might only be a small minority
+ * of all tuples returned.
+ */
+ if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
+ stup->datum1 = state->memtuples[state->current].datum1;
+
+ /*
+ * Compare current tape tuple to current memtuple.
+ *
+ * Since we always start with at least one memtuple, and since tape
+ * tuples are always returned before equal memtuples, it follows
+ * that there must be at least one memtuple left to return here.
+ */
+ Assert(state->current < state->memtupcount);
+
+ if (COMPARETUP(state, stup, &state->memtuples[state->current]) <= 0)
+ {
+ /*
+ * Tape tuple less than or equal to memtuple array current
+ * position. Return it.
+ */
+ state->cached = false;
+ /* Caller can free tape tuple memory */
+ *should_free = true;
+ }
+ else
+ {
+ /*
+ * Tape tuple greater than memtuple array's current tuple.
+ *
+ * Return current memtuple tuple, and cache tape tuple for
+ * next call. It will be returned on next or subsequent
+ * call.
+ */
+ state->tape_cache = *stup;
+ state->cached = true;
+ *stup = state->memtuples[state->current++];
+ }
+ return true;
+
+just_memtuples:
+ /* Just return memtuples -- merging done */
+ if (state->current < state->memtupcount)
+ {
+ *stup = state->memtuples[state->current++];
+ return true;
+ }
+ state->eof_reached = true;
+ return false;
+
case TSS_FINALMERGE:
Assert(forward);
*should_free = true;
@@ -2000,6 +2207,7 @@ tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples, bool forward)
return false;
case TSS_SORTEDONTAPE:
+ case TSS_MEMTAPEMERGE:
case TSS_FINALMERGE:
/*
@@ -2129,6 +2337,15 @@ inittapes(Tuplesortstate *state)
state->tp_tapenum = (int *) palloc0(maxTapes * sizeof(int));
/*
+ * Give replacement selection a try. There will be a switch to a simple
+ * hybrid sort-merge strategy after the first run (iff there is to be a
+ * second on-tape run).
+ */
+ state->replaceActive = true;
+ state->cached = false;
+ state->just_memtuples = false;
+
+ /*
* Convert the unsorted contents of memtuples[] into a heap. Each tuple is
* marked as belonging to run number zero.
*
@@ -2426,6 +2643,67 @@ mergeonerun(Tuplesortstate *state)
}
/*
+ * mergememruns -- merge runs in memory into a new in-memory run.
+ *
+ * This allows tuplesort to avoid dumping many tuples in the common case
+ * where work_mem is less than 2x the amount required for an internal sort
+ * ("quicksort with spillover"). This optimization does not appear in
+ * Knuth's algorithm.
+ *
+ * Merging here actually means quicksorting, without regard to the run
+ * number of each memtuple. Note that this in-memory merge is distinct from
+ * the final on-the-fly merge step that follows. This routine merges what
+ * was originally the second part of the first run with what was originally
+ * the entire second run in advance of the on-the-fly merge step (sometimes,
+ * there will only be one run in memory, but sorting is still required).
+ * The final on-the-fly merge occurs between the new all-in-memory run
+ * created by this routine, and what was originally the first part of the
+ * first run, which is already sorted on tape.
+ *
+ * The fact that the memtuples array has already been heapified (within the
+ * first run) is no reason to commit to the path of unnecessarily dumping
+ * and heapsorting input tuples. Often, memtuples will be much larger than
+ * the final on-tape run, which is where this optimization is most
+ * effective.
+ */
+static void
+mergememruns(Tuplesortstate *state)
+{
+ Assert(state->replaceActive);
+ Assert(!state->randomAccess);
+
+ /*
+ * It doesn't seem worth being more clever in the relatively rare case
+ * where there was no second run pending (i.e. no tuple that didn't
+ * belong to the original/first "currentRun") just to avoid an
+ * on-the-fly merge step, although that might be possible with care.
+ */
+ markrunend(state, state->currentRun);
+
+ /*
+ * The usual path for quicksorting runs (quicksort just before dumping
+ * all tuples) was avoided by caller, so quicksort to merge.
+ *
+ * Note that this may use abbreviated keys, which are no longer
+ * available for the tuples that spilled to tape. This is something
+ * that the final on-the-fly merge step accounts for.
+ */
+ tuplesort_quicksort(state);
+ state->current = 0;
+
+#ifdef TRACE_SORT
+ if (trace_sort)
+ elog(LOG, "finished quicksort of %d tuples to create single in-memory run: %s",
+ state->memtupcount, pg_rusage_show(&state->ru_start));
+#endif
+
+ state->result_tape = state->tp_tapenum[state->destTape];
+ /* Must freeze and rewind the finished output tape */
+ LogicalTapeFreeze(state->tapeset, state->result_tape);
+ state->status = TSS_MEMTAPEMERGE;
+}
+
+/*
* beginmerge - initialize for a merge pass
*
* We decrease the counts of real and dummy runs for each tape, and mark
@@ -2604,14 +2882,16 @@ mergeprereadone(Tuplesortstate *state, int srcTape)
}
/*
- * dumptuples - remove tuples from heap and write to tape
+ * dumptuples - remove tuples from memtuples and write to tape
*
* This is used during initial-run building, but not during merging.
*
* When alltuples = false, dump only enough tuples to get under the
- * availMem limit (and leave at least one tuple in the heap in any case,
- * since puttuple assumes it always has a tuple to compare to). We also
- * insist there be at least one free slot in the memtuples[] array.
+ * availMem limit (and leave at least one tuple in memtuples where necessary,
+ * since puttuple sometimes assumes it is a heap that has a tuple to compare
+ * to, and the final on-the-fly in-memory merge requires one in-memory tuple at
+ * a minimum). We also insist there be at least one free slot in the
+ * memtuples[] array.
*
* When alltuples = true, dump everything currently in memory.
* (This case is only used at end of input data.)
@@ -2627,21 +2907,39 @@ dumptuples(Tuplesortstate *state, bool alltuples)
(LACKMEM(state) && state->memtupcount > 1) ||
state->memtupcount >= state->memtupsize)
{
+ if (state->replaceActive && !alltuples)
+ {
+ /*
+ * Still holding out for a case favorable to replacement selection;
+ * perhaps there will be a single-run in the event of almost sorted
+ * input, or perhaps work_mem hasn't been exceeded by too much, and
+ * a "quicksort with spillover" remains possible.
+ *
+ * Dump the heap's frontmost entry, and sift up to remove it from
+ * the heap.
+ */
+ Assert(state->memtupcount > 1);
+ WRITETUP(state, state->tp_tapenum[state->destTape],
+ &state->memtuples[0]);
+ tuplesort_heap_siftup(state, true);
+ }
+ else
+ {
+ /*
+ * Once committed to quicksorting runs, never incrementally
+ * spill
+ */
+ dumpbatch(state, alltuples);
+ break;
+ }
+
/*
- * Dump the heap's frontmost entry, and sift up to remove it from the
- * heap.
+ * If top run number has changed, we've finished the current run
+ * (this can only be the first run, run 0), and will no longer spill
+ * incrementally.
*/
Assert(state->memtupcount > 0);
- WRITETUP(state, state->tp_tapenum[state->destTape],
- &state->memtuples[0]);
- tuplesort_heap_siftup(state, true);
-
- /*
- * If the heap is empty *or* top run number has changed, we've
- * finished the current run.
- */
- if (state->memtupcount == 0 ||
- state->currentRun != state->memtuples[0].tupindex)
+ if (state->memtuples[0].tupindex != 0)
{
markrunend(state, state->tp_tapenum[state->destTape]);
state->currentRun++;
@@ -2650,24 +2948,87 @@ dumptuples(Tuplesortstate *state, bool alltuples)
#ifdef TRACE_SORT
if (trace_sort)
- elog(LOG, "finished writing%s run %d to tape %d: %s",
- (state->memtupcount == 0) ? " final" : "",
+ elog(LOG, "finished writing heapsorted run %d to tape %d: %s",
state->currentRun, state->destTape,
pg_rusage_show(&state->ru_start));
#endif
/*
- * Done if heap is empty, else prepare for new run.
+ * Heap cannot be empty, so prepare for new run and give up on
+ * replacement selection.
*/
- if (state->memtupcount == 0)
- break;
- Assert(state->currentRun == state->memtuples[0].tupindex);
selectnewtape(state);
+ /* All future runs will only use dumpbatch/quicksort */
+ state->replaceActive = false;
}
}
}
/*
+ * dumpbatch - sort and dump all memtuples, forming one run on tape
+ *
+ * Unlike classic replacement selection sort, second or subsequent runs are
+ * never heapified by this module (although heapification still respects run
+ * number differences between the first and second runs). This helper
+ * handles the case where replacement selection is abandoned, and all tuples
+ * are quicksorted and dumped in memtuples-sized batches. This alternative
+ * strategy is a simple hybrid sort-merge strategy, with quicksorting of
+ * memtuples-sized runs.
+ *
+ * In rare cases, this routine may add to an on-tape run already storing
+ * tuples.
+ */
+static void
+dumpbatch(Tuplesortstate *state, bool alltuples)
+{
+ int memtupwrite;
+ int i;
+
+ Assert(state->status == TSS_BUILDRUNS);
+
+ /* Final call might be unnecessary */
+ if (state->memtupcount == 0)
+ {
+ Assert(alltuples);
+ return;
+ }
+ tuplesort_quicksort(state);
+ state->currentRun++;
+
+#ifdef TRACE_SORT
+ if (trace_sort)
+ elog(LOG, "finished quicksorting run %d: %s",
+ state->currentRun, pg_rusage_show(&state->ru_start));
+#endif
+
+ /*
+ * This should be adopted to perform asynchronous I/O one day, as
+ * dumping in batch represents a good opportunity to overlap I/O
+ * and computation.
+ */
+ memtupwrite = state->memtupcount;
+ for (i = 0; i < memtupwrite; i++)
+ {
+ WRITETUP(state, state->tp_tapenum[state->destTape],
+ &state->memtuples[i]);
+ state->memtupcount--;
+ }
+ markrunend(state, state->tp_tapenum[state->destTape]);
+ state->tp_runs[state->destTape]++;
+ state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
+
+#ifdef TRACE_SORT
+ if (trace_sort)
+ elog(LOG, "finished writing quicksorted run %d to tape %d: %s",
+ state->currentRun, state->destTape,
+ pg_rusage_show(&state->ru_start));
+#endif
+
+ if (!alltuples)
+ selectnewtape(state);
+}
+
+/*
* tuplesort_rescan - rewind and replay the scan
*/
void
@@ -2777,7 +3138,8 @@ void
tuplesort_get_stats(Tuplesortstate *state,
const char **sortMethod,
const char **spaceType,
- long *spaceUsed)
+ long *spaceUsed,
+ int *rowsSortedMem)
{
/*
* Note: it might seem we should provide both memory and disk usage for a
@@ -2806,15 +3168,23 @@ tuplesort_get_stats(Tuplesortstate *state,
*sortMethod = "top-N heapsort";
else
*sortMethod = "quicksort";
+ *rowsSortedMem = state->memtupcount;
break;
case TSS_SORTEDONTAPE:
*sortMethod = "external sort";
+ *rowsSortedMem = 0;
+ break;
+ case TSS_MEMTAPEMERGE:
+ *sortMethod = "quicksort with spillover";
+ *rowsSortedMem = state->memtupcount;
break;
case TSS_FINALMERGE:
*sortMethod = "external merge";
+ *rowsSortedMem = 0;
break;
default:
*sortMethod = "still in progress";
+ *rowsSortedMem = -1;
break;
}
}
@@ -2825,10 +3195,19 @@ tuplesort_get_stats(Tuplesortstate *state,
*
* Compare two SortTuples. If checkIndex is true, use the tuple index
* as the front of the sort key; otherwise, no.
+ *
+ * Note that for checkIndex callers, the heap invariant is never maintained
+ * beyond the first run, and so there are no COMPARETUP() calls beyond the
+ * first run. It is assumed that checkIndex callers are maintaining the
+ * heap invariant for a replacement selection priority queue, but those
+ * callers do not go on to trust the heap to be fully-heapified past the
+ * first run. Once currentRun isn't the first, memtuples is no longer a
+ * heap at all.
*/
#define HEAPCOMPARE(tup1,tup2) \
- (checkIndex && ((tup1)->tupindex != (tup2)->tupindex) ? \
+ (checkIndex && ((tup1)->tupindex != (tup2)->tupindex || \
+ (tup1)->tupindex != 0) ? \
((tup1)->tupindex) - ((tup2)->tupindex) : \
COMPARETUP(state, tup1, tup2))
@@ -2927,6 +3306,33 @@ sort_bounded_heap(Tuplesortstate *state)
}
/*
+ * Sort all memtuples using quicksort.
+ *
+ * Quicksort is tuplesort's internal sort algorithm. It is also generally
+ * preferred to replacement selection of runs during external sorts, except
+ * where incrementally spilling may be particularly beneficial. Quicksort
+ * will generally be much faster than replacement selection's heapsort
+ * because modern CPUs are usually bottlenecked on memory access, and
+ * quicksort is a cache-oblivious algorithm.
+ */
+static void
+tuplesort_quicksort(Tuplesortstate *state)
+{
+ if (state->memtupcount > 1)
+ {
+ /* Can we use the single-key sort function? */
+ if (state->onlyKey != NULL)
+ qsort_ssup(state->memtuples, state->memtupcount,
+ state->onlyKey);
+ else
+ qsort_tuple(state->memtuples,
+ state->memtupcount,
+ state->comparetup,
+ state);
+ }
+}
+
+/*
* Insert a new tuple into an empty or existing heap, maintaining the
* heap invariant. Caller is responsible for ensuring there's room.
*
@@ -2954,6 +3360,17 @@ tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
memtuples = state->memtuples;
Assert(state->memtupcount < state->memtupsize);
+ /*
+ * Once incremental heap spilling is abandoned, this routine should not be
+ * called when loading runs. memtuples will be an array of tuples in no
+ * significant order, so calling here is inappropriate. Even when
+ * incremental spilling is still in progress, this routine does not handle
+ * the second run's tuples (those are heapified to a limited extent that
+ * they are appended, and thus kept away from those tuples in the first
+ * run).
+ */
+ Assert(!checkIndex || tupleindex == 0);
+
CHECK_FOR_INTERRUPTS();
/*
@@ -2985,6 +3402,13 @@ tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex)
int i,
n;
+ /*
+ * Once incremental heap spilling is abandoned, this routine should not be
+ * called when loading runs. memtuples will be an array of tuples in no
+ * significant order, so calling here is inappropriate.
+ */
+ Assert(!checkIndex || state->currentRun == 0);
+
if (--state->memtupcount <= 0)
return;
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index de6fc56..3679815 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -109,7 +109,8 @@ extern void tuplesort_end(Tuplesortstate *state);
extern void tuplesort_get_stats(Tuplesortstate *state,
const char **sortMethod,
const char **spaceType,
- long *spaceUsed);
+ long *spaceUsed,
+ int *rowsSortedMem);
extern int tuplesort_merge_order(int64 allowedMem);
--
1.9.1
0005-Add-cursory-regression-tests-for-sorting.patch (text/x-patch)
From d861c20bc2dab02d81040c9a40252b2e5ad08ef3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Wed, 29 Jul 2015 15:38:12 -0700
Subject: [PATCH 5/5] Add cursory regression tests for sorting
This is not intended to be a formal patch submission. Tests are added
that happened to be useful during the development of the "quicksort with
spillover" patch, as the regression tests currently have precisely zero
coverage for any variety of external sort operation. The tests are
provided as a convenience to reviewers of that patch only.
In the long run, there should be comprehensive smoke-testing of these
cases (probably not in the standard regression tests), but this patch
does not pretend to be any kind of basis for that.
The tests added have a number of obvious problems:
* They take far too long to run to be in the standard regression test
suite, and yet they're run as part of that suite. With a little effort,
they could be made to run faster with no appreciable loss of coverage,
but that didn't happen.
* They're far from comprehensive.
* The tests require several GB of disk space to run.
* They're not portable. They might even be extremely non-portable due
to implementation differences across platform pseudo-random number
generators.
The important point is that each query tested gives consistent results.
There is no reason to think that varying work_mem settings will affect
which basic approach to sorting each query takes as compared to during
my original testing (at least assuming a 64-bit platform), which is also
important.
---
src/test/regress/expected/sorting.out | 309 ++++++++++++++++++++++++++++++++++
src/test/regress/parallel_schedule | 1 +
src/test/regress/serial_schedule | 1 +
src/test/regress/sql/sorting.sql | 115 +++++++++++++
4 files changed, 426 insertions(+)
create mode 100644 src/test/regress/expected/sorting.out
create mode 100644 src/test/regress/sql/sorting.sql
diff --git a/src/test/regress/expected/sorting.out b/src/test/regress/expected/sorting.out
new file mode 100644
index 0000000..6db00b7
--- /dev/null
+++ b/src/test/regress/expected/sorting.out
@@ -0,0 +1,309 @@
+--
+-- sorting tests
+--
+-- Seed PRNG; this probably isn't portable
+select setseed(1);
+ setseed
+---------
+
+(1 row)
+
+--
+-- int4 test (10 million tuples, medium cardinality)
+--
+create unlogged table int4_sort_tbl as
+ select (random() * 1000000)::int4 s, 'abcdefghijlmn'::text junk
+ from generate_series(1, 10000000);
+--
+-- int4 test (10 million tuples, high cardinality)
+--
+create unlogged table highcard_int4_sort_tbl as
+ select (random() * 100000000)::int4 s, 'abcdefghijlmn'::text junk
+ from generate_series(1, 10000000);
+--
+-- int4 test (10 million tuples, low cardinality)
+--
+create unlogged table lowcard_int4_sort_tbl as
+ select (random() * 100)::int4 s, 'abcdefghijlmn'::text junk
+ from generate_series(1, 10000000);
+--
+-- int4 test (10 million tuples, medium cardinality, correlated)
+--
+create unlogged table int4_sort_tbl_correlated as
+ select (random() * 1000000)::int4 s, 'abcdefghijlmn'::text junk
+ from generate_series(1, 10000000) order by 1 asc;
+--
+-- int4 test (10 million tuples, medium cardinality, inverse correlated)
+--
+create unlogged table int4_sort_tbl_inverse as
+ select (random() * 1000000)::int4 s, 'abcdefghijlmn'::text junk
+ from generate_series(1, 10000000) order by 1 desc;
+-- Results should be consistent:
+set work_mem = '64MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+ count
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count
+-------
+ 101
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_correlated;
+ count
+--------
+ 999963
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_inverse;
+ count
+--------
+ 999952
+(1 row)
+
+set work_mem = '100MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+ count
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count
+-------
+ 101
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_correlated;
+ count
+--------
+ 999963
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_inverse;
+ count
+--------
+ 999952
+(1 row)
+
+set work_mem = '110MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+ count
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count
+-------
+ 101
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_correlated;
+ count
+--------
+ 999963
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_inverse;
+ count
+--------
+ 999952
+(1 row)
+
+set work_mem = '128MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+ count
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count
+-------
+ 101
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_correlated;
+ count
+--------
+ 999963
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_inverse;
+ count
+--------
+ 999952
+(1 row)
+
+set work_mem = '140MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+ count
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count
+-------
+ 101
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_correlated;
+ count
+--------
+ 999963
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_inverse;
+ count
+--------
+ 999952
+(1 row)
+
+set work_mem = '150MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+ count
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count
+-------
+ 101
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_correlated;
+ count
+--------
+ 999963
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_inverse;
+ count
+--------
+ 999952
+(1 row)
+
+-- should be in-memory:
+set work_mem = '512MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+ count
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count
+-------
+ 101
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_correlated;
+ count
+--------
+ 999963
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_inverse;
+ count
+--------
+ 999952
+(1 row)
+
+--
+-- text test (uses abbreviated keys, 10 million tuples)
+--
+select setseed(1);
+ setseed
+---------
+
+(1 row)
+
+create unlogged table text_sort_tbl as
+ select (random() * 100000)::int4::text s
+ from generate_series(1, 10000000);
+-- Start with sort that results in 3-way final merge:
+set work_mem = '190MB';
+select count(distinct(s)) from text_sort_tbl;
+ count
+--------
+ 100001
+(1 row)
+
+-- Uses optimization where it's marginal:
+set work_mem = '200MB';
+select count(distinct(s)) from text_sort_tbl;
+ count
+--------
+ 100001
+(1 row)
+
+-- Uses optimization where it's favorable:
+set work_mem = '450MB';
+select count(distinct(s)) from text_sort_tbl;
+ count
+--------
+ 100001
+(1 row)
+
+-- internal sort:
+set work_mem = '500MB';
+select count(distinct(s)) from text_sort_tbl;
+ count
+--------
+ 100001
+(1 row)
+
+drop table int4_sort_tbl;
+drop table highcard_int4_sort_tbl;
+drop table lowcard_int4_sort_tbl;
+drop table text_sort_tbl;
+drop table int4_sort_tbl_correlated;
+drop table int4_sort_tbl_inverse;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 4df15de..7ff656a 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -37,6 +37,7 @@ test: geometry horology regex oidjoins type_sanity opr_sanity
# ----------
test: insert
test: insert_conflict
+test: sorting
test: create_function_1
test: create_type
test: create_table
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 15d74d4..ebe7de0 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -51,6 +51,7 @@ test: type_sanity
test: opr_sanity
test: insert
test: insert_conflict
+test: sorting
test: create_function_1
test: create_type
test: create_table
diff --git a/src/test/regress/sql/sorting.sql b/src/test/regress/sql/sorting.sql
new file mode 100644
index 0000000..ae51b97
--- /dev/null
+++ b/src/test/regress/sql/sorting.sql
@@ -0,0 +1,115 @@
+--
+-- sorting tests
+--
+
+-- Seed PRNG; this probably isn't portable
+select setseed(1);
+
+--
+-- int4 test (10 million tuples, medium cardinality)
+--
+create unlogged table int4_sort_tbl as
+ select (random() * 1000000)::int4 s, 'abcdefghijlmn'::text junk
+ from generate_series(1, 10000000);
+
+--
+-- int4 test (10 million tuples, high cardinality)
+--
+create unlogged table highcard_int4_sort_tbl as
+ select (random() * 100000000)::int4 s, 'abcdefghijlmn'::text junk
+ from generate_series(1, 10000000);
+
+--
+-- int4 test (10 million tuples, low cardinality)
+--
+create unlogged table lowcard_int4_sort_tbl as
+ select (random() * 100)::int4 s, 'abcdefghijlmn'::text junk
+ from generate_series(1, 10000000);
+
+--
+-- int4 test (10 million tuples, medium cardinality, correlated)
+--
+create unlogged table int4_sort_tbl_correlated as
+ select (random() * 1000000)::int4 s, 'abcdefghijlmn'::text junk
+ from generate_series(1, 10000000) order by 1 asc;
+
+--
+-- int4 test (10 million tuples, medium cardinality, inverse correlated)
+--
+create unlogged table int4_sort_tbl_inverse as
+ select (random() * 1000000)::int4 s, 'abcdefghijlmn'::text junk
+ from generate_series(1, 10000000) order by 1 desc;
+
+-- Results should be consistent:
+set work_mem = '64MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+select count(distinct(s)) from int4_sort_tbl_correlated;
+select count(distinct(s)) from int4_sort_tbl_inverse;
+set work_mem = '100MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+select count(distinct(s)) from int4_sort_tbl_correlated;
+select count(distinct(s)) from int4_sort_tbl_inverse;
+set work_mem = '110MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+select count(distinct(s)) from int4_sort_tbl_correlated;
+select count(distinct(s)) from int4_sort_tbl_inverse;
+set work_mem = '128MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+select count(distinct(s)) from int4_sort_tbl_correlated;
+select count(distinct(s)) from int4_sort_tbl_inverse;
+set work_mem = '140MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+select count(distinct(s)) from int4_sort_tbl_correlated;
+select count(distinct(s)) from int4_sort_tbl_inverse;
+set work_mem = '150MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+select count(distinct(s)) from int4_sort_tbl_correlated;
+select count(distinct(s)) from int4_sort_tbl_inverse;
+-- should be in-memory:
+set work_mem = '512MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+select count(distinct(s)) from int4_sort_tbl_correlated;
+select count(distinct(s)) from int4_sort_tbl_inverse;
+
+--
+-- text test (uses abbreviated keys, 10 million tuples)
+--
+select setseed(1);
+create unlogged table text_sort_tbl as
+ select (random() * 100000)::int4::text s
+ from generate_series(1, 10000000);
+
+-- Start with sort that results in 3-way final merge:
+set work_mem = '190MB';
+select count(distinct(s)) from text_sort_tbl;
+-- Uses optimization where it's marginal:
+set work_mem = '200MB';
+select count(distinct(s)) from text_sort_tbl;
+-- Uses optimization where it's favorable:
+set work_mem = '450MB';
+select count(distinct(s)) from text_sort_tbl;
+-- internal sort:
+set work_mem = '500MB';
+select count(distinct(s)) from text_sort_tbl;
+
+drop table int4_sort_tbl;
+drop table highcard_int4_sort_tbl;
+drop table lowcard_int4_sort_tbl;
+drop table text_sort_tbl;
+drop table int4_sort_tbl_correlated;
+drop table int4_sort_tbl_inverse;
+
--
1.9.1
Attachment: 0004-Prefetch-from-memtuples-array-in-tuplesort.patch (text/x-patch)
From f7befa582960c0034b24180c65eadb4ffc4dfe6d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH 4/5] Prefetch from memtuples array in tuplesort
This patch is almost the same as a canonical version, which appears
here: https://commitfest.postgresql.org/6/305/
This version adds some additional tricks specific to new cases for
external sorts. Of particular interest here is the prefetching of each
"tuple proper" during writing of batches of tuples. This is not
intended to be reviewed as part of the external sorting work, and is
provided only as a convenience to reviewers who would like to see where
prefetching can help with external sorts, too.
Original canonical version details:
Testing shows that prefetching the "tuple proper" of a slightly later
SortTuple in the memtuples array, during each of many sequential,
in-logical-order SortTuple fetches, considerably speeds up various
sort-intensive operations. For example, B-Tree index builds are
accelerated as leaf pages are created from the memtuples array (i.e.
during the step that follows actually "performing" the sort, but that
happens before a tuplesort_end() call is made as a B-Tree spool is
destroyed). Similarly, ordered set aggregates (all cases except the
datum sort case with a pass-by-value type) and regular heap tuplesorts
benefit to about the same degree. The optimization is only used when
sorts fit in memory, though.
Also, prefetch a few places ahead within the analogous "fetching" point
in tuplestore.c. This appears to offer similar benefits in certain
cases. For example, queries involving large common table expressions
significantly benefit.
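As a rough, hedged illustration of the access pattern described above
(this is not code from the patch: DemoSortTuple, consume(), and the
prefetch distance of three are invented for the sketch, and a real
build would only use __builtin_prefetch after a configure test):

#include <stddef.h>

/* Stand-in for a SortTuple: a key kept inline, plus an out-of-line tuple */
typedef struct DemoSortTuple
{
    int     key;
    int    *tuple;      /* the "tuple proper", allocated separately */
} DemoSortTuple;

/* Stand-in for whatever per-tuple work a caller does with each fetch */
static long
consume(const int *tuple)
{
    return tuple ? *tuple : 0;
}

/* Fetch tuples in their sorted order, hinting the cache a few elements ahead */
long
fetch_all(DemoSortTuple *tuples, int ntuples)
{
    long    total = 0;
    int     i;

    for (i = 0; i < ntuples; i++)
    {
#if defined(__GNUC__) || defined(__clang__)
        /* read prefetch, low temporal locality, three elements ahead */
        if (i + 3 < ntuples && tuples[i + 3].tuple != NULL)
            __builtin_prefetch(tuples[i + 3].tuple, 0, 0);
#endif
        total += consume(tuples[i].tuple);
    }
    return total;
}

The only point is that the prefetch hides the cache miss on a later
tuple behind the work done on the current one; the patch applies the
same idea inside tuplesort_gettuple_common(), dumpbatch() and
tuplestore_gettuple().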
---
config/c-compiler.m4 | 17 +++++++++++++++++
configure | 30 ++++++++++++++++++++++++++++++
configure.in | 1 +
src/backend/utils/sort/tuplesort.c | 36 ++++++++++++++++++++++++++++++++++++
src/backend/utils/sort/tuplestore.c | 13 +++++++++++++
src/include/c.h | 14 ++++++++++++++
src/include/pg_config.h.in | 3 +++
src/include/pg_config.h.win32 | 3 +++
src/include/pg_config_manual.h | 10 ++++++++++
9 files changed, 127 insertions(+)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 397e1b0..c730da5 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -253,6 +253,23 @@ fi])# PGAC_C_BUILTIN_UNREACHABLE
+# PGAC_C_BUILTIN_PREFETCH
+# -------------------------
+# Check if the C compiler understands __builtin_prefetch(),
+# and define HAVE__BUILTIN_PREFETCH if so.
+AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
+[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
+[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],
+[int i = 0;__builtin_prefetch(&i, 0, 3);])],
+[pgac_cv__builtin_prefetch=yes],
+[pgac_cv__builtin_prefetch=no])])
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
+ [Define to 1 if your compiler understands __builtin_prefetch.])
+fi])# PGAC_C_BUILTIN_PREFETCH
+
+
+
# PGAC_C_VA_ARGS
# --------------
# Check if the C compiler understands C99-style variadic macros,
diff --git a/configure b/configure
index ebb5cac..a3a413f 100755
--- a/configure
+++ b/configure
@@ -11315,6 +11315,36 @@ if test x"$pgac_cv__builtin_unreachable" = xyes ; then
$as_echo "#define HAVE__BUILTIN_UNREACHABLE 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+int
+main ()
+{
+int i = 0;__builtin_prefetch(&i, 0, 3);
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+ pgac_cv__builtin_prefetch=yes
+else
+ pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_PREFETCH 1" >>confdefs.h
+
+fi
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __VA_ARGS__" >&5
$as_echo_n "checking for __VA_ARGS__... " >&6; }
if ${pgac_cv__va_args+:} false; then :
diff --git a/configure.in b/configure.in
index a28f9dd..778dd61 100644
--- a/configure.in
+++ b/configure.in
@@ -1319,6 +1319,7 @@ PGAC_C_TYPES_COMPATIBLE
PGAC_C_BUILTIN_BSWAP32
PGAC_C_BUILTIN_CONSTANT_P
PGAC_C_BUILTIN_UNREACHABLE
+PGAC_C_BUILTIN_PREFETCH
PGAC_C_VA_ARGS
PGAC_STRUCT_TIMEZONE
PGAC_UNION_SEMUN
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index f856cf0..e60f561 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1796,6 +1796,26 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
if (state->current < state->memtupcount)
{
*stup = state->memtuples[state->current++];
+
+ /*
+ * Perform memory prefetch of "tuple proper" of the
+ * SortTuple that's three places ahead of current
+ * (which is returned to caller). Testing shows that
+ * this significantly boosts the performance for
+ * TSS_SORTEDINMEM "forward" callers by hiding memory
+ * latency behind their processing of returned tuples.
+ *
+ * Don't do this for pass-by-value datum sorts; even
+ * though hinting a NULL address does not affect
+ * correctness, it would have a noticeable overhead
+ * here.
+ */
+#ifdef USE_MEM_PREFETCH
+ if (stup->tuple != NULL &&
+ state->current + 2 < state->memtupcount)
+ pg_rfetch(state->memtuples[state->current + 2].tuple);
+#endif
+
return true;
}
state->eof_reached = true;
@@ -2024,6 +2044,17 @@ just_memtuples:
if (state->current < state->memtupcount)
{
*stup = state->memtuples[state->current++];
+
+ /*
+ * Once this point is reached, rationale for memory
+ * prefetching is identical to TSS_SORTEDINMEM case.
+ */
+#ifdef USE_MEM_PREFETCH
+ if (stup->tuple != NULL &&
+ state->current + 2 < state->memtupcount)
+ pg_rfetch(state->memtuples[state->current + 2].tuple);
+#endif
+
return true;
}
state->eof_reached = true;
@@ -3142,6 +3173,11 @@ dumpbatch(Tuplesortstate *state, bool alltuples)
WRITETUP(state, state->tp_tapenum[state->destTape],
&state->memtuples[i]);
state->memtupcount--;
+
+#ifdef USE_MEM_PREFETCH
+ if (state->memtuples[i].tuple != NULL && i + 2 < memtupwrite)
+ pg_rfetch(state->memtuples[i + 2].tuple);
+#endif
}
markrunend(state, state->tp_tapenum[state->destTape]);
state->tp_runs[state->destTape]++;
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 51f474d..15f956d 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -902,6 +902,19 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
return NULL;
if (readptr->current < state->memtupcount)
{
+ /*
+ * Perform memory prefetch of tuple that's three places
+ * ahead of current (which is returned to caller).
+ * Testing shows that this significantly boosts the
+ * performance for TSS_INMEM "forward" callers by
+ * hiding memory latency behind their processing of
+ * returned tuples.
+ */
+#ifdef USE_MEM_PREFETCH
+ if (readptr->current + 3 < state->memtupcount)
+ pg_rfetch(state->memtuples[readptr->current + 3]);
+#endif
+
/* We have another tuple, so return it */
return state->memtuples[readptr->current++];
}
diff --git a/src/include/c.h b/src/include/c.h
index b719eb9..67e3063 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -932,6 +932,20 @@ typedef NameData *Name;
#define pg_unreachable() abort()
#endif
+/*
+ * Prefetch support -- Support memory prefetching hints on some platforms.
+ *
+ * pg_rfetch() is specialized for the case where an array is accessed
+ * sequentially, and we can prefetch a pointer within the next element (or an
+ * even later element) in order to hide memory latency. This case involves
+ * prefetching addresses with low temporal locality. Note that it's rather
+ * difficult to get any kind of speedup using pg_rfetch(); any use of the
+ * intrinsic should be carefully tested. Also note that it's okay to pass it
+ * an invalid or NULL address, although it's best avoided.
+ */
+#if defined(USE_MEM_PREFETCH)
+#define pg_rfetch(addr) __builtin_prefetch((addr), 0, 0)
+#endif
/* ----------------------------------------------------------------
* Section 8: random stuff
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 9285c62..a8e5683 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -669,6 +669,9 @@
/* Define to 1 if your compiler understands __builtin_constant_p. */
#undef HAVE__BUILTIN_CONSTANT_P
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
/* Define to 1 if your compiler understands __builtin_types_compatible_p. */
#undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index ad61392..a2f6eb3 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -523,6 +523,9 @@
/* Define to 1 if your compiler understands __builtin_constant_p. */
/* #undef HAVE__BUILTIN_CONSTANT_P */
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
/* Define to 1 if your compiler understands __builtin_types_compatible_p. */
/* #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P */
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index e278fa0..4c7b1d5 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -153,6 +153,16 @@
#endif
/*
+ * USE_MEM_PREFETCH controls whether Postgres will attempt to use memory
+ * prefetching. Usually the automatic configure tests are sufficient, but
+ * it's conceivable that using prefetching is counter-productive on some
+ * platforms. If necessary you can remove the #define here.
+ */
+#ifdef HAVE__BUILTIN_PREFETCH
+#define USE_MEM_PREFETCH
+#endif
+
+/*
* USE_SSL code should be compiled only when compiling with an SSL
* implementation. (Currently, only OpenSSL is supported, but we might add
* more implementations in the future.)
--
1.9.1
Attachment: 0003-Log-requirement-for-multiple-external-sort-passes.patch (text/x-patch)
From 7ef577fb46984fc7e92be54d83355454490c0217 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Sun, 16 Aug 2015 21:17:16 -0700
Subject: [PATCH 3/5] Log requirement for multiple external sort passes
The new log message warns users about a sort requiring multiple passes.
This is in the same spirit as checkpoint_warning. It seems very
ill-advised to ever attempt a sort that will require multiple passes on
contemporary hardware, since that can greatly increase the amount of I/O
required, and yet can only occur when available memory is a small
fraction of what is required for a fully internal sort.
A new GUC, multipass_warning, controls this log message. The default is
'on'. Also, a new debug GUC (not available in a standard build) is added
for controlling whether replacement selection can be avoided for the
first run.
During review, this patch may be useful for highlighting how effectively
replacement selection sort prevents multiple passes during the merge
step (relative to a hybrid sort-merge strategy) in practice.
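As a hedged sketch of how a reviewer might provoke the new message (the
table, row count and work_mem value are arbitrary choices that merely
make a multi-pass merge likely on a default build; exact thresholds
depend on build constants, and the GUC is superuser-settable):

set multipass_warning = on;
set work_mem = '1MB';   -- deliberately tiny
create unlogged table multipass_demo as
  select (random() * 100000000)::int4 s
  from generate_series(1, 5000000);
-- the datum sort spills to many runs; a LOG message should appear
-- in the server log while the runs are merged
select count(distinct(s)) from multipass_demo;
drop table multipass_demo;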
---
doc/src/sgml/config.sgml | 22 +++++++++++++++++++++
src/backend/utils/misc/guc.c | 29 ++++++++++++++++++++++++---
src/backend/utils/sort/tuplesort.c | 40 ++++++++++++++++++++++++++++++++++++--
src/include/utils/guc.h | 2 ++
4 files changed, 88 insertions(+), 5 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e900dccb..3cd94a7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1556,6 +1556,28 @@ include_dir 'conf.d'
<title>Disk</title>
<variablelist>
+ <varlistentry id="guc-multipass-warning" xreflabel="multipass_warning">
+ <term><varname>multipass_warning</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>multipass_warning</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Write a message to the server log if an external sort
+ operation requires multiple passes (which suggests that
+ <varname>work_mem</> or <varname>maintenance_work_mem</> may
+ need to be raised). Only a small fraction of the memory
+ required for an internal sort is required for an external sort
+ that makes no more than a single pass (typically less than
+ 1%). Since multi-pass sorts are often much slower, it is
+ advisable to avoid them altogether whenever possible.
+ The default setting is <literal>on</>.
+ Only superusers can change this setting.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-temp-file-limit" xreflabel="temp_file_limit">
<term><varname>temp_file_limit</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3dac51..3302648 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -115,8 +115,9 @@ extern bool synchronize_seqscans;
#ifdef TRACE_SYNCSCAN
extern bool trace_syncscan;
#endif
-#ifdef DEBUG_BOUNDED_SORT
+#ifdef DEBUG_SORT
extern bool optimize_bounded_sort;
+extern bool optimize_avoid_selection;
#endif
static int GUC_check_errcode_value;
@@ -1041,6 +1042,16 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"multipass_warning", PGC_SUSET, LOGGING_WHAT,
+ gettext_noop("Enables warnings if external sorts require more than one pass."),
+ gettext_noop("Write a message to the server log if more than one pass is required "
+ "for an external sort operation.")
+ },
+ &multipass_warning,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"debug_assertions", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows whether the running server has assertion checks enabled."),
NULL,
@@ -1449,8 +1460,8 @@ static struct config_bool ConfigureNamesBool[] =
},
#endif
-#ifdef DEBUG_BOUNDED_SORT
- /* this is undocumented because not exposed in a standard build */
+#ifdef DEBUG_SORT
+ /* these are undocumented because not exposed in a standard build */
{
{
"optimize_bounded_sort", PGC_USERSET, QUERY_TUNING_METHOD,
@@ -1462,6 +1473,18 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+
+ {
+ {
+ "optimize_avoid_selection", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enable avoiding replacement selection using heap sort."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &optimize_avoid_selection,
+ true,
+ NULL, NULL, NULL
+ },
#endif
#ifdef WAL_DEBUG
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 6d766d2..f856cf0 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -160,8 +160,11 @@
bool trace_sort = false;
#endif
-#ifdef DEBUG_BOUNDED_SORT
+bool multipass_warning = true;
+
+#ifdef DEBUG_SORT
bool optimize_bounded_sort = true;
+bool optimize_avoid_selection = true;
#endif
@@ -250,6 +253,7 @@ struct Tuplesortstate
{
TupSortStatus status; /* enumerated value as shown above */
int nKeys; /* number of columns in sort key */
+ bool querySort; /* sort associated with query execution */
double rowNumHint; /* caller's hint of total # of rows */
bool randomAccess; /* did caller request random access? */
bool bounded; /* did caller specify a maximum number of
@@ -697,6 +701,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
#endif
state->nKeys = nkeys;
+ state->querySort = true;
TRACE_POSTGRESQL_SORT_START(HEAP_SORT,
false, /* no unique check */
@@ -771,6 +776,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
#endif
state->nKeys = RelationGetNumberOfAttributes(indexRel);
+ state->querySort = false;
TRACE_POSTGRESQL_SORT_START(CLUSTER_SORT,
false, /* no unique check */
@@ -864,6 +870,7 @@ tuplesort_begin_index_btree(Relation heapRel,
#endif
state->nKeys = RelationGetNumberOfAttributes(indexRel);
+ state->querySort = false;
TRACE_POSTGRESQL_SORT_START(INDEX_SORT,
enforceUnique,
@@ -939,6 +946,7 @@ tuplesort_begin_index_hash(Relation heapRel,
#endif
state->nKeys = 1; /* Only one sort column, the hash code */
+ state->querySort = false;
state->comparetup = comparetup_index_hash;
state->copytup = copytup_index;
@@ -976,6 +984,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
#endif
state->nKeys = 1; /* always a one-column sort */
+ state->querySort = true;
TRACE_POSTGRESQL_SORT_START(DATUM_SORT,
false, /* no unique check */
@@ -1042,7 +1051,7 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
Assert(state->memtupcount == 0);
Assert(!state->bounded);
-#ifdef DEBUG_BOUNDED_SORT
+#ifdef DEBUG_SORT
/* Honor GUC setting that disables the feature (for easy testing) */
if (!optimize_bounded_sort)
return;
@@ -2309,6 +2318,12 @@ useselection(Tuplesortstate *state)
double crossover;
bool useSelection;
+#ifdef DEBUG_SORT
+ /* Honor GUC setting that disables the feature (for easy testing) */
+ if (!optimize_avoid_selection)
+ return true;
+#endif
+
/* For randomAccess callers, "quicksort with spillover" is never used */
if (state->randomAccess)
return false;
@@ -2523,6 +2538,12 @@ selectnewtape(Tuplesortstate *state)
static void
mergeruns(Tuplesortstate *state)
{
+#ifdef TRACE_SORT
+ bool multiwarned = !(multipass_warning || trace_sort);
+#else
+ bool multiwarned = !multipass_warning;
+#endif
+
int tapenum,
svTape,
svRuns,
@@ -2626,6 +2647,21 @@ mergeruns(Tuplesortstate *state)
/* Step D6: decrease level */
if (--state->Level == 0)
break;
+
+ if (!multiwarned)
+ {
+ int64 memNowUsed = state->allowedMem - state->availMem;
+
+ ereport(LOG,
+ (errmsg("a multi-pass external merge sort is required "
+ "(%ld kB memory used. %d tape maximum. %d levels)",
+ memNowUsed / 1024L, state->maxTapes, state->Level + 1),
+ errhint("Consider increasing the configuration parameter \"%s\".",
+ state->querySort ? "work_mem" : "maintenance_work_mem")));
+
+ multiwarned = true;
+ }
+
/* rewind output tape T to use as new input */
LogicalTapeRewind(state->tapeset, state->tp_tapenum[state->tapeRange],
false);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index dc167f9..1e1519a 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -272,6 +272,8 @@ extern int tcp_keepalives_count;
extern bool trace_sort;
#endif
+extern bool multipass_warning;
+
/*
* Functions exported by guc.c
*/
--
1.9.1
Attachment: 0002-Further-diminish-role-of-replacement-selection.patch (text/x-patch)
From f2486568558d4c2cd3ee59af024c3f450d6ba0fa Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Thu, 13 Aug 2015 14:32:32 -0700
Subject: [PATCH 2/5] Further diminish role of replacement selection
Tuplesort callers now provide a total row estimate hint, typically the
optimizer's own estimate. This is used to determine if replacement
selection will be viable even for the first run. Testing shows that the
major benefit of replacement selection is only that it may enable a
"quicksort with spillover", which is the sole remaining justification
for going with replacement selection for the first run. Even the cases
traditionally considered very sympathetic to replacement selection (e.g.
almost sorted input) do not appear to come out ahead on contemporary
hardware, so callers may not provide a physical/logical correlation
hint. There is surprisingly little reason to try replacement selection
in the event of a strong correlation.
Some of the best cases for a simple hybrid sort-merge strategy can only
be seen when replacement selection isn't even attempted before being
abandoned; replacement selection's tendency to produce longer runs is a
liability here rather than a benefit. This change significantly reduces
the frequency that replacement selection will even be attempted
(previously, it was always at least used for the first run).
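To make the new heuristic concrete, here is a rough worked example
using the constants that appear in useselection() below (the 64-byte
average tuple size is just an assumed figure):

  avgTupleSize = 64 bytes
  increments   = 64 / 32 = 2
  crossover    = 0.90 - 2 * 0.075 = 0.75   (already inside the [0.40, 0.85] clamp)

With that average tuple size, replacement selection is only attempted
if, at the point memory fills, memtuples already holds more than 75% of
the caller's estimated total row count -- that is, when a "quicksort
with spillover" looks plausible.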
---
src/backend/access/hash/hash.c | 2 +-
src/backend/access/hash/hashsort.c | 4 +-
src/backend/access/nbtree/nbtree.c | 11 +-
src/backend/access/nbtree/nbtsort.c | 10 +-
src/backend/catalog/index.c | 1 +
src/backend/commands/cluster.c | 4 +-
src/backend/executor/nodeAgg.c | 26 ++++-
src/backend/executor/nodeSort.c | 1 +
src/backend/utils/adt/orderedsetaggs.c | 13 ++-
src/backend/utils/sort/tuplesort.c | 182 +++++++++++++++++++++++++--------
src/include/access/hash.h | 3 +-
src/include/access/nbtree.h | 2 +-
src/include/executor/nodeAgg.h | 2 +
src/include/utils/tuplesort.h | 15 ++-
14 files changed, 214 insertions(+), 62 deletions(-)
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 24b06a5..8f71980 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -86,7 +86,7 @@ hashbuild(PG_FUNCTION_ARGS)
* one page.
*/
if (num_buckets >= (uint32) NBuffers)
- buildstate.spool = _h_spoolinit(heap, index, num_buckets);
+ buildstate.spool = _h_spoolinit(heap, index, num_buckets, reltuples);
else
buildstate.spool = NULL;
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index c67c057..5c7e137 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -44,7 +44,8 @@ struct HSpool
* create and initialize a spool structure
*/
HSpool *
-_h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
+_h_spoolinit(Relation heap, Relation index, uint32 num_buckets,
+ double reltuples)
{
HSpool *hspool = (HSpool *) palloc0(sizeof(HSpool));
uint32 hash_mask;
@@ -71,6 +72,7 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
index,
hash_mask,
maintenance_work_mem,
+ reltuples,
false);
return hspool;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..0957e0f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "optimizer/plancat.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -85,7 +86,9 @@ btbuild(PG_FUNCTION_ARGS)
Relation index = (Relation) PG_GETARG_POINTER(1);
IndexInfo *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
IndexBuildResult *result;
+ BlockNumber relpages;
double reltuples;
+ double allvisfrac;
BTBuildState buildstate;
buildstate.isUnique = indexInfo->ii_Unique;
@@ -100,6 +103,9 @@ btbuild(PG_FUNCTION_ARGS)
ResetUsage();
#endif /* BTREE_BUILD_STATS */
+ /* Estimate the number of rows currently present in the table */
+ estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
+
/*
* We expect to be called exactly once for any index relation. If that's
* not the case, big trouble's what we have.
@@ -108,14 +114,15 @@ btbuild(PG_FUNCTION_ARGS)
elog(ERROR, "index \"%s\" already contains data",
RelationGetRelationName(index));
- buildstate.spool = _bt_spoolinit(heap, index, indexInfo->ii_Unique, false);
+ buildstate.spool = _bt_spoolinit(heap, index, indexInfo->ii_Unique, false,
+ reltuples);
/*
* If building a unique index, put dead tuples in a second spool to keep
* them out of the uniqueness check.
*/
if (indexInfo->ii_Unique)
- buildstate.spool2 = _bt_spoolinit(heap, index, false, true);
+ buildstate.spool2 = _bt_spoolinit(heap, index, false, true, reltuples);
/* do the heap scan */
reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..0d4a5ea 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -149,7 +149,8 @@ static void _bt_load(BTWriteState *wstate,
* create and initialize a spool structure
*/
BTSpool *
-_bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
+_bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead,
+ double reltuples)
{
BTSpool *btspool = (BTSpool *) palloc0(sizeof(BTSpool));
int btKbytes;
@@ -165,10 +166,15 @@ _bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
* unique index actually requires two BTSpool objects. We expect that the
* second one (for dead tuples) won't get very full, so we give it only
* work_mem.
+ *
+ * reltuples hint does not account for factors like whether or not this is
+ * a partial index, or if this is the second BTSpool object, because it seems
+ * more conservative to estimate high.
*/
btKbytes = isdead ? work_mem : maintenance_work_mem;
btspool->sortstate = tuplesort_begin_index_btree(heap, index, isunique,
- btKbytes, false);
+ btKbytes, reltuples,
+ false);
return btspool;
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..88ee81d 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2835,6 +2835,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
state.tuplesort = tuplesort_begin_datum(TIDOID, TIDLessOperator,
InvalidOid, false,
maintenance_work_mem,
+ ivinfo.num_heap_tuples,
false);
state.htups = state.itups = state.tups_inserted = 0;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..23f6459 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -891,7 +891,9 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
/* Set up sorting if wanted */
if (use_sort)
tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
- maintenance_work_mem, false);
+ maintenance_work_mem,
+ OldHeap->rd_rel->reltuples,
+ false);
else
tuplesort = NULL;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2e36855..f580cca 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -520,6 +520,7 @@ initialize_phase(AggState *aggstate, int newphase)
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ sortnode->plan.plan_rows,
false);
}
@@ -588,7 +589,8 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
pertrans->sortOperators[0],
pertrans->sortCollations[0],
pertrans->sortNullsFirst[0],
- work_mem, false);
+ work_mem, agg_input_rows(aggstate),
+ false);
else
pertrans->sortstates[aggstate->current_set] =
tuplesort_begin_heap(pertrans->evaldesc,
@@ -597,7 +599,8 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
- work_mem, false);
+ work_mem, agg_input_rows(aggstate),
+ false);
}
/*
@@ -1439,6 +1442,25 @@ find_hash_columns(AggState *aggstate)
}
/*
+ * Estimate the number of rows input to the sorter.
+ *
+ * Exported for use by ordered-set aggregates.
+ */
+double
+agg_input_rows(AggState *aggstate)
+{
+ Plan *outerNode;
+
+ /*
+ * Get information about the size of the relation to be sorted (it's the
+ * "outer" subtree of this node)
+ */
+ outerNode = outerPlanState(aggstate)->plan;
+
+ return outerNode->plan_rows;
+}
+
+/*
* Estimate per-hash-table-entry overhead for the planner.
*
* Note that the estimate does not include space for pass-by-reference
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index af1dccf..e4b1104 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -89,6 +89,7 @@ ExecSort(SortState *node)
plannode->collations,
plannode->nullsFirst,
work_mem,
+ plannode->plan.plan_rows,
node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 39ed85b..b51a945 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -20,6 +20,7 @@
#include "catalog/pg_operator.h"
#include "catalog/pg_type.h"
#include "executor/executor.h"
+#include "executor/nodeAgg.h"
#include "miscadmin.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/tlist.h"
@@ -103,6 +104,7 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
{
OSAPerGroupState *osastate;
OSAPerQueryState *qstate;
+ AggState *aggstate;
MemoryContext gcontext;
MemoryContext oldcontext;
@@ -117,8 +119,11 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
/*
* We keep a link to the per-query state in fn_extra; if it's not there,
* create it, and do the per-query setup we need.
+ *
+ * aggstate is used to get hint on total number of tuples for tuplesort.
*/
qstate = (OSAPerQueryState *) fcinfo->flinfo->fn_extra;
+ aggstate = (AggState *) fcinfo->context;
if (qstate == NULL)
{
Aggref *aggref;
@@ -276,13 +281,17 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
- work_mem, false);
+ work_mem,
+ agg_input_rows(aggstate),
+ false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
qstate->sortCollation,
qstate->sortNullsFirst,
- work_mem, false);
+ work_mem,
+ agg_input_rows(aggstate),
+ false);
osastate->number_of_rows = 0;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index fc4ac90..6d766d2 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -13,11 +13,13 @@
* See Knuth, volume 3, for more than you want to know about the external
* sorting algorithm. We divide the input into sorted runs using replacement
* selection, in the form of a priority tree implemented as a heap
- * (essentially his Algorithm 5.2.3H -- although that strategy can be
- * abandoned where it does not appear to help), then merge the runs using
- * polyphase merge, Knuth's Algorithm 5.4.2D. The logical "tapes" used by
- * Algorithm D are implemented by logtape.c, which avoids space wastage by
- * recycling disk space as soon as each block is read from its "tape".
+ * (essentially his Algorithm 5.2.3H -- although that strategy is often
+ * avoided altogether), then merge the runs using polyphase merge, Knuth's
+ * Algorithm 5.4.2D. The logical "tapes" used by Algorithm D are
+ * implemented by logtape.c, which avoids space wastage by recycling disk
+ * space as soon as each block is read from its "tape". Note that a hybrid
+ * sort-merge strategy is usually used in practice, because maintaining a
+ * priority tree/heap is expensive.
*
* We do not form the initial runs using Knuth's recommended replacement
* selection data structure (Algorithm 5.4.1R), because it uses a fixed
@@ -108,10 +110,13 @@
* If, having maintained a replacement selection priority queue (heap) for
* the first run it transpires that there will be multiple on-tape runs
* anyway, we abandon treating memtuples as a heap, and quicksort and write
- * in memtuples-sized batches. This gives us most of the advantages of
- * always quicksorting and batch dumping runs, which can perform much better
- * than heap sorting and incrementally spilling tuples, without giving up on
- * replacement selection in cases where it remains compelling.
+ * in memtuples-sized batches. This allows a "quicksort with spillover" to
+ * occur, but that remains about the only truly compelling case for
+ * replacement selection. Callers provide a hint for the total number of
+ * rows, used to avoid replacement selection when a "quicksort with
+ * spillover" is not anticipated -- see useselection(). A hybrid sort-merge
+ * strategy can be much faster for very large inputs when replacement
+ * selection is never attempted.
*
*
* Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
@@ -245,6 +250,7 @@ struct Tuplesortstate
{
TupSortStatus status; /* enumerated value as shown above */
int nKeys; /* number of columns in sort key */
+ double rowNumHint; /* caller's hint of total # of rows */
bool randomAccess; /* did caller request random access? */
bool bounded; /* did caller specify a maximum number of
* tuples to return? */
@@ -313,7 +319,9 @@ struct Tuplesortstate
/*
* While building initial runs, this indicates if the replacement
* selection strategy or simple hybrid sort-merge strategy is in use.
- * Replacement selection is abandoned after first run.
+ * Replacement selection may be determined to not be effective ahead of
+ * time, based on a caller-supplied hint. Otherwise, it is abandoned
+ * after first run.
*/
bool replaceActive;
@@ -505,9 +513,11 @@ struct Tuplesortstate
} while(0)
-static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
+static Tuplesortstate *tuplesort_begin_common(int workMem, double rowNumHint,
+ bool randomAccess);
static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
static bool consider_abort_common(Tuplesortstate *state);
+static bool useselection(Tuplesortstate *state);
static void inittapes(Tuplesortstate *state);
static void selectnewtape(Tuplesortstate *state);
static void mergeruns(Tuplesortstate *state);
@@ -584,12 +594,14 @@ static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
* Each variant of tuplesort_begin has a workMem parameter specifying the
* maximum number of kilobytes of RAM to use before spilling data to disk.
* (The normal value of this parameter is work_mem, but some callers use
- * other values.) Each variant also has a randomAccess parameter specifying
- * whether the caller needs non-sequential access to the sort result.
+ * other values.) Each variant also has a hint parameter of the total
+ * number of rows to be sorted, and a randomAccess parameter specifying
+ * whether the caller needs non-sequential access to the sort result. Since
+ * rowNumHint is just a hint, it's acceptable for it to be zero or negative.
*/
static Tuplesortstate *
-tuplesort_begin_common(int workMem, bool randomAccess)
+tuplesort_begin_common(int workMem, double rowNumHint, bool randomAccess)
{
Tuplesortstate *state;
MemoryContext sortcontext;
@@ -619,6 +631,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
#endif
state->status = TSS_INITIAL;
+ state->rowNumHint = rowNumHint;
state->randomAccess = randomAccess;
state->bounded = false;
state->boundUsed = false;
@@ -664,9 +677,11 @@ tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
- int workMem, bool randomAccess)
+ int workMem, double rowNumHint,
+ bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+ randomAccess);
MemoryContext oldcontext;
int i;
@@ -734,9 +749,11 @@ tuplesort_begin_heap(TupleDesc tupDesc,
Tuplesortstate *
tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
- int workMem, bool randomAccess)
+ int workMem,
+ double rowNumHint, bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+ randomAccess);
ScanKey indexScanKey;
MemoryContext oldcontext;
int i;
@@ -827,9 +844,11 @@ Tuplesortstate *
tuplesort_begin_index_btree(Relation heapRel,
Relation indexRel,
bool enforceUnique,
- int workMem, bool randomAccess)
+ int workMem,
+ double rowNumHint, bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+ randomAccess);
ScanKey indexScanKey;
MemoryContext oldcontext;
int i;
@@ -902,9 +921,11 @@ Tuplesortstate *
tuplesort_begin_index_hash(Relation heapRel,
Relation indexRel,
uint32 hash_mask,
- int workMem, bool randomAccess)
+ int workMem,
+ double rowNumHint, bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+ randomAccess);
MemoryContext oldcontext;
oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -937,9 +958,10 @@ tuplesort_begin_index_hash(Relation heapRel,
Tuplesortstate *
tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
bool nullsFirstFlag,
- int workMem, bool randomAccess)
+ int workMem, double rowNumHint, bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+ randomAccess);
MemoryContext oldcontext;
int16 typlen;
bool typbyval;
@@ -2270,6 +2292,73 @@ tuplesort_merge_order(int64 allowedMem)
}
/*
+ * useselection - determine if one replacement selection run should be
+ * attempted.
+ *
+ * This is called when we just ran out of memory, and must consider costs
+ * and benefits of replacement selection for first run, which can result in
+ * a "quicksort with spillover". Note that replacement selection is always
+ * abandoned after the first run.
+ */
+static bool
+useselection(Tuplesortstate *state)
+{
+ int64 memNowUsed = state->allowedMem - state->availMem;
+ double avgTupleSize;
+ int increments;
+ double crossover;
+ bool useSelection;
+
+ /* For randomAccess callers, "quicksort with spillover" is never used */
+ if (state->randomAccess)
+ return false;
+
+ /*
+ * The crossover point is somewhere between memtuples holding 40% of the
+ * total tuples to sort and holding all but one of them. This weighs
+ * approximate savings in I/O against generic heap sorting cost.
+ */
+ avgTupleSize = (double) memNowUsed / (double) state->memtupsize;
+
+ /*
+ * Starting from a threshold of 90%, refund 7.5% per 32 byte
+ * average-size-increment.
+ */
+ increments = MAXALIGN_DOWN((int) avgTupleSize) / 32;
+ crossover = 0.90 - (increments * 0.075);
+
+ /*
+ * Clamp, making either outcome possible regardless of average size.
+ *
+ * 40% is about the minimum point at which "quicksort with spillover"
+ * can still occur without a logical/physical correlation.
+ */
+ crossover = Max(0.40, Min(crossover, 0.85));
+
+ /*
+ * The point where the overhead of maintaining the heap invariant is
+ * likely to dominate over any saving in I/O is somewhat arbitrarily
+ * assumed to be the point where memtuples' size exceeds MaxAllocSize
+ * (note that overall memory consumption may be far greater). Past this
+ * point, only the most compelling cases use replacement selection for
+ * their first run.
+ */
+ if (sizeof(SortTuple) * state->memtupcount > MaxAllocSize)
+ crossover = avgTupleSize > 32 ? 0.90 : 0.95;
+
+ useSelection = state->memtupcount > state->rowNumHint * crossover;
+
+#ifdef TRACE_SORT
+ if (trace_sort)
+ elog(LOG, "%s in use from row %d with %.2f total rows %.3f crossover",
+ useSelection? "replacement selection" : "hybrid sort-merge",
+ state->memtupcount, state->rowNumHint, crossover);
+#endif
+
+ return useSelection;
+}
+
+/*
* inittapes - initialize for tape sorting.
*
* This is called only if we have found we don't have room to sort in memory.
@@ -2278,7 +2367,6 @@ static void
inittapes(Tuplesortstate *state)
{
int maxTapes,
- ntuples,
j;
int64 tapeSpace;
@@ -2337,32 +2425,38 @@ inittapes(Tuplesortstate *state)
state->tp_tapenum = (int *) palloc0(maxTapes * sizeof(int));
/*
- * Give replacement selection a try. There will be a switch to a simple
- * hybrid sort-merge strategy after the first run (iff there is to be a
- * second on-tape run).
+ * Give replacement selection a try when number of tuples to be sorted
+ * has a reasonable chance of enabling a "quicksort with spillover".
+ * There will be a switch to a simple hybrid sort-merge strategy after
+ * the first run (iff there is to be a second on-tape run).
*/
- state->replaceActive = true;
+ state->replaceActive = useselection(state);
state->cached = false;
state->just_memtuples = false;
- /*
- * Convert the unsorted contents of memtuples[] into a heap. Each tuple is
- * marked as belonging to run number zero.
- *
- * NOTE: we pass false for checkIndex since there's no point in comparing
- * indexes in this step, even though we do intend the indexes to be part
- * of the sort key...
- */
- ntuples = state->memtupcount;
- state->memtupcount = 0; /* make the heap empty */
- for (j = 0; j < ntuples; j++)
+ if (state->replaceActive)
{
- /* Must copy source tuple to avoid possible overwrite */
- SortTuple stup = state->memtuples[j];
+ /*
+ * Convert the unsorted contents of memtuples[] into a heap. Each
+ * tuple is marked as belonging to run number zero.
+ *
+ * NOTE: we pass false for checkIndex since there's no point in
+ * comparing indexes in this step, even though we do intend the
+ * indexes to be part of the sort key...
+ */
+ int ntuples = state->memtupcount;
- tuplesort_heap_insert(state, &stup, 0, false);
+ state->memtupcount = 0; /* make the heap empty */
+
+ for (j = 0; j < ntuples; j++)
+ {
+ /* Must copy source tuple to avoid possible overwrite */
+ SortTuple stup = state->memtuples[j];
+
+ tuplesort_heap_insert(state, &stup, 0, false);
+ }
+ Assert(state->memtupcount == ntuples);
}
- Assert(state->memtupcount == ntuples);
state->currentRun = 0;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 97cb859..95acc1d 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -335,7 +335,8 @@ extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
/* hashsort.c */
typedef struct HSpool HSpool; /* opaque struct in hashsort.c */
-extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets);
+extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets,
+ double reltuples);
extern void _h_spooldestroy(HSpool *hspool);
extern void _h_spool(HSpool *hspool, ItemPointer self,
Datum *values, bool *isnull);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9e48efd..5504b7b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -743,7 +743,7 @@ extern void BTreeShmemInit(void);
typedef struct BTSpool BTSpool; /* opaque type known only within nbtsort.c */
extern BTSpool *_bt_spoolinit(Relation heap, Relation index,
- bool isunique, bool isdead);
+ bool isunique, bool isdead, double reltuples);
extern void _bt_spooldestroy(BTSpool *btspool);
extern void _bt_spool(BTSpool *btspool, ItemPointer self,
Datum *values, bool *isnull);
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index fe3b81a..e6144f2 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -21,6 +21,8 @@ extern TupleTableSlot *ExecAgg(AggState *node);
extern void ExecEndAgg(AggState *node);
extern void ExecReScanAgg(AggState *node);
+extern double agg_input_rows(AggState *aggstate);
+
extern Size hash_agg_entry_size(int numAggs);
extern Datum aggregate_dummy(PG_FUNCTION_ARGS);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 3679815..11a5fb7 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -62,22 +62,27 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
- int workMem, bool randomAccess);
+ int workMem,
+ double rowNumHint, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
- int workMem, bool randomAccess);
+ int workMem,
+ double rowNumHint, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
Relation indexRel,
bool enforceUnique,
- int workMem, bool randomAccess);
+ int workMem,
+ double rowNumHint, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
Relation indexRel,
uint32 hash_mask,
- int workMem, bool randomAccess);
+ int workMem,
+ double rowNumHint, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
Oid sortOperator, Oid sortCollation,
bool nullsFirstFlag,
- int workMem, bool randomAccess);
+ int workMem,
+ double rowNumHint, bool randomAccess);
extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
--
1.9.1
On Thu, Aug 20, 2015 at 3:24 AM, Peter Geoghegan <pg@heroku.com> wrote:
I believe, in general, that we should consider a multi-pass sort to be
a kind of inherently suspect thing these days, in the same way that
checkpoints occurring 5 seconds apart are: not actually abnormal, but
something that we should regard suspiciously. Can you really not
afford enough work_mem to only do one pass? Does it really make sense
to add far more I/O and CPU costs to avoid that other tiny memory
capacity cost?
I think this is the crux of the argument. And I think you're
basically, but not entirely, right.
The key metric there is not how cheap memory has gotten but rather
what the ratio is between the system's memory and disk storage. The
use case I think you're leaving out is the classic "data warehouse"
with huge disk arrays attached to a single host running massive
queries for hours. In that case reducing the number of runs will reduce
I/O requirements directly, and halving the amount of I/O a sort takes
will halve the time it takes regardless of CPU efficiency. And I have a
suspicion that typical data distributions get much better than a 2x
speedup.
But I think you're basically right that this is the wrong use case to
worry about for most users. Even those users that do have large batch
queries are probably not processing so much that they should be doing
multiple passes. The ones that do are probably more interested in
parallel query, federated databases, column stores, and so on rather
than worrying about just how many hours it takes to sort their
multiple terabytes on a single processor.
I am quite suspicious of quicksort though. It has an O(n^2) worst case,
and I think it's only a matter of time before people start worrying
about DoS attacks from users able to influence the data ordering. It's
also not very suitable for GPU processing. Quicksort gets most of its
advantage from cache efficiency; it isn't a super efficient algorithm
otherwise. Are there not other cache-efficient algorithms to consider?
Alternately, has anyone tested whether Timsort would work well?
--
greg
Greg Stark <stark@mit.edu> writes:
Alternately, has anyone tested whether Timsort would work well?
I think that was proposed a few years ago and did not look so good
in simple testing.
regards, tom lane
On 20 August 2015 at 03:24, Peter Geoghegan <pg@heroku.com> wrote:
The patch is ~3.25x faster than master
I've tried to read this post twice and both times my work_mem overflowed.
;-)
Can you summarize what this patch does? I understand clearly what it
doesn't do...
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 20, 2015 at 6:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Greg Stark <stark@mit.edu> writes:
Alternately, has anyone tested whether Timsort would work well?
I think that was proposed a few years ago and did not look so good
in simple testing.
I tested it in 2012. I got as far as writing a patch.
Timsort is very good where comparisons are expensive -- that's why
it's especially compelling when your comparator is written in Python.
However, when testing it with text, even though there were
significantly fewer comparisons, it was still slower than quicksort.
Quicksort is cache oblivious, and that's an enormous advantage. This
was before abbreviated keys; these days, the difference must be
larger.
--
Peter Geoghegan
On Thu, Aug 20, 2015 at 8:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 20 August 2015 at 03:24, Peter Geoghegan <pg@heroku.com> wrote:
The patch is ~3.25x faster than master
I've tried to read this post twice and both times my work_mem overflowed.
;-)
Can you summarize what this patch does? I understand clearly what it
doesn't do...
The most important thing that it does is always quicksort runs, which
are formed by simply filling work_mem with tuples in no particular
order, rather than trying to make runs that are, on average, twice as
large as work_mem. That's what the ~3.25x improvement concerned.
That's actually a significantly simpler algorithm than replacement
selection, and appears to be much faster. You might even say that it's
a dumb algorithm, because it is less sophisticated than replacement
selection. However, replacement selection tends to use CPU caches very
poorly, while its traditional advantages have become dramatically less
important due to large main memory sizes in particular. Also, it hurts
that we don't currently dump tuples in batches, for several reasons.
It's better to do memory-intensive operations in batches, rather than
in one huge inner loop, in order to minimize or prevent instruction
cache misses, and batching lets us take better advantage of
asynchronous I/O.
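To make that concrete, here is a small, self-contained C sketch of the
"dumb" run formation strategy (this is not tuplesort.c: the input
array, the pretend four-tuple work_mem and printing runs to stdout are
all stand-ins for the real machinery). Each run is exactly one
buffer-load, sorted in one shot, with no heap maintained per tuple:

#include <stdio.h>
#include <stdlib.h>

typedef int Tuple;              /* stand-in for a real SortTuple */

static int
tuple_cmp(const void *a, const void *b)
{
    Tuple   x = *(const Tuple *) a;
    Tuple   y = *(const Tuple *) b;

    return (x > y) - (x < y);
}

int
main(void)
{
    Tuple   input[] = {42, 7, 99, 3, 18, 56, 23, 81, 64, 12};
    int     ninput = sizeof(input) / sizeof(input[0]);
    Tuple   buf[4];             /* pretend work_mem holds 4 tuples */
    int     bufsize = 4;
    int     consumed = 0;
    int     runno = 0;

    while (consumed < ninput)
    {
        int     n = 0;
        int     i;

        /* fill the buffer with input tuples in no particular order */
        while (n < bufsize && consumed < ninput)
            buf[n++] = input[consumed++];

        /* quicksort the whole batch at once, then dump it as a single run */
        qsort(buf, n, sizeof(Tuple), tuple_cmp);
        printf("run %d:", runno++);
        for (i = 0; i < n; i++)
            printf(" %d", buf[i]);
        printf("\n");
    }
    return 0;
}

Replacement selection would instead push every input tuple through a
binary heap and trickle out the smallest eligible tuple one at a time,
which is exactly the per-tuple, cache-unfriendly work that the patch
avoids in the common case.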
The complicated aspect of considering the patch is whether or not it's
okay to not use replacement selection anymore -- is that an
appropriate trade-off?
The reason that the code has not actually been simplified by this
patch is that I still want to use replacement selection for one
specific case: when it is anticipated that a "quicksort with
spillover" can occur, which is only possible with incremental
spilling. That may avoid most I/O, by spilling just a few tuples using
a heap/priority queue, and quicksorting everything else. That's
compelling when you can manage it, but it's no reason to always use
replacement selection for the first run in the common case where there
will be several runs in total.
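As a rough worked example of why that case is compelling (the numbers
are illustrative only): if work_mem can hold about 90% of the input,
the heap only has to spill roughly the lowest 10% of tuples to tape;
the remaining 90% are then quicksorted in memory and merged with that
one small on-tape run as tuples are returned, so most tuples are never
written out at all.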
Is that any clearer? To borrow a phrase from the processor
architecture community, from a high level this is a "Brainiac versus
Speed Demon" [1]http://www.lighterra.com/papers/modernmicroprocessors/#thebrainiacdebate -- Peter Geoghegan trade-off. (I wish that there was a widely accepted
name for this trade-off.)
[1]: http://www.lighterra.com/papers/modernmicroprocessors/#thebrainiacdebate -- Peter Geoghegan
--
Peter Geoghegan
Hi, Peter,
Just some quick anecdotal evidence. I did a similar experiment about three
years ago. The conclusion was that if you have an SSD, just do a quicksort
and forget the longer runs, but if you are using hard drives, longer runs
are the winner (and safer, to avoid cliffs). I did not experiment with
RAID0/5 on many spindles though.
Not limited to sort: more generally, SSD is different enough from HDD
that it may be worth the effort for the backend to "guess" what storage
device it has, and then choose the right thing to do.
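For what it's worth, on Linux the block layer already exposes a hint that a
"guess" like that could read. The sketch below is only an illustration of
the idea (nothing like it exists in the backend today); it relies on the
sysfs rotational flag, which reports 1 for spinning media:

#include <stdio.h>

/* Returns 1 for rotational (HDD), 0 for non-rotational (SSD/NVMe), -1 if unknown */
static int
guess_rotational(const char *blockdev)      /* e.g. "sda" */
{
    char    path[128];
    FILE   *f;
    int     c;

    snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", blockdev);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;              /* not Linux, no such device, ... */
    c = fgetc(f);
    fclose(f);
    if (c == '1')
        return 1;
    if (c == '0')
        return 0;
    return -1;
}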
Cheers.
On Thu, Aug 20, 2015 at 12:42 PM, Feng Tian <ftian@vitessedata.com> wrote:
Just some quick anecdotal evidence. I did a similar experiment about three
years ago. The conclusion was that if you have an SSD, just do a quicksort
and forget the longer runs, but if you are using hard drives, longer runs
are the winner (and safer, to avoid cliffs). I did not experiment with
RAID0/5 on many spindles though.
Not limited to sort: more generally, SSD is different enough from HDD
that it may be worth the effort for the backend to "guess" what storage
device it has, and then choose the right thing to do.
The devil is in the details. I cannot really comment on such a general
statement.
I would be willing to believe that that's true under
unrealistic/unrepresentative conditions. Specifically, when multiple
passes are required with a sort-merge strategy where that isn't the
case with replacement selection. This could happen with a tiny
work_mem setting (tiny in an absolute sense more than a relative
sense). With an HDD, where sequential I/O is so much faster, this
could be enough to make replacement selection win, just as it would
have in the 1970s with magnetic tapes.
As I've said, the solution is to simply avoid multiple passes, which
should be possible in virtually all cases because of the quadratic
growth in a classic hybrid sort-merge strategy's capacity to avoid
multiple passes (growth relative to work_mem's growth). Once you
ensure that, then you probably have a mostly I/O bound workload, which
can be made faster by adding sequential I/O capacity (or, on the
Postgres internals side, adding asynchronous I/O, or with memory
prefetching). You cannot really buy a faster CPU to make a degenerate
heapsort faster.
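To put rough numbers on that quadratic growth, here is a back-of-the-envelope
calculation (my arithmetic, not code from the patch; the 256KB read buffer
per tape is an assumed figure, purely for illustration):

#include <stdio.h>

/*
 * With work_mem M we form runs of roughly M bytes, and the final merge can
 * keep roughly M / B tapes fed if each tape needs a read buffer of B bytes,
 * so the largest input that still sorts in one merge pass is about
 * M * (M / B) -- quadratic in M.
 */
int
main(void)
{
    const double buf_per_tape = 256.0 * 1024;   /* assumed per-tape read buffer */
    double       work_mem;

    for (work_mem = 16e6; work_mem <= 1024e6; work_mem *= 2)
        printf("work_mem %5.0f MB -> single-pass capacity ~%9.1f GB\n",
               work_mem / 1e6,
               (work_mem / buf_per_tape) * (work_mem / 1e9));
    return 0;
}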
--
Peter Geoghegan
On Thu, Aug 20, 2015 at 1:16 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Thu, Aug 20, 2015 at 12:42 PM, Feng Tian <ftian@vitessedata.com> wrote:
Just some quick anecdotal evidence. I did a similar experiment about three
years ago. The conclusion was that if you have an SSD, just do a quicksort
and forget the longer runs, but if you are using hard drives, longer runs
are the winner (and safer, to avoid cliffs). I did not experiment with
RAID0/5 on many spindles though.
Not limited to sort: more generally, SSD is different enough from HDD
that it may be worth the effort for the backend to "guess" what storage
device it has, and then choose the right thing to do.
The devil is in the details. I cannot really comment on such a general
statement.
I would be willing to believe that that's true under
unrealistic/unrepresentative conditions. Specifically, when multiple
passes are required with a sort-merge strategy where that isn't the
case with replacement selection. This could happen with a tiny
work_mem setting (tiny in an absolute sense more than a relative
sense). With an HDD, where sequential I/O is so much faster, this
could be enough to make replacement selection win, just as it would
have in the 1970s with magnetic tapes.
As I've said, the solution is to simply avoid multiple passes, which
should be possible in virtually all cases because of the quadratic
growth in a classic hybrid sort-merge strategy's capacity to avoid
multiple passes (growth relative to work_mem's growth). Once you
ensure that, then you probably have a mostly I/O bound workload, which
can be made faster by adding sequential I/O capacity (or, on the
Postgres internals side, adding asynchronous I/O, or with memory
prefetching). You cannot really buy a faster CPU to make a degenerate
heapsort faster.
--
Peter Geoghegan
Agree with everything in principle, except one thing -- no, random IO on
HDD in the 2010s (relative to CPU/memory/SSD) is not any faster than tape
in the 1970s. :-)
On Thu, Aug 20, 2015 at 1:28 PM, Feng Tian <ftian@vitessedata.com> wrote:
Agree with everything in principle, except one thing -- no, random IO on
HDD in the 2010s (relative to CPU/memory/SSD) is not any faster than tape
in the 1970s. :-)
Sure. The advantage of replacement selection could be a deciding
factor in unrepresentative cases, as I mentioned, but even then it's
not going to be as dramatic a difference as it would have been in the
past.
By the way, please don't top-post.
--
Peter Geoghegan
On Thu, Aug 20, 2015 at 6:05 AM, Greg Stark <stark@mit.edu> wrote:
On Thu, Aug 20, 2015 at 3:24 AM, Peter Geoghegan <pg@heroku.com> wrote:
I believe, in general, that we should consider a multi-pass sort to be
a kind of inherently suspect thing these days, in the same way that
checkpoints occurring 5 seconds apart are: not actually abnormal, but
something that we should regard suspiciously. Can you really not
afford enough work_mem to only do one pass? Does it really make sense
to add far more I/O and CPU costs to avoid that other tiny memory
capacity cost?
I think this is the crux of the argument. And I think you're
basically, but not entirely, right.
I agree that that's the crux of my argument. I disagree about my not
being entirely right. :-)
The key metric there is not how cheap memory has gotten but rather
what the ratio is between the system's memory and disk storage. The
use case I think you're leaving out is the classic "data warehouse"
with huge disk arrays attached to a single host running massive
queries for hours. In that case reducing run size will reduce I/O
requirements directly and halving the amount of I/O sort takes will
halve the time it takes regardless of cpu efficiency. And I have a
suspicion typical data distributions get much better than a 2x
speedup.
It could reduce seek time, which might be the dominant cost (but not
I/O as such). I do accept that my argument did not really apply to
this case, but you seem to be making an additional non-conflicting
argument that certain data warehousing cases would be helped in
another way by my patch. My argument was only about multi-gigabyte
cases that I tested that were significantly improved, primarily due to
CPU caching effects. If this helps with extremely large sorts that do
require multiple passes by reducing seek time -- I think that they'd
have to be multi-terabyte sorts, which I am ill-equipped to test --
then so much the better, I suppose.
In any case, as I've said the way we allow run size to be dictated
only by available memory (plus whatever replacement selection can do
to make on-tape runs longer) is bogus. In the future there should be a
cost model for an optimal run size, too.
But I think you're basically right that this is the wrong use case to
worry about for most users. Even those users that do have large batch
queries are probably not processing so much that they should be doing
multiple passes. The ones that do are probably more interested in
parallel query, federated databases, column stores, and so on rather
than worrying about just how many hours it takes to sort their
multiple terabytes on a single processor.
I suppose so. If you can afford multiple terabytes of storage, you can
probably still afford gigabytes of memory to do a single pass. My
laptop is almost 3 years old, weighs about 1.5 Kg, and has 16 GiB of
memory. It's usually that simple, and not really because we
assume that Postgres doesn't have to deal with multi-terabyte sorts.
Maybe I lack perspective, having never really dealt with a real data
warehouse. I didn't mean to imply that in no circumstances could
anyone profit from a multi-pass sort. If you're using Hadoop or
something, I imagine that it still makes sense.
In general, I think you'll agree that we should strongly leverage the
fact that a multi-pass sort just isn't going to be needed when things
are set up correctly under standard operating conditions nowadays.
I am quite suspicious of quicksort though. It has O(n^2) worst case
and I think it's only a matter of time before people start worrying
about DOS attacks from users able to influence the data ordering. It's
also not very suitable for GPU processing. Quicksort gets most of its
advantage from cache efficiency; it isn't a super efficient algorithm
otherwise. Are there not other cache-efficient algorithms to consider?
I think that high quality quicksort implementations [1] will continue
to be the way to go for sorting integers internally at the very least.
Practically speaking, problems with the worst case performance have
been completely ironed out since the early 1990s. I think it's
possible to DOS Postgres by artificially introducing a worst-case, but
it's very unlikely to be the easiest way of doing that in practice. I
admit that it's probably the coolest way, though.
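For illustration, the kind of safeguard such implementations rely on is
median-of-three (or ninther) pivot selection plus three-way partitioning of
equal keys, which is what makes crafting a quadratic-time input so hard in
practice. A standalone sketch of the pivot-sampling part (my own, not the
pg_qsort code):

/* Index of the median of a[i], a[j], a[k]; used to pick a pivot sample */
static int
med3(const int *a, int i, int j, int k)
{
    if (a[i] < a[j])
        return a[j] < a[k] ? j : (a[i] < a[k] ? k : i);
    else
        return a[i] < a[k] ? i : (a[j] < a[k] ? k : j);
}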
I think that the benefits of offloading sorting to the GPU are not in
evidence today. This may be especially true of a "street legal"
implementation that takes into account all of the edge cases, as
opposed to a hand customized thing for sorting uniformly distributed
random integers. GPU sorts tend to use radix sort, and I just can't
see that catching on.
[1] https://www.cs.princeton.edu/~rs/talks/QuicksortIsOptimal.pdf
--
Peter Geoghegan
On Thu, Aug 20, 2015 at 11:16 PM, Peter Geoghegan <pg@heroku.com> wrote:
It could reduce seek time, which might be the dominant cost (but not
I/O as such).
No, I didn't quite follow the argument to completion. Increasing the
run size is a win if it reduces the number of passes. In the
single-pass case it has to read all the data once, write it all out to
tapes, then read it all back in again. So 3x the data. If it's still
not sorted it needs to write it all back out yet again and read it all
back in again. So 5x the data. If the tapes are larger it can avoid
that 66% increase in total I/O. In large data sets it can need 3, 4,
or maybe more passes through the data, and saving one pass would be a
smaller incremental difference. I haven't thought through the
exponential growth carefully enough to tell if doubling the run size
should decrease the number of passes linearly or by a constant number.
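Restating that arithmetic as a tiny program (nothing from the patch): run
formation reads and writes everything once, each merge pass before the final
one rewrites and rereads everything, and the final merge only rereads, so
total I/O volume is roughly (2 * passes + 1) times the input.

#include <stdio.h>

int
main(void)
{
    int passes;

    for (passes = 1; passes <= 4; passes++)
        printf("%d-pass sort: ~%dx the data moved through storage\n",
               passes, 2 * passes + 1);
    return 0;
}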
But you're right that seems to be less and less a realistic scenario.
Times when users are really processing data sets that large nowadays
they'll just throw it into Hadoop or BigQuery or whatever to get the
parallelism of many cpus. Or maybe Citus and the like.
The main case where I expect people actually run into this is in
building indexes, especially for larger data types (which come to
think of it might be exactly where the comparison is expensive enough
that quicksort's cache efficiency isn't helpful).
But to do fair tests I would suggest you configure work_mem smaller
(since running tests on multi-terabyte data sets is a pain) and sort
some slower data types that don't fit in memory. Maybe arrays of text
or json?
--
greg
On Thu, Aug 20, 2015 at 5:02 PM, Greg Stark <stark@mit.edu> wrote:
I haven't thought through the exponential
growth carefully enough to tell if doubling the run size should
decrease the number of passes linearly or by a constant number.
It seems that with 5 times the data that previously required ~30MB to
avoid a multi-pass sort (where ~2300MB is required for an internal
sort -- the benchmark query), it took ~60MB to avoid a multi-pass
sort. I didn't determine either threshold exactly, because that takes
too long to do, but it's consistent with the prediction that every
time the input size quadruples, the amount of work_mem needed to avoid
multiple passes only doubles. That will need to be verified more
rigorously, but it looks that way.
But you're right that seems to be less and less a realistic scenario.
Times when users are really processing data sets that large nowadays
they'll just throw it into Hadoop or BigQuery or whatever to get the
parallelism of many cpus. Or maybe Citus and the like.
I'm not sure that even that's generally true, simply because sorting a
huge amount of data is very expensive -- it's not really a "big data"
thing, so to speak. Look at recent results on this site:
Last year's winning "Gray" entrant, TritonSort, uses a huge parallel
cluster of 186 machines, but only sorts 100TB. That's just over 500GB
per node. Each node is a 32 core Intel Xeon EC2 instance with 244GB
memory, and lots of SSDs. It seems like the point of the 100TB minimum
rule in the "Gray" contest category is that that's practically
impossible to fit entirely in memory (to avoid merging).
Eventually, linearithmic growth becomes extremely painful, no matter
how much processing power you have. It takes a while, though.
--
Peter Geoghegan
On 20 August 2015 at 18:41, Peter Geoghegan <pg@heroku.com> wrote:
On Thu, Aug 20, 2015 at 8:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 20 August 2015 at 03:24, Peter Geoghegan <pg@heroku.com> wrote:
The patch is ~3.25x faster than master
I've tried to read this post twice and both times my work_mem overflowed. ;-)
Can you summarize what this patch does? I understand clearly what it doesn't
do...
The most important thing that it does is always quicksort runs, which
are formed by simply filling work_mem with tuples in no particular
order, rather than trying to make runs that are twice as large as
work_mem on average. That's what the ~3.25x improvement concerned.
That's actually a significantly simpler algorithm than replacement
selection, and appears to be much faster.
Then I think this is fine, not least because it seems like a first step
towards parallel sort.
This will give more runs, so merging those needs some thought. It will also
give a more predictable number of runs, so we'll be able to predict any
merging issues ahead of time. We can more easily find out the min/max tuple
in each run, so we only merge overlapping runs.
You might even say that it's
a dumb algorithm, because it is less sophisticated than replacement
selection. However, replacement selection tends to use CPU caches very
poorly, while its traditional advantages have become dramatically less
important due to large main memory sizes in particular. Also, it hurts
that we don't currently dump tuples in batches, for several reasons.
Better to do memory-intensive operations in batch, rather than having a
huge inner loop, in order to minimize or prevent instruction cache
misses. And we can better take advantage of asynchronous I/O.
The complicated aspect of considering the patch is whether or not it's
okay to not use replacement selection anymore -- is that an
appropriate trade-off?
Using a heapsort is known to be poor for large heaps. We previously
discussed the idea of quicksorting the first chunk of memory, then
reallocating the heap as a smaller chunk for the rest of the sort. That
would solve the cache miss problem.
I'd like to see some discussion of how we might integrate aggregation and
sorting. A heap might work quite well for that, whereas quicksort doesn't
sound like it would work as well.
The reason that the code has not actually been simplified by this
patch is that I still want to use replacement selection for one
specific case: when it is anticipated that a "quicksort with
spillover" can occur, which is only possible with incremental
spilling. That may avoid most I/O, by spilling just a few tuples using
a heap/priority queue, and quicksorting everything else. That's
compelling when you can manage it, but no reason to always use
replacement selection for the first run in the common case where there
will be several runs in total.
I think it's premature to retire that algorithm - I think we should keep it
for a while yet. I suspect it may serve well in cases where we have low
memory, though I accept that is no longer the case for larger servers that
we would now call typical.
This could cause particular issues in optimization, since heap sort is
wonderfully predictable. We'd need a cost_sort() that was slightly
pessimistic to cover the risk that a quicksort might not be as fast as we
hope.
Is that any clearer?
Yes, thank you.
I'd like to see a more general and concise plan for how sorting evolves. We
are close to having the infrastructure to perform intermediate aggregation,
which would allow that to happen during sorting when required (aggregation,
sort distinct). We also agreed some time back that parallel sorting would
be the first incarnation of parallel operations, so we need to consider
that also.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 20, 2015 at 11:56 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
This will give more runs, so merging those needs some thought. It will also
give a more predictable number of runs, so we'll be able to predict any
merging issues ahead of time. We can more easily find out the min/max tuple
in each run, so we only merge overlapping runs.
I think that merging runs can be optimized to reduce the number of
cache misses. Poul-Henning Kamp, the FreeBSD guy, has described
problems with binary heaps and cache misses [1], and I think we could
use his solution for merging. But we should definitely still quicksort
runs.
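For reference, the final merge's inner loop is essentially the following
sketch (hypothetical names, not the tuplesort.c code). The scattered binary
heap accesses are what [1] is about; the run contents themselves are read
strictly sequentially.

static void
merge_runs(RunReader *runs, int nruns, TupleSink *out)
{
    HeapEntry  *heap = build_heap(runs, nruns);     /* smallest tuple on top */
    int         heapsize = nruns;

    while (heapsize > 0)
    {
        emit_tuple(out, heap[0].tuple);             /* current global minimum */

        if (read_next(&runs[heap[0].run], &heap[0].tuple))
            sift_down(heap, heapsize, 0);           /* next tuple from same run */
        else
        {
            heap[0] = heap[--heapsize];             /* that run is exhausted */
            sift_down(heap, heapsize, 0);
        }
    }
}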
Using a heapsort is known to be poor for large heaps. We previously
discussed the idea of quicksorting the first chunk of memory, then
reallocating the heap as a smaller chunk for the rest of the sort. That
would solve the cache miss problem.
I'd like to see some discussion of how we might integrate aggregation and
sorting. A heap might work quite well for that, whereas quicksort doesn't
sound like it would work as well.
If you're talking about deduplicating within tuplesort, then there are
techniques. I don't know that that needs to be an up-front priority of
this work.
I think it's premature to retire that algorithm - I think we should keep it
for a while yet. I suspect it may serve well in cases where we have low
memory, though I accept that is no longer the case for larger servers that
we would now call typical.
I have given one case where I think the first run should still use
replacement selection: where that enables a "quicksort with
spillover". For that reason, I would consider that I have not actually
proposed to retire the algorithm. In principle, I agree with also
using it under any other circumstances where it is likely to be
appreciably faster, but it's just not in evidence that there is any
other such case. I did look at all the traditionally sympathetic
cases, as I went into, and it still seemed to not be worth it at all.
But by all means, if you think I missed something, please show me a
test case.
This could cause particular issues in optimization, since heap sort is
wonderfully predictable. We'd need a cost_sort() that was slightly
pessimistic to cover the risk that a quicksort might not be as fast as we
hope.
Wonderfully predictable? Really? It's totally sensitive to CPU cache
characteristics. I wouldn't say that at all. If you're alluding to the
quicksort worst case, that seems like the wrong thing to worry about.
The risk around that is often overstated, or based on experience from
third-rate implementations that don't follow various widely accepted
recommendations from the research community.
I'd like to see a more general and concise plan for how sorting evolves. We
are close to having the infrastructure to perform intermediate aggregation,
which would allow that to happen during sorting when required (aggregation,
sort distinct). We also agreed some time back that parallel sorting would be
the first incarnation of parallel operations, so we need to consider that
also.
I agree with everything you say here, I think. I think it's
appropriate that this work anticipate adding a number of other
optimizations in the future, at least including:
* Parallel sort using worker processes.
* Memory prefetching.
* Offset-value coding of runs, a compression technique that was used
in System R, IIRC. This can speed up merging a lot, and will save I/O
bandwidth on dumping out runs.
* Asynchronous I/O.
There should be an integrated approach to applying every possible
optimization, or at least leaving the possibility open. A lot of these
techniques are complementary. For example, there are significant
benefits where the "onlyKey" optimization is now used with external
sorts, which you get for free by using quicksort for runs. In short, I
am absolutely on-board with the idea that these things need to be
anticipated at the very least. For another speculative example, offset
coding makes the merge step cheaper, but the work of doing the offset
coding can be offloaded to worker processes, whereas the merge step
proper cannot really be effectively parallelized -- those two
techniques together are greater than the sum of their parts. One big
problem that I see with replacement selection is that it makes most of
these things impossible.
In general, I think that parallel sort should be an external sort
technique first and foremost. If you can only parallelize an internal
sort, then running out of road when there isn't enough memory to do
the sort in memory becomes a serious issue. Besides, you need to
partition the input anyway, and external sorting naturally needs to do
that while not precluding runs not actually being dumped to disk.
[1] http://queue.acm.org/detail.cfm?id=1814327
--
Peter Geoghegan
On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan <pg@heroku.com> wrote:
Let's start by comparing an external sort that uses 1/3 the memory of
an internal sort against the master branch. That's completely unfair
on the patch, of course, but it is a useful indicator of how well
external sorts do overall. Although an external sort surely cannot be
as fast as an internal sort, it might be able to approach an internal
sort's speed when there is plenty of I/O bandwidth. That's a good
thing to aim for, I think.
The patch only takes ~10% more time to execute this query, which seems
very good considering that ~1/3 the work_mem has been put to use.
Note that the on-tape runs are small relative to CPU costs, so this
query is a bit sympathetic (consider the time spent writing batches
that trace_sort indicates here). CREATE INDEX would not compare so
well with an internal sort, for example, especially if it was a
composite index or something.
This is something that I've made great progress on (see "concrete
example" below for numbers). The differences in the amount of I/O
required between these two cases (due to per-case variability in the
width of tuples written to tape for datum sorts and index sorts) did
not significantly factor in to the differences in performance, it
turns out. The big issue was that while a pass-by-value datum sort
accidentally has good cache characteristics during the merge step,
that is not generally true. I figured out a way of making it generally
true, though. I attach a revised patch series with a new commit that
adds an optimization to the merge step, relieving what was a big
remaining bottleneck in the CREATE INDEX case (and *every* external
sort case that isn't a pass-by-value datum sort, which is most
things). There are also a few tweaks to earlier commits, but nothing
very interesting.
All of my benchmarks suggest that this most recent revision puts
external sorting within a fairly small margin of a fully internal sort
on the master branch in many common cases. This difference is seen
when the implementation only makes use of a fraction of the memory
required for an internal sort, provided the system is reasonably well
balanced. For a single backend, there is an overhead of about 5% - 20%
against master's internal sort performance. This result appears to be
fairly robust across a variety of different cases.
I particularly care about CREATE INDEX, since that is where most pain
is felt in the real world, and I'm happy that I found a way to make
CREATE INDEX external sort reasonably comparable in run time to
internal sorts that consume much more memory. I think it's time to
stop talking about this as performance work, and start talking about
it as scalability work. With that in mind, I'm mostly going to compare
the performance of the new, optimized external sort implementation
with the existing internal sort implementation from now on.
New patch -- Sequential memory access
===============================
The trick I hit upon for relieving the merge bottleneck was fairly simple.
Prefetching works for internal sorts, but isn't practical for external
sorts while merging. OTOH, I can arrange to have runs allocate their
"tuple proper" contents into a memory pool, partitioned by final
on-the-fly tape number. Today, runs/tapes are slurped from disk
sequentially in a staggered fashion, based on the availability of
in-memory tuples from each tape while merging. The new patch is very
effective in reducing cache misses by simply making sure that each
tape's "tuple proper" (e.g. each IndexTuple) is accessed in memory in
the natural, predictable order (the sorted order that runs on tape
always have). Unlike with internal sorts (where explicit memory
prefetching of each "tuple proper" may be advisable), the final order
in which the caller must consume a tape's "tuple proper" is
predictable well in advance.
A little rearrangement is required to make what were previously retail
palloc() calls during prereading (a palloc() for each "tuple proper",
within each READTUP() routine) consume space from the memory pool
instead. The pool (a big, once-off memory allocation) is reused in a
circular fashion per tape partition. This saves a lot of palloc()
overhead.
Under this scheme, each tape's next few IndexTuples are all in one
cacheline. This patch has the merge step make better use of available
memory bandwidth, rather than attempting to conceal memory latency.
Explicit prefetch instructions (that we may independently end up using
to do something similar with internal sorts when fetching tuples
following sorting proper) are all about hiding latency.
Concrete example -- performance
---------------------------------------------
I attach a text file describing a practical, reproducible example
CREATE INDEX. It shows how CREATE INDEX now compares fairly well with
an equivalent operation that has enough maintenance_work_mem to
complete its sort internally. I'll just summarize it here:
A CREATE INDEX on a single int4 attribute on an unlogged table takes
only ~18% longer. This is a 100 million row table that is 4977 MB on
disk. On master, CREATE INDEX takes 66.6 seconds in total with an
*internal* sort. With the patch series applied, an *external* sort
involving a final on-the-fly merge of 6 runs takes 78.5 seconds.
Obviously, since there are 6 runs to merge, work_mem is only
approximately 1/6 of what is required for a fully internal sort.
High watermark memory usage
------------------------------------------
One concern about the patch may be that it increases the high
watermark memory usage by any on-the-fly final merge step. It takes
full advantage of the availMem allowance at a point where every "tuple
proper" is freed, and availMem has only had SortTuple/memtuples array
"slot" memory subtracted (plus overhead). Memory is allocated in bulk
once, and partitioned among active tapes, with no particular effort
towards limiting memory usage beyond enforcing that we always
!LACKMEM().
A lot of the overhead of many retail palloc() calls is removed by
simply using one big memory allocation. In practice, LACKMEM() will
rarely become true, because the availability of slots now tends to be
the limiting factor. This is partially explained by the number of
slots being established when palloc() overhead was in play, prior to
the final merge step. However, I have concerns about the memory usage
of this new approach.
With the int4 CREATE INDEX case above, which has a uniform
distribution, I noticed that about 40% of each tape's memory space
remains unused when slots are exhausted. Ideally, we'd only have
allocated enough memory to run out at about the same time that slots
are exhausted, since the two would be balanced. This might be possible
for fixed-sized tuples. I have not allocated each final on-the-fly
merge step's active tape's pool individually, because while this waste
of memory is large enough to be annoying, it's not large enough to be
significantly helped by managing a bunch of per-tape buffers and
enlarging them as needed geometrically (e.g. starting small, and
doubling each time the buffer size is hit until the per-tape limit is
finally reached).
The main reason that the high watermark is increased is not because of
this, though. It's mostly just that "tuple proper" memory is not freed
until the sort is done, whereas before there were many small pfree()
calls to match the many palloc() calls -- calls that occurred early
and often. Note that the availability of "slots" (i.e. the size of the
memtuples array, minus one element for each tape's heap item) is
currently determined by whatever size it happened to be at when
memtuples stopped growing, which isn't particularly well principled
(hopefully this is no worse now).
Optimal memory usage
-------------------------------
In the absence of any clear thing to care about most beyond making
sorting faster while still enforcing !LACKMEM(), for now I've kept it
simple. I am saving a lot of memory by clawing back palloc() overhead,
but may be wasting more than that in another way now, to say nothing
of the new high watermark itself. If we're entirely I/O bound, maybe
we should not waste memory by simply not allocating as much anyway
(i.e. the extra memory may only theoretically help even when it is
written to). But what does it really mean to be I/O bound? The OS
cache probably consumes plenty of memory, too.
Finally, let us not forget that it's clearly still the case that even
following this work, run size needs to be optimized using a cost
model, rather than simply being determined by how much memory can be
made available (work_mem). If we get a faster sort using far less
work_mem, then the DBA is probably accidentally wasting huge amounts
of memory due to failing to do that. As an implementor, it's really
hard to balance all of these concerns, or to say that one in
particular is most urgent.
Parallel sorting
===========
Simon rightly emphasized the need for joined-up thinking in relation
to applying important tuplesort optimizations. We must at least
consider parallelism as part of this work.
I'm glad that the first consumer of parallel infrastructure is set to
be parallel sequential scans, not internal parallel sorts. That's
because it seems that overall, a significant cost is actually reading
tuples into memtuples to sort -- heap scanning and related costs in
the buffer manager (even assuming everything is in shared_buffers),
COPYTUP() palloc() calls, and so on. Taken together, they can be a
bigger overall cost than sorting proper, even assuming abbreviated
keys are not used. The third bucket that I tend to categorize costs
into, "time spent actually writing out finished runs", is small on a
well balanced system. Surprisingly small, I would say.
I will sketch a simple implementation of parallel sorting based on the
patch series that may be workable, and requires relatively little
implementation effort compared to other ideas that were raised at
various times:
* Establish an optimal run size ahead of time using a cost model. We
need this for serial external sorts anyway, to relieve the DBA of
having to worry about sizing maintenance_work_mem according to obscure
considerations around cache efficiency within tuplesort. Parallelism
probably doesn't add much complexity to the cost model, which is not
especially complicated to begin with. Note that I have not added this
cost model yet (just the ad-hoc, tuplesort-private cost model for
using replacement selection to get a "quicksort with spillover"). It
may be best if this cost model lives in the optimizer.
* Have parallel workers do a parallel heap scan of the relation until
they fill this optimal run size. Use local memory to sort within
workers. Write runs out in the usual way. Then, the worker picks up
the next run scheduled. If there are no more runs to build, there is
no more work for the parallel workers.
* Shut down workers. Do an on-the-fly merge in the parent process.
This is the same as with a serial merge, but with a little
coordination with worker processes to make sure every run is
available, etc. In general, coordination is kept to an absolute
minimum.
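In pseudo-C, with entirely hypothetical names (nothing like this exists in
the patch series yet), the division of labor would be:

static void
parallel_sort_worker(ParallelHeapScan *scan, TapeSet *shared_tapes,
                     Size optimal_run_size)
{
    for (;;)
    {
        SortBuffer *buf = fill_from_parallel_scan(scan, optimal_run_size);

        if (buf == NULL)
            break;                          /* input exhausted: worker is done */
        quicksort_buffer(buf);              /* sort in worker-local memory */
        write_run(shared_tapes, buf);       /* dump as one ordinary on-tape run */
    }
}

static void
parallel_sort_leader(TapeSet *shared_tapes, TupleSink *out)
{
    wait_for_workers();                     /* every run is now on tape */
    merge_on_the_fly(shared_tapes, out);    /* same final merge as serial case */
}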
I tend to think that this really simple approach would get much of the
gain of something more complicated -- no need to write shared memory
management code, minimal need to handle coordination between workers,
and no real changes to the algorithms used for each sub-problem. This
makes merging more of a bottleneck again, but that is a bottleneck on
I/O and especially memory bandwidth. Parallelism cannot help much with
that anyway (except by compressing runs with offset coding, perhaps,
but that isn't specific to parallelism and won't always help). Writing
out runs in bulk is very fast here -- certainly much faster than I
thought it would be when I started thinking about external sorting.
And if that turns out to be a problem for cases that have sufficient
memory to do everything internally, that can later be worked on
non-invasively.
As I've said in the past, I think parallel sorting only makes sense
when memory latency and bandwidth are not huge bottlenecks, which we
should bend over backwards to avoid. In a sense, you can't really make
use of parallel workers for sorting until you fix that problem first.
I am not suggesting that we do this because it's easier than other
approaches. I think it's actually most effective to not make parallel
sorting too divergent from serial sorting, because making things
cumulative makes speed-ups from localized optimizations cumulative,
while at the same time, AFAICT there isn't anything to recommend
extensive specialization for parallel sort. If what I've sketched is
also a significantly easier approach, then that's a bonus.
--
Peter Geoghegan
Attachments:
quicksort_external_test.txt (text/plain)
Setup
*****
-- 100 million row table, 4977 MB overall size, high cardinality int4 attribute
-- to build index on. Already entirely in shared_buffers in all tests (used
-- pg_prewarm for that, and took a few other noise-avoidance precautions):
CREATE TABLE big_high_cardinality_int4 AS
SELECT (random() * 2000000000)::int4 s,
'abcdefghijlmn'::text junk
FROM generate_series(1, 100000000);
Table details:
postgres=# \dt+ big_high_cardinality_int4
List of relations
Schema | Name | Type | Owner | Size | Description
--------+---------------------------+-------+-------+---------+-------------
public | big_high_cardinality_int4 | table | pg | 4977 MB |
(1 row)
Master branch (internal sort)
*****************************
With just enough memory to do an internal sort (master branch, so no explicit
memory prefetching):
postgres=# SET maintenance_work_mem = '5400MB';
SET
Time: 0.215 ms
postgres=# CREATE INDEX ON big_high_cardinality_int4(s);
CREATE INDEX
Time: 66628.886 ms <---------
trace_sort output
-----------------
begin index sort: unique = f, workMem = 5529600, randomAccess = f
performsort starting: CPU 0.65s/13.50u sec elapsed 14.16 sec
performsort done: CPU 0.65s/43.07u sec elapsed 43.74 sec
internal sort ended, 5494829 KB used: CPU 6.92s/56.64u sec elapsed 66.41 sec <---------
Patch series (external sort)
****************************
postgres=# set maintenance_work_mem = '1GB';
SET
Time: 0.383 ms
postgres=# CREATE INDEX ON big_high_cardinality_int4(s);
CREATE INDEX
Time: 78564.853 ms <---------
trace_sort output
-----------------
begin index sort: unique = f, workMem = 1048576, randomAccess = f
switching to external sort with 3745 tapes: CPU 0.87s/2.53u sec elapsed 3.41 sec
hybrid sort-merge strategy used at row 19173960 crossover 0.825 (est 100000000.00 rows 5.22 runs)
starting quicksort of run 1: CPU 0.87s/2.53u sec elapsed 3.41 sec
finished quicksorting run 1: CPU 0.87s/7.54u sec elapsed 8.42 sec
finished writing run 1 to tape 0: CPU 1.09s/8.70u sec elapsed 9.81 sec
starting quicksort of run 2: CPU 1.09s/12.30u sec elapsed 13.41 sec
finished quicksorting run 2: CPU 1.09s/16.52u sec elapsed 17.63 sec
finished writing run 2 to tape 1: CPU 1.34s/17.46u sec elapsed 18.98 sec
starting quicksort of run 3: CPU 1.34s/21.08u sec elapsed 22.60 sec
finished quicksorting run 3: CPU 1.34s/25.31u sec elapsed 26.83 sec
finished writing run 3 to tape 2: CPU 1.51s/26.33u sec elapsed 28.02 sec
starting quicksort of run 4: CPU 1.51s/29.95u sec elapsed 31.64 sec
finished quicksorting run 4: CPU 1.51s/34.17u sec elapsed 35.86 sec
finished writing run 4 to tape 3: CPU 1.67s/35.17u sec elapsed 37.03 sec
starting quicksort of run 5: CPU 1.67s/38.80u sec elapsed 40.66 sec
finished quicksorting run 5: CPU 1.67s/43.05u sec elapsed 44.91 sec
finished writing run 5 to tape 4: CPU 1.83s/44.06u sec elapsed 46.08 sec
performsort starting: CPU 1.83s/47.55u sec elapsed 49.57 sec
starting quicksort of run 6: CPU 1.83s/47.55u sec elapsed 49.57 sec
finished quicksorting run 6: CPU 1.83s/51.62u sec elapsed 53.64 sec
finished writing run 6 to tape 5: CPU 1.99s/52.59u sec elapsed 54.76 sec
performsort done (except 6-way final merge): CPU 2.55s/53.13u sec elapsed 55.87 sec
external sort ended, 244373 disk blocks used: CPU 9.05s/68.92u sec elapsed 78.50 sec <---------
Conclusions
***********
Index details:
postgres=# \di+ big_high_cardinality_int4_s_idx
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+---------------------------------+-------+-------+---------------------------+---------+-------------
public | big_high_cardinality_int4_s_idx | index | pg | big_high_cardinality_int4 | 2142 MB |
(1 row)
Only an 18% overhead relative to master's internal sort performance (i.e. total
CREATE INDEX duration) is seen here. Obviously, the patch series uses a small
fraction of the work_mem of master, and so is not actually comparable, but
internal sort performance seems close to an ideal to aim for. Note that I have
avoided giving the baseline internal sort performance the benefit of explicit
memory prefetching (by actually using the master branch).
Presumably, a smaller difference (more favorable to the patch) can be observed
with a logged table, but that was avoided. Although an unlogged table was
used, it was stored on a regular, durable ext4 partition on an SSD (which uses
LVM + disk encryption), which is where the index ended up, too (although the
index would not have been fsync()'d, since it was unlogged). tmpfs probably
competed fairly aggressively with that for I/O due to using swap space (there
is only 16 GiB of memory on this system), and so I'm willing to believe that
this competition could be even closer still on a real server. This is just how
one simple, representative case worked out on my laptop, which lacks real I/O
parallelism. Note that the time spent writing out runs is actually really low
here (see trace_sort output), even though the temp_tablespace tmpfs partition
is backed only by a consumer-grade mobile SSD. OS filesystem caching must help
here, but more I/O bandwidth would almost certainly make the patch series
faster for this case.
This result was obtained without going to the trouble of running tests on
hardware that plays to the strengths of the new patch. More favorable CREATE
INDEX numbers seem quite possible, but this was not a priority ahead of patch
submission. A big sort (e.g. one that requires 200GB of temp files before the
merge phase) on a server with lots of memory, and plenty of temp_tablespace I/O
bandwidth would probably be more interesting.
The merge phase's new cache efficiency helps at a point involving comparisons
generally more likely to indicate equality, often overall tuple equality, but
especially equality for earlier attributes. I imagine this helps a lot with
multi-attribute CREATE INDEX cases with moderate cardinality leading
attributes. It must also help with the TID tie-breaker within
comparetup_index_btree(), since pointer chasing need only read from the
IndexTuple already in L1 cache. The fact that second-or-subsequent attributes
are not directly represented within each SortTuple becomes significantly less
important at the right time.
0005-Use-tuple-proper-memory-pool-in-tuplesort.patch (text/x-patch)
From 54585cc93cc5207b80c7dbd8f34772c49290e2f2 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Mon, 24 Aug 2015 13:22:11 -0700
Subject: [PATCH 5/5] Use "tuple proper" memory pool in tuplesort
Allocating one pool of memory for use by READTUP() routines during an
on-the-fly merge phase of an external sort (where tuples are preloaded
from disk in batch) accelerates the merge phase significantly. Merging
was a notable remaining bottleneck following previous commits
overhauling external sorting.
Sequentially consuming memory from the pool accelerates merging because
the natural order of all subsequent access (access during merging
proper) is sequential per tape; cache characteristics are improved. In
addition, modern microarchitectures and compilers may even automatically
perform prefetching with a sequential access pattern.
---
src/backend/utils/sort/tuplesort.c | 346 +++++++++++++++++++++++++++++++++----
1 file changed, 315 insertions(+), 31 deletions(-)
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d828723..7941fcc 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -362,6 +362,23 @@ struct Tuplesortstate
int64 *mergeavailmem; /* availMem for prereading each tape */
int mergefreelist; /* head of freelist of recycled slots */
int mergefirstfree; /* first slot never used in this merge */
+ int64 spacePerTape; /* space (memory) segmented to each tape */
+
+ /*
+ * During final on-the-fly merge steps, an ad-hoc "tuple proper" memory
+ * pool is used, reducing palloc() overhead. Importantly, this also
+ * results in far more cache efficient merging, since each tape's tuples
+ * must naturally be accessed sequentially (in sorted order).
+ *
+ * These variables manage each active tape's ownership of memory pool
+ * space. The underlying buffer is allocated once, just before
+ * TSS_FINALMERGE prereading.
+ */
+ bool usePool; /* using "tuple proper" memory pool? */
+ char **mergetupstart; /* start of each run's partition */
+ char **mergetupcur; /* current offset into each run's partition */
+ char **mergetuptail; /* last appended item's start point */
+ char **mergeoverflow; /* single retail palloc() "overflow" */
/*
* Variables for Algorithm D. Note that destTape is a "logical" tape
@@ -527,9 +544,13 @@ static void selectnewtape(Tuplesortstate *state);
static void mergeruns(Tuplesortstate *state);
static void mergeonerun(Tuplesortstate *state);
static void mergememruns(Tuplesortstate *state);
-static void beginmerge(Tuplesortstate *state);
+static void beginmerge(Tuplesortstate *state, bool finalMerge);
+static void mergepool(Tuplesortstate *state);
+static void mergepoolone(Tuplesortstate *state, int srcTape,
+ SortTuple *stup, bool *should_free);
static void mergepreread(Tuplesortstate *state);
-static void mergeprereadone(Tuplesortstate *state, int srcTape);
+static void mergeprereadone(Tuplesortstate *state, int srcTape,
+ SortTuple *stup, bool *should_free);
static void dumptuples(Tuplesortstate *state, bool alltuples);
static void dumpbatch(Tuplesortstate *state, bool alltuples);
static void make_bounded_heap(Tuplesortstate *state);
@@ -541,6 +562,7 @@ static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
static void reversedirection(Tuplesortstate *state);
static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
static void markrunend(Tuplesortstate *state, int tapenum);
+static void *tupproperalloc(Tuplesortstate *state, int tapenum, Size size);
static int comparetup_heap(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -669,6 +691,7 @@ tuplesort_begin_common(int workMem, double rowNumHint, bool randomAccess)
* inittapes(), if needed
*/
+ state->usePool = false; /* memory pool not used until final merge */
state->result_tape = -1; /* flag that result tape has not been formed */
MemoryContextSwitchTo(oldcontext);
@@ -1780,6 +1803,7 @@ tuplesort_performsort(Tuplesortstate *state)
* Internal routine to fetch the next tuple in either forward or back
* direction into *stup. Returns FALSE if no more tuples.
* If *should_free is set, the caller must pfree stup.tuple when done with it.
+ * Otherwise, caller should not use tuple following next call here.
*/
static bool
tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
@@ -1791,6 +1815,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
{
case TSS_SORTEDINMEM:
Assert(forward || state->randomAccess);
+ Assert(!state->usePool);
*should_free = false;
if (forward)
{
@@ -1856,6 +1881,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
case TSS_SORTEDONTAPE:
Assert(forward || state->randomAccess);
+ Assert(!state->usePool);
*should_free = true;
if (forward)
{
@@ -1941,6 +1967,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
case TSS_MEMTAPEMERGE:
Assert(forward);
+ Assert(!state->usePool);
/* For now, assume tuple returned from memory */
*should_free = false;
@@ -2065,7 +2092,9 @@ just_memtuples:
case TSS_FINALMERGE:
Assert(forward);
- *should_free = true;
+ Assert(state->usePool);
+ /* For now, assume "tuple proper" is from memory pool */
+ *should_free = false;
/*
* This code should match the inner loop of mergeonerun().
@@ -2073,18 +2102,17 @@ just_memtuples:
if (state->memtupcount > 0)
{
int srcTape = state->memtuples[0].tupindex;
- Size tuplen;
int tupIndex;
SortTuple *newtup;
+ /*
+ * Returned tuple is still counted in our memory space most
+ * of the time. See mergepoolone() for discussion of why
+ * caller may occasionally be required to free returned
+ * tuple, and how the preread memory pool is managed with
+ * regard to edge cases more generally.
+ */
*stup = state->memtuples[0];
- /* returned tuple is no longer counted in our memory space */
- if (stup->tuple)
- {
- tuplen = GetMemoryChunkSpace(stup->tuple);
- state->availMem += tuplen;
- state->mergeavailmem[srcTape] += tuplen;
- }
tuplesort_heap_siftup(state, false);
if ((tupIndex = state->mergenext[srcTape]) == 0)
{
@@ -2092,9 +2120,11 @@ just_memtuples:
* out of preloaded data on this tape, try to read more
*
* Unlike mergeonerun(), we only preload from the single
- * tape that's run dry. See mergepreread() comments.
+ * tape that's run dry, though not before preparing its
+ * partition within the memory pool for a new round of
+ * sequential consumption. See mergepreread() comments.
*/
- mergeprereadone(state, srcTape);
+ mergeprereadone(state, srcTape, stup, should_free);
/*
* if still no data, we've reached end of run on this tape
@@ -2156,6 +2186,8 @@ tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
* Fetch the next tuple in either forward or back direction.
* Returns NULL if no more tuples. If *should_free is set, the
* caller must pfree the returned tuple when done with it.
+ * If it is not set, caller should not use tuple following next
+ * call here.
*/
HeapTuple
tuplesort_getheaptuple(Tuplesortstate *state, bool forward, bool *should_free)
@@ -2175,6 +2207,8 @@ tuplesort_getheaptuple(Tuplesortstate *state, bool forward, bool *should_free)
* Fetch the next index tuple in either forward or back direction.
* Returns NULL if no more tuples. If *should_free is set, the
* caller must pfree the returned tuple when done with it.
+ * If it is not set, caller should not use tuple following next
+ * call here.
*/
IndexTuple
tuplesort_getindextuple(Tuplesortstate *state, bool forward,
@@ -2472,6 +2506,10 @@ inittapes(Tuplesortstate *state)
state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
state->mergeavailmem = (int64 *) palloc0(maxTapes * sizeof(int64));
+ state->mergetupstart = (char **) palloc0(maxTapes * sizeof(char *));
+ state->mergetupcur = (char **) palloc0(maxTapes * sizeof(char *));
+ state->mergetuptail = (char **) palloc0(maxTapes * sizeof(char *));
+ state->mergeoverflow = (char **) palloc0(maxTapes * sizeof(char *));
state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
@@ -2659,7 +2697,7 @@ mergeruns(Tuplesortstate *state)
/* Tell logtape.c we won't be writing anymore */
LogicalTapeSetForgetFreeSpace(state->tapeset);
/* Initialize for the final merge pass */
- beginmerge(state);
+ beginmerge(state, true);
state->status = TSS_FINALMERGE;
return;
}
@@ -2765,7 +2803,7 @@ mergeonerun(Tuplesortstate *state)
* Start the merge by loading one tuple from each active source tape into
* the heap. We can also decrease the input run/dummy run counts.
*/
- beginmerge(state);
+ beginmerge(state, false);
/*
* Execute merge by repeatedly extracting lowest tuple in heap, writing it
@@ -2888,13 +2926,12 @@ mergememruns(Tuplesortstate *state)
* fill the merge heap with the first tuple from each active tape.
*/
static void
-beginmerge(Tuplesortstate *state)
+beginmerge(Tuplesortstate *state, bool finalMerge)
{
int activeTapes;
int tapenum;
int srcTape;
int slotsPerTape;
- int64 spacePerTape;
/* Heap should be empty here */
Assert(state->memtupcount == 0);
@@ -2933,19 +2970,29 @@ beginmerge(Tuplesortstate *state)
Assert(activeTapes > 0);
slotsPerTape = (state->memtupsize - state->mergefirstfree) / activeTapes;
Assert(slotsPerTape > 0);
- spacePerTape = state->availMem / activeTapes;
+ state->spacePerTape = state->availMem / activeTapes;
for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
{
if (state->mergeactive[srcTape])
{
state->mergeavailslots[srcTape] = slotsPerTape;
- state->mergeavailmem[srcTape] = spacePerTape;
+ state->mergeavailmem[srcTape] = state->spacePerTape;
}
}
/*
+ * Preallocate "tuple proper" pool memory, and partition pool among
+ * tapes. Actual memory allocation is performed here at most once per
+ * sort, just in advance of the final on-the-fly merge step. That
+ * implies that the optimization is never used by randomAccess callers,
+ * since no on-the-fly final merge step will occur.
+ */
+ if (finalMerge)
+ mergepool(state);
+
+ /*
* Preread as many tuples as possible (and at least one) from each active
- * tape
+ * tape. This may use "tuple proper" pool.
*/
mergepreread(state);
@@ -2971,6 +3018,152 @@ beginmerge(Tuplesortstate *state)
}
/*
+ * mergepool - initialize "tuple proper" memory pool
+ *
+ * This allows sequential access to sorted tuples buffered in memory from
+ * tapes/runs on disk during a final on-the-fly merge step.
+ */
+static void
+mergepool(Tuplesortstate *state)
+{
+ char *tupProperPool;
+ Size poolSize = state->spacePerTape * state->activeTapes;
+ int srcTape;
+ int i;
+
+ /* Heap should be empty here */
+ Assert(state->memtupcount == 0);
+ Assert(state->activeTapes > 0);
+ Assert(!state->randomAccess);
+
+ /*
+ * For the purposes of tuplesort's memory accounting, the memory pool is
+ * not special, and so per-active-tape mergeavailmem is decreased only
+ * as memory is consumed from the pool through regular USEMEM() calls
+ * (see mergeprereadone()). All memory is actually allocated here all
+ * at once, with only rare exceptions, so freeing memory consumed from
+ * the pool must for the most part also occur at a higher, "logical"
+ * level.
+ *
+ * mergeavailmem is reset specially (logically freed) when the partition
+ * space is exhausted for its tape when a new round of prereading is
+ * required. This is okay because there is no "global" availMem
+ * consumer from here on. The inability for such a consumer to consume
+ * memory from the pool is therefore of no concern.
+ *
+ * mergeavailmem is only relied on by the on-the-fly merge step to
+ * determine that space has been exhausted for a tape. There still
+ * needs to be some leeway to perform retail palloc() calls because
+ * LACKMEM() is only a soft limit. In general, mergeavailmem may be
+ * insufficient memory for storing even one tuple.
+ */
+ tupProperPool = MemoryContextAllocHuge(state->sortcontext, poolSize);
+ state->usePool = true;
+
+ i = 0;
+ for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
+ {
+ if (!state->mergeactive[srcTape])
+ continue;
+
+ state->mergetupstart[srcTape] = tupProperPool + (i++ * state->spacePerTape);
+ /* Initialize current point into buffer */
+ state->mergetupcur[srcTape] = state->mergetupstart[srcTape];
+ state->mergetuptail[srcTape] = state->mergetupstart[srcTape];
+ state->mergeoverflow[srcTape] = NULL;
+ }
+ Assert(i == state->activeTapes);
+}
+
+/*
+ * mergepoolone - prepare preallocated pool for one merge input tape
+ *
+ * This is called following the exhaustion of preread tuples for one input
+ * tape. While no memory is actually deallocated (or allocated) here, this
+ * routine could be said to logically free the source tape's segment of the
+ * memory pool. Of course, all that actually occurs is that the memory pool
+ * state for the source tape is reset to indicate that all memory may be
+ * reused. Because the tape's mergeavailmem is also reset to the general
+ * per-active-tape share, calling here will generally result in LACKMEM()
+ * ceasing to apply.
+ *
+ * This routine must deal with fixing up the tuple that is about to be
+ * returned to the client, due to fine aspects of memory management.
+ */
+static void
+mergepoolone(Tuplesortstate *state, int srcTape, SortTuple *stup,
+ bool *should_free)
+{
+ Size lastTupLen;
+
+ /* By here, final on-the-fly merge step actually underway */
+ Assert(state->status == TSS_FINALMERGE);
+
+ /*
+ * Tuple about to be returned to caller ("stup") is final preread tuple
+ * from tape, just removed from the top of the heap. Special steps
+ * around memory management must be performed for that tuple.
+ */
+ if (!state->mergeoverflow[srcTape])
+ {
+ /*
+ * Mark tuple proper buffer range for reuse, but be careful to move
+ * final, tail tuple to start of space for next run so that it's
+ * available to caller when stup is returned, and remains available
+ * at least until the next tuple is requested.
+ */
+ lastTupLen =
+ state->mergetupcur[srcTape] - state->mergetuptail[srcTape];
+ state->mergetupcur[srcTape] = state->mergetupstart[srcTape];
+
+ memmove(state->mergetupcur[srcTape],
+ state->mergetuptail[srcTape],
+ lastTupLen);
+
+ /* Make SortTuple at top of the heap point to new "tuple proper" */
+ stup->tuple = (void *) state->mergetupcur[srcTape];
+ state->mergetupcur[srcTape] += lastTupLen;
+ }
+ else
+ {
+ /*
+ * Handle an "overflow" retail palloc.
+ *
+ * This is needed on the rare occasions when very little work_mem is
+ * available, particularly relative to the size of individual tuples
+ * passed by caller. Sometimes, tuplesort will only store one tuple
+ * per tape in memtuples, so some amount of dynamic allocation is
+ * inevitable.
+ */
+ Size tuplen;
+
+ /* No moving of last "tuple proper" required */
+ lastTupLen = 0;
+ state->mergetupcur[srcTape] = state->mergetupstart[srcTape];
+
+ /* Returned tuple is no longer counted in our memory space */
+ if (stup->tuple)
+ {
+ Assert(stup->tuple == (void *) state->mergeoverflow[srcTape]);
+ tuplen = GetMemoryChunkSpace(stup->tuple);
+ state->availMem += tuplen;
+ state->mergeavailmem[srcTape] += tuplen;
+ /* Caller should free palloc'd tuple proper */
+ *should_free = true;
+ }
+ state->mergeoverflow[srcTape] = NULL;
+ }
+
+ /*
+ * Give back tape's range in memory pool (accounting for tail contents
+ * having been moved to front in the common case where there was no
+ * handling of an "overflow" retail palloc).
+ */
+ state->mergetuptail[srcTape] = state->mergetupstart[srcTape];
+ state->mergeavailmem[srcTape] = state->spacePerTape - lastTupLen;
+}
+
+/*
* mergepreread - load tuples from merge input tapes
*
* This routine exists to improve sequentiality of reads during a merge pass,
@@ -2991,7 +3184,8 @@ beginmerge(Tuplesortstate *state)
* that state and so no point in scanning through all the tapes to fix one.
* (Moreover, there may be quite a lot of inactive tapes in that state, since
* we might have had many fewer runs than tapes. In a regular tape-to-tape
- * merge we can expect most of the tapes to be active.)
+ * merge we can expect most of the tapes to be active. Plus, only FINALMERGE
+ * state has to consider memory management for the "tuple proper" memory pool.)
*/
static void
mergepreread(Tuplesortstate *state)
@@ -2999,7 +3193,7 @@ mergepreread(Tuplesortstate *state)
int srcTape;
for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
- mergeprereadone(state, srcTape);
+ mergeprereadone(state, srcTape, NULL, NULL);
}
/*
@@ -3008,9 +3202,15 @@ mergepreread(Tuplesortstate *state)
* Read tuples from the specified tape until it has used up its free memory
* or array slots; but ensure that we have at least one tuple, if any are
* to be had.
+ *
+ * FINALMERGE state passes *rtup and *should_free variables, to have
+ * pool-related memory management responsibilities handled by
+ * mergepoolone(). Otherwise a memory pool isn't used, and this is not
+ * required.
*/
static void
-mergeprereadone(Tuplesortstate *state, int srcTape)
+mergeprereadone(Tuplesortstate *state, int srcTape, SortTuple *rtup,
+ bool *should_free)
{
unsigned int tuplen;
SortTuple stup;
@@ -3020,6 +3220,26 @@ mergeprereadone(Tuplesortstate *state, int srcTape)
if (!state->mergeactive[srcTape])
return; /* tape's run is already exhausted */
+
+ /*
+ * Manage memory pool segment for tape (if pool is in use).
+ *
+ * This is also a natural point to redraw partitioning boundaries inside
+ * the preread memory pool, by donating the unneeded memory of our tape
+ * (once it becomes exhausted) to some adjacent active tape.
+ *
+ * For now we don't bother, though. It does not seem to help much even
+ * in the event of a strong logical/physical correlation, where earlier
+ * runs will be drained before later runs return their earliest/lowest
+ * tuples. A contributing factor must be that the size of memtuples
+ * during the merge phase is based on memory accounting with significant
+ * palloc() overhead. There is almost always initially somewhat more
+ * memory for each tape in the pool than is needed, because the number
+ * of slots available tends to be the limiting factor then.
+ */
+ if (rtup)
+ mergepoolone(state, srcTape, rtup, should_free);
+
priorAvail = state->availMem;
state->availMem = state->mergeavailmem[srcTape];
while ((state->mergeavailslots[srcTape] > 0 && !LACKMEM(state)) ||
@@ -3666,6 +3886,68 @@ markrunend(Tuplesortstate *state, int tapenum)
LogicalTapeWrite(state->tapeset, tapenum, (void *) &len, sizeof(len));
}
+/*
+ * Allocate memory either using palloc(), or using a dedicated memory pool
+ * "logical allocation" during tuple preloading. READTUP() routines call
+ * here in place of a palloc() and USEMEM() call.
+ *
+ * READTUP() routines may receive memory from the memory pool when calling
+ * here, but at that point the tuples cannot subsequently be used with
+ * WRITETUP() routines, since they are unprepared for any "tuple proper" not
+ * allocated with a retail palloc().
+ *
+ * In the main, it doesn't seem worth optimizing the case where WRITETUP()
+ * must be called following consuming memory from the "tuple proper" pool
+ * (and allocating the pool earlier). During initial sorts of runs, the
+ * order in which tuples will finally need to be written out is
+ * unpredictable, so no big improvement in cache efficiency should be
+ * expected from using a memory pool. Multi-pass sorts are usually
+ * unnecessary and generally best avoided, so again, the non-use of a pool
+ * is not considered a problem. However, randomAccess callers might
+ * benefit appreciably from using a memory pool, so that case may be
+ * revisited as a target for memory pooling in the future.
+ */
+static void *
+tupproperalloc(Tuplesortstate *state, int tapenum, Size size)
+{
+ Size reserve_size = MAXALIGN(size);
+ char *ret;
+
+ if (!state->usePool)
+ {
+ /* Memory pool not in use */
+ ret = palloc(size);
+ USEMEM(state, GetMemoryChunkSpace(ret));
+ }
+ else if (state->mergetupcur[tapenum] + reserve_size <
+ state->mergetupstart[tapenum] + state->spacePerTape)
+ {
+ /*
+ * Usual case -- caller is returned memory from its tape's partition
+ * in buffer, since there is an adequate supply of memory.
+ */
+ ret = state->mergetuptail[tapenum] = state->mergetupcur[tapenum];
+
+ /*
+ * Logically, for accounting purposes, memory from our pool has only
+ * been allocated now. We expect reserve_size to actually diminish
+ * our tape's mergeavailmem, not "global" availMem.
+ */
+ state->mergetupcur[tapenum] += reserve_size;
+ USEMEM(state, reserve_size);
+ }
+ else
+ {
+ /* Should only use overflow allocation once per tape per level */
+ Assert(state->mergeoverflow[tapenum] == NULL);
+ /* palloc() in ordinary way */
+ ret = state->mergeoverflow[tapenum] = palloc(size);
+ USEMEM(state, GetMemoryChunkSpace(ret));
+ }
+
+ return ret;
+}
+
/*
* Routines specialized for HeapTuple (actually MinimalTuple) case
@@ -3826,6 +4108,7 @@ writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
LogicalTapeWrite(state->tapeset, tapenum,
(void *) &tuplen, sizeof(tuplen));
+ Assert(!state->usePool);
FREEMEM(state, GetMemoryChunkSpace(tuple));
heap_free_minimal_tuple(tuple);
}
@@ -3836,11 +4119,10 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
{
unsigned int tupbodylen = len - sizeof(int);
unsigned int tuplen = tupbodylen + MINIMAL_TUPLE_DATA_OFFSET;
- MinimalTuple tuple = (MinimalTuple) palloc(tuplen);
+ MinimalTuple tuple = (MinimalTuple) tupproperalloc(state, tapenum, tuplen);
char *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
HeapTupleData htup;
- USEMEM(state, GetMemoryChunkSpace(tuple));
/* read in the tuple proper */
tuple->t_len = tuplen;
LogicalTapeReadExact(state->tapeset, tapenum,
@@ -4059,6 +4341,7 @@ writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
LogicalTapeWrite(state->tapeset, tapenum,
&tuplen, sizeof(tuplen));
+ Assert(!state->usePool);
FREEMEM(state, GetMemoryChunkSpace(tuple));
heap_freetuple(tuple);
}
@@ -4068,9 +4351,9 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int tuplen)
{
unsigned int t_len = tuplen - sizeof(ItemPointerData) - sizeof(int);
- HeapTuple tuple = (HeapTuple) palloc(t_len + HEAPTUPLESIZE);
+ HeapTuple tuple = (HeapTuple) tupproperalloc(state, tapenum,
+ t_len + HEAPTUPLESIZE);
- USEMEM(state, GetMemoryChunkSpace(tuple));
/* Reconstruct the HeapTupleData header */
tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
tuple->t_len = t_len;
@@ -4359,6 +4642,7 @@ writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
LogicalTapeWrite(state->tapeset, tapenum,
(void *) &tuplen, sizeof(tuplen));
+ Assert(!state->usePool);
FREEMEM(state, GetMemoryChunkSpace(tuple));
pfree(tuple);
}
@@ -4368,9 +4652,8 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len)
{
unsigned int tuplen = len - sizeof(unsigned int);
- IndexTuple tuple = (IndexTuple) palloc(tuplen);
+ IndexTuple tuple = (IndexTuple) tupproperalloc(state, tapenum, tuplen);
- USEMEM(state, GetMemoryChunkSpace(tuple));
LogicalTapeReadExact(state->tapeset, tapenum,
tuple, tuplen);
if (state->randomAccess) /* need trailing length word? */
@@ -4450,6 +4733,7 @@ writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
LogicalTapeWrite(state->tapeset, tapenum,
(void *) &writtenlen, sizeof(writtenlen));
+ Assert(!state->usePool);
if (stup->tuple)
{
FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
@@ -4480,14 +4764,13 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
}
else
{
- void *raddr = palloc(tuplen);
+ void *raddr = tupproperalloc(state, tapenum, tuplen);
LogicalTapeReadExact(state->tapeset, tapenum,
raddr, tuplen);
stup->datum1 = PointerGetDatum(raddr);
stup->isnull1 = false;
stup->tuple = raddr;
- USEMEM(state, GetMemoryChunkSpace(raddr));
}
if (state->randomAccess) /* need trailing length word? */
@@ -4501,6 +4784,7 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
static void
free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
{
+ Assert(!state->usePool);
FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
pfree(stup->tuple);
}
--
1.9.1
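To make the pool scheme in the patch above easier to follow, here is a
minimal, self-contained sketch of the same idea: one large allocation carved
into fixed per-tape segments, a bump pointer per segment, and a retail
allocation fallback for the rare request that no longer fits. All names
(PoolState, pool_alloc, and so on) are mine and purely illustrative; this is
not the patch's API, and it ignores the tail-tuple fixup and the
USEMEM()/FREEMEM() accounting that mergepoolone() has to handle.

    #include <stdlib.h>

    /* Illustrative only: one segment per active tape, carved from one block */
    typedef struct PoolState
    {
        char       *base;       /* single large allocation */
        size_t      seg_size;   /* bytes reserved per tape ("spacePerTape") */
        int         ntapes;     /* number of active tapes */
        size_t     *used;       /* bump offset within each tape's segment */
        void      **overflow;   /* at most one retail allocation per tape */
    } PoolState;

    static PoolState *
    pool_create(int ntapes, size_t seg_size)
    {
        PoolState  *p = malloc(sizeof(PoolState));

        p->base = malloc((size_t) ntapes * seg_size);
        p->seg_size = seg_size;
        p->ntapes = ntapes;
        p->used = calloc(ntapes, sizeof(size_t));
        p->overflow = calloc(ntapes, sizeof(void *));
        return p;
    }

    /*
     * Analogous to tupproperalloc(): bump within the tape's segment if there
     * is room, otherwise fall back to a retail allocation, which (as in the
     * patch) is expected to happen at most once per tape per round.
     */
    static void *
    pool_alloc(PoolState *p, int tape, size_t size)
    {
        size_t      need = (size + 7) & ~(size_t) 7;    /* crude MAXALIGN */

        if (p->used[tape] + need <= p->seg_size)
        {
            void   *ret = p->base + (size_t) tape * p->seg_size + p->used[tape];

            p->used[tape] += need;
            return ret;
        }
        p->overflow[tape] = malloc(size);   /* like mergeoverflow[srcTape] */
        return p->overflow[tape];
    }

    /* Analogous to mergepoolone(): logically free a tape's segment for reuse */
    static void
    pool_reset_tape(PoolState *p, int tape)
    {
        p->used[tape] = 0;
        free(p->overflow[tape]);
        p->overflow[tape] = NULL;
    }

The real code must additionally move the final ("tail") tuple to the front of
the segment before reuse, so that the tuple just returned to the caller stays
valid until the next fetch.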
Attachment: 0004-Prefetch-from-memtuples-array-in-tuplesort.patch (text/x-patch)
From 221a0be949feeea8357fb1e2272325a6b3d97edd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH 4/5] Prefetch from memtuples array in tuplesort
This patch is almost the same as a canonical version, which appears
here: https://commitfest.postgresql.org/6/305/
This version adds some additional tricks specific to new cases for
external sorts. Of particular interest here is the prefetching of each
"tuple proper" during writing of batches of tuples. This is not
intended to be reviewed as part of the external sorting work, and is
provided only as a convenience to reviewers who would like to see where
prefetching can help with external sorts, too.
Original canonical version details:
Testing shows that prefetching the "tuple proper" of a slightly later
SortTuple in the memtuples array during each of many sequential,
in-logical-order SortTuple fetches speeds up various sort-intensive
operations considerably. For example, B-Tree index builds are
accelerated as leaf pages are created from the memtuples array.
(That is, the step following actually "performing" the sort, but before
a tuplesort_end() call is made as the B-Tree spool is destroyed.)
Similarly, ordered set aggregates (all cases except the datum sort case
with a pass-by-value type) and regular heap tuplesorts benefit to about
the same degree. The optimization is only used when sorts fit in
memory, though.
Also, prefetch a few places ahead within the analogous "fetching" point
in tuplestore.c. This appears to offer similar benefits in certain
cases. For example, queries involving large common table expressions
significantly benefit.
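For reviewers less familiar with the intrinsic, the access pattern being
relied on looks roughly like the following (illustrative code, not taken from
the patch; MyTuple and process_one() are placeholders). __builtin_prefetch()
with arguments (addr, 0, 0) issues a read hint with minimal temporal
locality, which is what the new pg_read_prefetch() macro wraps:

    typedef struct MyTuple MyTuple;         /* placeholder type */
    extern void process_one(MyTuple *tup);  /* placeholder per-tuple work */

    void
    process_all(MyTuple **tuples, int ntuples)
    {
        int         i;

        for (i = 0; i < ntuples; i++)
        {
    #ifdef USE_MEM_PREFETCH
            /*
             * Read hint (rw = 0) with minimal temporal locality (0) for the
             * element a few places ahead, so that its cache miss overlaps
             * with processing of the current element.
             */
            if (i + 3 < ntuples && tuples[i + 3] != NULL)
                __builtin_prefetch(tuples[i + 3], 0, 0);
    #endif
            process_one(tuples[i]);
        }
    }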
---
config/c-compiler.m4 | 17 +++++++++++++++++
configure | 31 ++++++++++++++++++++++++++++++
configure.in | 1 +
src/backend/utils/sort/tuplesort.c | 38 +++++++++++++++++++++++++++++++++++++
src/backend/utils/sort/tuplestore.c | 13 +++++++++++++
src/include/c.h | 14 ++++++++++++++
src/include/pg_config.h.in | 3 +++
src/include/pg_config.h.win32 | 3 +++
src/include/pg_config_manual.h | 10 ++++++++++
9 files changed, 130 insertions(+)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 9feec0c..3516314 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -253,6 +253,23 @@ fi])# PGAC_C_BUILTIN_UNREACHABLE
+# PGAC_C_BUILTIN_PREFETCH
+# -------------------------
+# Check if the C compiler understands __builtin_prefetch(),
+# and define HAVE__BUILTIN_PREFETCH if so.
+AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
+[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([],
+[int i = 0;__builtin_prefetch(&i, 0, 3);])],
+[pgac_cv__builtin_prefetch=yes],
+[pgac_cv__builtin_prefetch=no])])
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
+ [Define to 1 if your compiler understands __builtin_prefetch.])
+fi])# PGAC_C_BUILTIN_PREFETCH
+
+
+
# PGAC_C_VA_ARGS
# --------------
# Check if the C compiler understands C99-style variadic macros,
diff --git a/configure b/configure
index 0bed81c..88a7a6d 100755
--- a/configure
+++ b/configure
@@ -11315,6 +11315,37 @@ if test x"$pgac_cv__builtin_unreachable" = xyes ; then
$as_echo "#define HAVE__BUILTIN_UNREACHABLE 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+int
+main ()
+{
+int i = 0;__builtin_prefetch(&i, 0, 3);
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__builtin_prefetch=yes
+else
+ pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_PREFETCH 1" >>confdefs.h
+
+fi
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __VA_ARGS__" >&5
$as_echo_n "checking for __VA_ARGS__... " >&6; }
if ${pgac_cv__va_args+:} false; then :
diff --git a/configure.in b/configure.in
index a28f9dd..778dd61 100644
--- a/configure.in
+++ b/configure.in
@@ -1319,6 +1319,7 @@ PGAC_C_TYPES_COMPATIBLE
PGAC_C_BUILTIN_BSWAP32
PGAC_C_BUILTIN_CONSTANT_P
PGAC_C_BUILTIN_UNREACHABLE
+PGAC_C_BUILTIN_PREFETCH
PGAC_C_VA_ARGS
PGAC_STRUCT_TIMEZONE
PGAC_UNION_SEMUN
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index abbef6c..d828723 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1797,6 +1797,27 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
if (state->current < state->memtupcount)
{
*stup = state->memtuples[state->current++];
+
+ /*
+ * Perform memory prefetch of "tuple proper" of the
+ * SortTuple that's three places ahead of current
+ * (which is returned to caller). Testing shows that
+ * this significantly boosts the performance for
+ * TSS_SORTEDINMEM "forward" callers by hiding memory
+ * latency behind their processing of returned tuples.
+ *
+ * Don't do this for pass-by-value datum sorts; even
+ * though hinting a NULL address does not affect
+ * correctness, it would have a noticeable overhead
+ * here.
+ */
+#ifdef USE_MEM_PREFETCH
+ if (stup->tuple != NULL &&
+ state->current + 2 < state->memtupcount)
+ pg_read_prefetch(
+ state->memtuples[state->current + 2].tuple);
+#endif
+
return true;
}
state->eof_reached = true;
@@ -2025,6 +2046,18 @@ just_memtuples:
if (state->current < state->memtupcount)
{
*stup = state->memtuples[state->current++];
+
+ /*
+ * Once this point is reached, rationale for memory
+ * prefetching is identical to TSS_SORTEDINMEM case.
+ */
+#ifdef USE_MEM_PREFETCH
+ if (stup->tuple != NULL &&
+ state->current + 2 < state->memtupcount)
+ pg_read_prefetch(
+ state->memtuples[state->current + 2].tuple);
+#endif
+
return true;
}
state->eof_reached = true;
@@ -3162,6 +3195,11 @@ dumpbatch(Tuplesortstate *state, bool alltuples)
WRITETUP(state, state->tp_tapenum[state->destTape],
&state->memtuples[i]);
state->memtupcount--;
+
+#ifdef USE_MEM_PREFETCH
+ if (state->memtuples[i].tuple != NULL && i + 2 < memtupwrite)
+ pg_read_prefetch(state->memtuples[i + 2].tuple);
+#endif
}
markrunend(state, state->tp_tapenum[state->destTape]);
state->tp_runs[state->destTape]++;
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 51f474d..e9cb599 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -902,6 +902,19 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
return NULL;
if (readptr->current < state->memtupcount)
{
+ /*
+ * Perform memory prefetch of tuple that's three places
+ * ahead of current (which is returned to caller).
+ * Testing shows that this significantly boosts the
+ * performance for TSS_INMEM "forward" callers by
+ * hiding memory latency behind their processing of
+ * returned tuples.
+ */
+#ifdef USE_MEM_PREFETCH
+ if (readptr->current + 3 < state->memtupcount)
+ pg_read_prefetch(state->memtuples[readptr->current + 3]);
+#endif
+
/* We have another tuple, so return it */
return state->memtuples[readptr->current++];
}
diff --git a/src/include/c.h b/src/include/c.h
index b719eb9..6dd6d44 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -932,6 +932,20 @@ typedef NameData *Name;
#define pg_unreachable() abort()
#endif
+/*
+ * Prefetch support -- Support memory prefetching hints on some platforms.
+ *
+ * pg_read_prefetch() is specialized for the case where an array is accessed
+ * sequentially, and we can prefetch a pointer within the next element (or an
+ * even later element) in order to hide memory latency. This case involves
+ * prefetching addresses with low temporal locality. Note that it's rather
+ * difficult to get any kind of speedup with this; any use of the intrinsic
+ * should be carefully tested. It's okay to pass it an invalid or NULL
+ * address, although it's best avoided.
+ */
+#if defined(USE_MEM_PREFETCH)
+#define pg_read_prefetch(addr) __builtin_prefetch((addr), 0, 0)
+#endif
/* ----------------------------------------------------------------
* Section 8: random stuff
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 8873dcc..d9eda4b 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -675,6 +675,9 @@
/* Define to 1 if your compiler understands __builtin_constant_p. */
#undef HAVE__BUILTIN_CONSTANT_P
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
/* Define to 1 if your compiler understands __builtin_types_compatible_p. */
#undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index ad61392..a2f6eb3 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -523,6 +523,9 @@
/* Define to 1 if your compiler understands __builtin_constant_p. */
/* #undef HAVE__BUILTIN_CONSTANT_P */
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
/* Define to 1 if your compiler understands __builtin_types_compatible_p. */
/* #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P */
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index e278fa0..4c7b1d5 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -153,6 +153,16 @@
#endif
/*
+ * USE_MEM_PREFETCH controls whether Postgres will attempt to use memory
+ * prefetching. Usually the automatic configure tests are sufficient, but
+ * it's conceivable that using prefetching is counter-productive on some
+ * platforms. If necessary you can remove the #define here.
+ */
+#ifdef HAVE__BUILTIN_PREFETCH
+#define USE_MEM_PREFETCH
+#endif
+
+/*
* USE_SSL code should be compiled only when compiling with an SSL
* implementation. (Currently, only OpenSSL is supported, but we might add
* more implementations in the future.)
--
1.9.1
Attachment: 0003-Log-requirement-for-multiple-external-sort-passes.patch (text/x-patch)
From faea51dac940ce6c79235d60ea3b516c186d6e06 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Sun, 16 Aug 2015 21:17:16 -0700
Subject: [PATCH 3/5] Log requirement for multiple external sort passes
The new log message warns users about a sort requiring multiple passes.
This is in the same spirit as checkpoint_warning. It seems very
ill-advised to ever attempt a sort that will require multiple passes on
contemporary hardware, since that can greatly increase the amount of I/O
required, and yet can only occur when available memory is a small
fraction of what is required for a fully internal sort.
A new GUC, multipass_warning, controls this log message. The default is
'on'. Also, a new debug GUC (not available in a standard build) is added
to control whether replacement selection can be avoided for the first
run.
During review, this patch may be useful for highlighting how effectively
replacement selection sort prevents multiple passes during the merge
step (relative to a hybrid sort-merge strategy) in practice.
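For reference, with multipass_warning enabled the new message comes out along
these lines (the tape count and the suggested parameter vary with the sort,
so treat the numbers as illustrative):

    LOG:  a multi-pass external merge sort is required (7 tape maximum)
    HINT:  Consider increasing the configuration parameter "work_mem".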
---
doc/src/sgml/config.sgml | 22 +++++++++++++++++++++
src/backend/utils/misc/guc.c | 29 +++++++++++++++++++++++++---
src/backend/utils/sort/tuplesort.c | 39 ++++++++++++++++++++++++++++++++++++--
src/include/utils/guc.h | 2 ++
4 files changed, 87 insertions(+), 5 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e3dc23b..4be1ad8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1556,6 +1556,28 @@ include_dir 'conf.d'
<title>Disk</title>
<variablelist>
+ <varlistentry id="guc-multipass-warning" xreflabel="multipass_warning">
+ <term><varname>multipass_warning</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>multipass_warning</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Write a message to the server log if an external sort
+ operation requires multiple passes (which suggests that
+ <varname>work_mem</> or <varname>maintenance_work_mem</> may
+ need to be raised). Only a small fraction of the memory
+ required for an internal sort is required for an external sort
+ that makes no more than a single pass (typically less than
+ 1%). Since multi-pass sorts are often much slower, it is
+ advisable to avoid them altogether whenever possible.
+ The default setting is <literal>on</>.
+ Only superusers can change this setting.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-temp-file-limit" xreflabel="temp_file_limit">
<term><varname>temp_file_limit</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3dac51..3302648 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -115,8 +115,9 @@ extern bool synchronize_seqscans;
#ifdef TRACE_SYNCSCAN
extern bool trace_syncscan;
#endif
-#ifdef DEBUG_BOUNDED_SORT
+#ifdef DEBUG_SORT
extern bool optimize_bounded_sort;
+extern bool optimize_avoid_selection;
#endif
static int GUC_check_errcode_value;
@@ -1041,6 +1042,16 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+ {"multipass_warning", PGC_SUSET, LOGGING_WHAT,
+ gettext_noop("Enables warnings if external sorts require more than one pass."),
+ gettext_noop("Write a message to the server log if more than one pass is required "
+ "for an external sort operation.")
+ },
+ &multipass_warning,
+ true,
+ NULL, NULL, NULL
+ },
+ {
{"debug_assertions", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows whether the running server has assertion checks enabled."),
NULL,
@@ -1449,8 +1460,8 @@ static struct config_bool ConfigureNamesBool[] =
},
#endif
-#ifdef DEBUG_BOUNDED_SORT
- /* this is undocumented because not exposed in a standard build */
+#ifdef DEBUG_SORT
+ /* these are undocumented because not exposed in a standard build */
{
{
"optimize_bounded_sort", PGC_USERSET, QUERY_TUNING_METHOD,
@@ -1462,6 +1473,18 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+
+ {
+ {
+ "optimize_avoid_selection", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enable avoiding replacement selection using heap sort."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &optimize_avoid_selection,
+ true,
+ NULL, NULL, NULL
+ },
#endif
#ifdef WAL_DEBUG
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index cca9683..abbef6c 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -160,8 +160,11 @@
bool trace_sort = false;
#endif
-#ifdef DEBUG_BOUNDED_SORT
+bool multipass_warning = true;
+
+#ifdef DEBUG_SORT
bool optimize_bounded_sort = true;
+bool optimize_avoid_selection = true;
#endif
@@ -250,6 +253,7 @@ struct Tuplesortstate
{
TupSortStatus status; /* enumerated value as shown above */
int nKeys; /* number of columns in sort key */
+ bool querySort; /* sort associated with query execution */
double rowNumHint; /* caller's hint of total # of rows */
bool randomAccess; /* did caller request random access? */
bool bounded; /* did caller specify a maximum number of
@@ -697,6 +701,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
#endif
state->nKeys = nkeys;
+ state->querySort = true;
TRACE_POSTGRESQL_SORT_START(HEAP_SORT,
false, /* no unique check */
@@ -771,6 +776,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
#endif
state->nKeys = RelationGetNumberOfAttributes(indexRel);
+ state->querySort = false;
TRACE_POSTGRESQL_SORT_START(CLUSTER_SORT,
false, /* no unique check */
@@ -864,6 +870,7 @@ tuplesort_begin_index_btree(Relation heapRel,
#endif
state->nKeys = RelationGetNumberOfAttributes(indexRel);
+ state->querySort = false;
TRACE_POSTGRESQL_SORT_START(INDEX_SORT,
enforceUnique,
@@ -939,6 +946,7 @@ tuplesort_begin_index_hash(Relation heapRel,
#endif
state->nKeys = 1; /* Only one sort column, the hash code */
+ state->querySort = false;
state->comparetup = comparetup_index_hash;
state->copytup = copytup_index;
@@ -976,6 +984,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
#endif
state->nKeys = 1; /* always a one-column sort */
+ state->querySort = true;
TRACE_POSTGRESQL_SORT_START(DATUM_SORT,
false, /* no unique check */
@@ -1042,7 +1051,7 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
Assert(state->memtupcount == 0);
Assert(!state->bounded);
-#ifdef DEBUG_BOUNDED_SORT
+#ifdef DEBUG_SORT
/* Honor GUC setting that disables the feature (for easy testing) */
if (!optimize_bounded_sort)
return;
@@ -2310,6 +2319,12 @@ useselection(Tuplesortstate *state)
double crossover;
bool useSelection;
+#ifdef DEBUG_SORT
+ /* Honor GUC setting that disables the feature (for easy testing) */
+ if (!optimize_avoid_selection)
+ return true;
+#endif
+
/* For randomAccess callers, "quicksort with spillover" is never used */
if (state->randomAccess)
return false;
@@ -2528,6 +2543,12 @@ selectnewtape(Tuplesortstate *state)
static void
mergeruns(Tuplesortstate *state)
{
+#ifdef TRACE_SORT
+ bool multiwarned = !(multipass_warning || trace_sort);
+#else
+ bool multiwarned = !multipass_warning;
+#endif
+
int tapenum,
svTape,
svRuns,
@@ -2639,6 +2660,20 @@ mergeruns(Tuplesortstate *state)
/* Step D6: decrease level */
if (--state->Level == 0)
break;
+
+ if (!multiwarned)
+ {
+ int64 memNowUsed = state->allowedMem - state->availMem;
+
+ ereport(LOG,
+ (errmsg("a multi-pass external merge sort is required "
+ "(%d tape maximum)", state->maxTapes),
+ errhint("Consider increasing the configuration parameter \"%s\".",
+ state->querySort ? "work_mem" : "maintenance_work_mem")));
+
+ multiwarned = true;
+ }
+
/* rewind output tape T to use as new input */
LogicalTapeRewind(state->tapeset, state->tp_tapenum[state->tapeRange],
false);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index dc167f9..1e1519a 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -272,6 +272,8 @@ extern int tcp_keepalives_count;
extern bool trace_sort;
#endif
+extern bool multipass_warning;
+
/*
* Functions exported by guc.c
*/
--
1.9.1
Attachment: 0002-Further-diminish-role-of-replacement-selection.patch (text/x-patch)
From 819c3293385a1de811406448216a6a88b5ca70c8 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Thu, 13 Aug 2015 14:32:32 -0700
Subject: [PATCH 2/5] Further diminish role of replacement selection
Tuplesort callers now provide a total row estimate hint. This is used
to determine if replacement selection will be viable even for the first
run using a simple ad-hoc cost model. "Quicksort with spillover" is the
sole remaining justification for going with replacement selection for
even the first run, and if that optimization cannot be applied it is
worth avoiding any use of replacement selection. Even cases considered
sympathetic to replacement selection (e.g. almost-sorted input) do not
appear to come out ahead on modern hardware, so this simple cost model
is reasonably complete.
Prior to this commit, replacement selection's tendency to produce longer
runs was a cost rather than a benefit where a "quicksort with spillover"
was not ultimately performed. The merge phase was often left with one
long replacement selection run, and several other runs strictly bounded
in size by work_mem (these were 50% of the size of the first run on
average).
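To make the ad-hoc cost model concrete, here is a worked example of the
crossover computation as it appears in useselection() below (the numbers are
illustrative). Suppose memory fills after 1,000,000 tuples averaging 96 bytes
each, and the caller's hint is 1,150,000 rows in total:

    increments = MAXALIGN_DOWN(96) / 32 = 3
    crossover  = 0.90 - 3 * 0.075       = 0.675  (inside the [0.40, 0.85] clamp)
    threshold  = rowNumHint * crossover = 1,150,000 * 0.675 = 776,250

Since memtupcount (1,000,000) exceeds that threshold, replacement selection
is attempted for the first run in the hope of a "quicksort with spillover".
With a hint of, say, 5,000,000 rows the same sort would instead go straight
to the hybrid sort-merge strategy.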
---
src/backend/access/hash/hash.c | 2 +-
src/backend/access/hash/hashsort.c | 4 +-
src/backend/access/nbtree/nbtree.c | 11 +-
src/backend/access/nbtree/nbtsort.c | 10 +-
src/backend/catalog/index.c | 1 +
src/backend/commands/cluster.c | 4 +-
src/backend/executor/nodeAgg.c | 26 ++++-
src/backend/executor/nodeSort.c | 1 +
src/backend/utils/adt/orderedsetaggs.c | 13 ++-
src/backend/utils/sort/tuplesort.c | 186 +++++++++++++++++++++++++--------
src/include/access/hash.h | 3 +-
src/include/access/nbtree.h | 2 +-
src/include/executor/nodeAgg.h | 2 +
src/include/utils/tuplesort.h | 15 ++-
14 files changed, 218 insertions(+), 62 deletions(-)
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 24b06a5..8f71980 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -86,7 +86,7 @@ hashbuild(PG_FUNCTION_ARGS)
* one page.
*/
if (num_buckets >= (uint32) NBuffers)
- buildstate.spool = _h_spoolinit(heap, index, num_buckets);
+ buildstate.spool = _h_spoolinit(heap, index, num_buckets, reltuples);
else
buildstate.spool = NULL;
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index c67c057..5c7e137 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -44,7 +44,8 @@ struct HSpool
* create and initialize a spool structure
*/
HSpool *
-_h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
+_h_spoolinit(Relation heap, Relation index, uint32 num_buckets,
+ double reltuples)
{
HSpool *hspool = (HSpool *) palloc0(sizeof(HSpool));
uint32 hash_mask;
@@ -71,6 +72,7 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
index,
hash_mask,
maintenance_work_mem,
+ reltuples,
false);
return hspool;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..0957e0f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "optimizer/plancat.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -85,7 +86,9 @@ btbuild(PG_FUNCTION_ARGS)
Relation index = (Relation) PG_GETARG_POINTER(1);
IndexInfo *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
IndexBuildResult *result;
+ BlockNumber relpages;
double reltuples;
+ double allvisfrac;
BTBuildState buildstate;
buildstate.isUnique = indexInfo->ii_Unique;
@@ -100,6 +103,9 @@ btbuild(PG_FUNCTION_ARGS)
ResetUsage();
#endif /* BTREE_BUILD_STATS */
+ /* Estimate the number of rows currently present in the table */
+ estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
+
/*
* We expect to be called exactly once for any index relation. If that's
* not the case, big trouble's what we have.
@@ -108,14 +114,15 @@ btbuild(PG_FUNCTION_ARGS)
elog(ERROR, "index \"%s\" already contains data",
RelationGetRelationName(index));
- buildstate.spool = _bt_spoolinit(heap, index, indexInfo->ii_Unique, false);
+ buildstate.spool = _bt_spoolinit(heap, index, indexInfo->ii_Unique, false,
+ reltuples);
/*
* If building a unique index, put dead tuples in a second spool to keep
* them out of the uniqueness check.
*/
if (indexInfo->ii_Unique)
- buildstate.spool2 = _bt_spoolinit(heap, index, false, true);
+ buildstate.spool2 = _bt_spoolinit(heap, index, false, true, reltuples);
/* do the heap scan */
reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..0d4a5ea 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -149,7 +149,8 @@ static void _bt_load(BTWriteState *wstate,
* create and initialize a spool structure
*/
BTSpool *
-_bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
+_bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead,
+ double reltuples)
{
BTSpool *btspool = (BTSpool *) palloc0(sizeof(BTSpool));
int btKbytes;
@@ -165,10 +166,15 @@ _bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
* unique index actually requires two BTSpool objects. We expect that the
* second one (for dead tuples) won't get very full, so we give it only
* work_mem.
+ *
+ * reltuples hint does not account for factors like whether or not this is
+ * a partial index, or if this is the second BTSpool object, because it seems
+ * more conservative to estimate high.
*/
btKbytes = isdead ? work_mem : maintenance_work_mem;
btspool->sortstate = tuplesort_begin_index_btree(heap, index, isunique,
- btKbytes, false);
+ btKbytes, reltuples,
+ false);
return btspool;
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..88ee81d 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2835,6 +2835,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
state.tuplesort = tuplesort_begin_datum(TIDOID, TIDLessOperator,
InvalidOid, false,
maintenance_work_mem,
+ ivinfo.num_heap_tuples,
false);
state.htups = state.itups = state.tups_inserted = 0;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..23f6459 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -891,7 +891,9 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
/* Set up sorting if wanted */
if (use_sort)
tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
- maintenance_work_mem, false);
+ maintenance_work_mem,
+ OldHeap->rd_rel->reltuples,
+ false);
else
tuplesort = NULL;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2e36855..f580cca 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -520,6 +520,7 @@ initialize_phase(AggState *aggstate, int newphase)
sortnode->collations,
sortnode->nullsFirst,
work_mem,
+ sortnode->plan.plan_rows,
false);
}
@@ -588,7 +589,8 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
pertrans->sortOperators[0],
pertrans->sortCollations[0],
pertrans->sortNullsFirst[0],
- work_mem, false);
+ work_mem, agg_input_rows(aggstate),
+ false);
else
pertrans->sortstates[aggstate->current_set] =
tuplesort_begin_heap(pertrans->evaldesc,
@@ -597,7 +599,8 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
pertrans->sortOperators,
pertrans->sortCollations,
pertrans->sortNullsFirst,
- work_mem, false);
+ work_mem, agg_input_rows(aggstate),
+ false);
}
/*
@@ -1439,6 +1442,25 @@ find_hash_columns(AggState *aggstate)
}
/*
+ * Estimate the number of rows input to the sorter.
+ *
+ * Exported for use by ordered-set aggregates.
+ */
+double
+agg_input_rows(AggState *aggstate)
+{
+ Plan *outerNode;
+
+ /*
+ * Get information about the size of the relation to be sorted (it's the
+ * "outer" subtree of this node)
+ */
+ outerNode = outerPlanState(aggstate)->plan;
+
+ return outerNode->plan_rows;
+}
+
+/*
* Estimate per-hash-table-entry overhead for the planner.
*
* Note that the estimate does not include space for pass-by-reference
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index af1dccf..e4b1104 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -89,6 +89,7 @@ ExecSort(SortState *node)
plannode->collations,
plannode->nullsFirst,
work_mem,
+ plannode->plan.plan_rows,
node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 39ed85b..b51a945 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -20,6 +20,7 @@
#include "catalog/pg_operator.h"
#include "catalog/pg_type.h"
#include "executor/executor.h"
+#include "executor/nodeAgg.h"
#include "miscadmin.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/tlist.h"
@@ -103,6 +104,7 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
{
OSAPerGroupState *osastate;
OSAPerQueryState *qstate;
+ AggState *aggstate;
MemoryContext gcontext;
MemoryContext oldcontext;
@@ -117,8 +119,11 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
/*
* We keep a link to the per-query state in fn_extra; if it's not there,
* create it, and do the per-query setup we need.
+ *
+ * aggstate is used to get a hint of the total number of tuples for the tuplesort.
*/
qstate = (OSAPerQueryState *) fcinfo->flinfo->fn_extra;
+ aggstate = (AggState *) fcinfo->context;
if (qstate == NULL)
{
Aggref *aggref;
@@ -276,13 +281,17 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
- work_mem, false);
+ work_mem,
+ agg_input_rows(aggstate),
+ false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
qstate->sortCollation,
qstate->sortNullsFirst,
- work_mem, false);
+ work_mem,
+ agg_input_rows(aggstate),
+ false);
osastate->number_of_rows = 0;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 55b0583..cca9683 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -13,11 +13,13 @@
* See Knuth, volume 3, for more than you want to know about the external
* sorting algorithm. We divide the input into sorted runs using replacement
* selection, in the form of a priority tree implemented as a heap
- * (essentially his Algorithm 5.2.3H -- although that strategy can be
- * abandoned where it does not appear to help), then merge the runs using
- * polyphase merge, Knuth's Algorithm 5.4.2D. The logical "tapes" used by
- * Algorithm D are implemented by logtape.c, which avoids space wastage by
- * recycling disk space as soon as each block is read from its "tape".
+ * (essentially his Algorithm 5.2.3H -- although that strategy is often
+ * avoided altogether), then merge the runs using polyphase merge, Knuth's
+ * Algorithm 5.4.2D. The logical "tapes" used by Algorithm D are
+ * implemented by logtape.c, which avoids space wastage by recycling disk
+ * space as soon as each block is read from its "tape". Note that a hybrid
+ * sort-merge strategy is usually used in practice, because maintaining a
+ * priority tree/heap is expensive.
*
* We do not form the initial runs using Knuth's recommended replacement
* selection data structure (Algorithm 5.4.1R), because it uses a fixed
@@ -108,10 +110,13 @@
* If, having maintained a replacement selection priority queue (heap) for
* the first run it transpires that there will be multiple on-tape runs
* anyway, we abandon treating memtuples as a heap, and quicksort and write
- * in memtuples-sized batches. This gives us most of the advantages of
- * always quicksorting and batch dumping runs, which can perform much better
- * than heap sorting and incrementally spilling tuples, without giving up on
- * replacement selection in cases where it remains compelling.
+ * in memtuples-sized batches. This allows a "quicksort with spillover" to
+ * occur, but that remains about the only truly compelling case for
+ * replacement selection. Callers provide a hint for the total number of
+ * rows, used to avoid replacement selection when a "quicksort with
+ * spillover" is not anticipated -- see useselection(). A hybrid sort-merge
+ * strategy can be much faster for very large inputs when replacement
+ * selection is never attempted.
*
*
* Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
@@ -245,6 +250,7 @@ struct Tuplesortstate
{
TupSortStatus status; /* enumerated value as shown above */
int nKeys; /* number of columns in sort key */
+ double rowNumHint; /* caller's hint of total # of rows */
bool randomAccess; /* did caller request random access? */
bool bounded; /* did caller specify a maximum number of
* tuples to return? */
@@ -313,7 +319,9 @@ struct Tuplesortstate
/*
* While building initial runs, this indicates if the replacement
* selection strategy or simple hybrid sort-merge strategy is in use.
- * Replacement selection is abandoned after first run.
+ * Replacement selection may be determined to not be effective ahead of
+ * time, based on a caller-supplied hint. Otherwise, it is abandoned
+ * after first run.
*/
bool replaceActive;
@@ -505,9 +513,11 @@ struct Tuplesortstate
} while(0)
-static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
+static Tuplesortstate *tuplesort_begin_common(int workMem, double rowNumHint,
+ bool randomAccess);
static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
static bool consider_abort_common(Tuplesortstate *state);
+static bool useselection(Tuplesortstate *state);
static void inittapes(Tuplesortstate *state);
static void selectnewtape(Tuplesortstate *state);
static void mergeruns(Tuplesortstate *state);
@@ -584,12 +594,14 @@ static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
* Each variant of tuplesort_begin has a workMem parameter specifying the
* maximum number of kilobytes of RAM to use before spilling data to disk.
* (The normal value of this parameter is work_mem, but some callers use
- * other values.) Each variant also has a randomAccess parameter specifying
- * whether the caller needs non-sequential access to the sort result.
+ * other values.) Each variant also has a hint parameter of the total
+ * number of rows to be sorted, and a randomAccess parameter specifying
+ * whether the caller needs non-sequential access to the sort result. Since
+ * rowNumHint is just a hint, it's acceptable for it to be zero or negative.
*/
static Tuplesortstate *
-tuplesort_begin_common(int workMem, bool randomAccess)
+tuplesort_begin_common(int workMem, double rowNumHint, bool randomAccess)
{
Tuplesortstate *state;
MemoryContext sortcontext;
@@ -619,6 +631,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
#endif
state->status = TSS_INITIAL;
+ state->rowNumHint = rowNumHint;
state->randomAccess = randomAccess;
state->bounded = false;
state->boundUsed = false;
@@ -664,9 +677,11 @@ tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
- int workMem, bool randomAccess)
+ int workMem, double rowNumHint,
+ bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+ randomAccess);
MemoryContext oldcontext;
int i;
@@ -734,9 +749,11 @@ tuplesort_begin_heap(TupleDesc tupDesc,
Tuplesortstate *
tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
- int workMem, bool randomAccess)
+ int workMem,
+ double rowNumHint, bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+ randomAccess);
ScanKey indexScanKey;
MemoryContext oldcontext;
int i;
@@ -827,9 +844,11 @@ Tuplesortstate *
tuplesort_begin_index_btree(Relation heapRel,
Relation indexRel,
bool enforceUnique,
- int workMem, bool randomAccess)
+ int workMem,
+ double rowNumHint, bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+ randomAccess);
ScanKey indexScanKey;
MemoryContext oldcontext;
int i;
@@ -902,9 +921,11 @@ Tuplesortstate *
tuplesort_begin_index_hash(Relation heapRel,
Relation indexRel,
uint32 hash_mask,
- int workMem, bool randomAccess)
+ int workMem,
+ double rowNumHint, bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+ randomAccess);
MemoryContext oldcontext;
oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -937,9 +958,10 @@ tuplesort_begin_index_hash(Relation heapRel,
Tuplesortstate *
tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
bool nullsFirstFlag,
- int workMem, bool randomAccess)
+ int workMem, double rowNumHint, bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+ randomAccess);
MemoryContext oldcontext;
int16 typlen;
bool typbyval;
@@ -2271,6 +2293,77 @@ tuplesort_merge_order(int64 allowedMem)
}
/*
+ * useselection - determine if one replacement selection run should be
+ * attempted.
+ *
+ * This is called when we just ran out of memory, and must consider costs
+ * and benefits of replacement selection for first run, which can result in
+ * a "quicksort with spillover". Note that replacement selection is always
+ * abandoned after the first run.
+ */
+static bool
+useselection(Tuplesortstate *state)
+{
+ int64 memNowUsed = state->allowedMem - state->availMem;
+ double avgTupleSize;
+ int increments;
+ double crossover;
+ bool useSelection;
+
+ /* For randomAccess callers, "quicksort with spillover" is never used */
+ if (state->randomAccess)
+ return false;
+
+ /*
+ * The crossover point lies somewhere between memtuples holding 40% of the
+ * total tuples to sort and holding all but one of them. This weighs the
+ * approximate savings in I/O against the generic cost of heap sorting.
+ */
+ avgTupleSize = (double) memNowUsed / (double) state->memtupsize;
+
+ /*
+ * Starting from a threshold of 90%, refund 7.5% per 32 byte
+ * average-size-increment.
+ */
+ increments = MAXALIGN_DOWN((int) avgTupleSize) / 32;
+ crossover = 0.90 - (increments * 0.075);
+
+ /*
+ * Clamp, making either outcome possible regardless of average size.
+ *
+ * 40% is about the minimum point at which "quicksort with spillover"
+ * can still occur without a logical/physical correlation.
+ */
+ crossover = Max(0.40, Min(crossover, 0.85));
+
+ /*
+ * The point where the overhead of maintaining the heap invariant is
+ * likely to dominate over any saving in I/O is somewhat arbitrarily
+ * assumed to be the point where memtuples' size exceeds MaxAllocSize
+ * (note that overall memory consumption may be far greater). Past this
+ * point, only the most compelling cases use replacement selection for
+ * their first run.
+ */
+ if (sizeof(SortTuple) * state->memtupcount > MaxAllocSize)
+ crossover = avgTupleSize > 32 ? 0.90 : 0.95;
+
+ useSelection = state->memtupcount > state->rowNumHint * crossover;
+
+#ifdef TRACE_SORT
+ if (trace_sort)
+ elog(LOG,
+ "%s used at row %d crossover %.3f (est %.2f rows %.2f runs)",
+ useSelection ?
+ "replacement selection (quicksort with spillover) strategy" :
+ "hybrid sort-merge strategy",
+ state->memtupcount, crossover, state->rowNumHint,
+ state->rowNumHint / state->memtupcount);
+#endif
+
+ return useSelection;
+}
+
+/*
* inittapes - initialize for tape sorting.
*
* This is called only if we have found we don't have room to sort in memory.
@@ -2279,7 +2372,6 @@ static void
inittapes(Tuplesortstate *state)
{
int maxTapes,
- ntuples,
j;
int64 tapeSpace;
@@ -2338,32 +2430,38 @@ inittapes(Tuplesortstate *state)
state->tp_tapenum = (int *) palloc0(maxTapes * sizeof(int));
/*
- * Give replacement selection a try. There will be a switch to a simple
- * hybrid sort-merge strategy after the first run (iff there is to be a
- * second on-tape run).
+ * Give replacement selection a try when the estimated number of tuples to
+ * be sorted suggests a reasonable chance of a "quicksort with spillover".
+ * There will be a switch to a simple hybrid sort-merge strategy after
+ * the first run (iff there is to be a second on-tape run).
*/
- state->replaceActive = true;
+ state->replaceActive = useselection(state);
state->cached = false;
state->just_memtuples = false;
- /*
- * Convert the unsorted contents of memtuples[] into a heap. Each tuple is
- * marked as belonging to run number zero.
- *
- * NOTE: we pass false for checkIndex since there's no point in comparing
- * indexes in this step, even though we do intend the indexes to be part
- * of the sort key...
- */
- ntuples = state->memtupcount;
- state->memtupcount = 0; /* make the heap empty */
- for (j = 0; j < ntuples; j++)
+ if (state->replaceActive)
{
- /* Must copy source tuple to avoid possible overwrite */
- SortTuple stup = state->memtuples[j];
+ /*
+ * Convert the unsorted contents of memtuples[] into a heap. Each
+ * tuple is marked as belonging to run number zero.
+ *
+ * NOTE: we pass false for checkIndex since there's no point in
+ * comparing indexes in this step, even though we do intend the
+ * indexes to be part of the sort key...
+ */
+ int ntuples = state->memtupcount;
- tuplesort_heap_insert(state, &stup, 0, false);
+ state->memtupcount = 0; /* make the heap empty */
+
+ for (j = 0; j < ntuples; j++)
+ {
+ /* Must copy source tuple to avoid possible overwrite */
+ SortTuple stup = state->memtuples[j];
+
+ tuplesort_heap_insert(state, &stup, 0, false);
+ }
+ Assert(state->memtupcount == ntuples);
}
- Assert(state->memtupcount == ntuples);
state->currentRun = 0;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 97cb859..95acc1d 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -335,7 +335,8 @@ extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
/* hashsort.c */
typedef struct HSpool HSpool; /* opaque struct in hashsort.c */
-extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets);
+extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets,
+ double reltuples);
extern void _h_spooldestroy(HSpool *hspool);
extern void _h_spool(HSpool *hspool, ItemPointer self,
Datum *values, bool *isnull);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9e48efd..5504b7b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -743,7 +743,7 @@ extern void BTreeShmemInit(void);
typedef struct BTSpool BTSpool; /* opaque type known only within nbtsort.c */
extern BTSpool *_bt_spoolinit(Relation heap, Relation index,
- bool isunique, bool isdead);
+ bool isunique, bool isdead, double reltuples);
extern void _bt_spooldestroy(BTSpool *btspool);
extern void _bt_spool(BTSpool *btspool, ItemPointer self,
Datum *values, bool *isnull);
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index fe3b81a..e6144f2 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -21,6 +21,8 @@ extern TupleTableSlot *ExecAgg(AggState *node);
extern void ExecEndAgg(AggState *node);
extern void ExecReScanAgg(AggState *node);
+extern double agg_input_rows(AggState *aggstate);
+
extern Size hash_agg_entry_size(int numAggs);
extern Datum aggregate_dummy(PG_FUNCTION_ARGS);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 3679815..11a5fb7 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -62,22 +62,27 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
- int workMem, bool randomAccess);
+ int workMem,
+ double rowNumHint, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
- int workMem, bool randomAccess);
+ int workMem,
+ double rowNumHint, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
Relation indexRel,
bool enforceUnique,
- int workMem, bool randomAccess);
+ int workMem,
+ double rowNumHint, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
Relation indexRel,
uint32 hash_mask,
- int workMem, bool randomAccess);
+ int workMem,
+ double rowNumHint, bool randomAccess);
extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
Oid sortOperator, Oid sortCollation,
bool nullsFirstFlag,
- int workMem, bool randomAccess);
+ int workMem,
+ double rowNumHint, bool randomAccess);
extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
--
1.9.1
Attachment: 0001-Quicksort-when-performing-external-sorts.patch (text/x-patch)
From a45b6cb684a2f9ebae5f194e2af5d6e85c8b7dc5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Wed, 29 Jul 2015 15:38:12 -0700
Subject: [PATCH 1/5] Quicksort when performing external sorts
Add "quicksort with spillover". This allows an external sort that is
within about 2x of work_mem to avoid writing out most tuples, and to
quicksort perhaps almost all tuples rather than performing a degenerate
heapsort. Often, an external sort is now only marginally more expensive
than an internal sort, which is a significant improvement. Sort
performance is made much more predictable as tables gradually increase
in size.
In addition, have tuplesort give up on replacement selection after the
first run, where it is generally not the fastest approach. Most of the
benefits of replacement selection are seen where incremental spilling
rather than spilling in batches allows a "quicksort with spillover" to
ultimately write almost no tuples out.
When a second or subsequent run is necessary (rather than preliminarily
appearing necessary, something a "quicksort with spillover" is often
able to disregard), the second and subsequent run tuples are simply
stored in no particular order initially, and finally quicksorted and
dumped in batch when work_mem is once again exhausted. Testing has
shown this to be much faster in many realistic cases, although there is
no saving in I/O. Overall, cache efficiency ought to be the dominant
consideration when engineering an external sort routine targeting modern
hardware.
---
src/backend/commands/explain.c | 13 +-
src/backend/utils/sort/tuplesort.c | 573 ++++++++++++++++++++++++++++++++-----
src/include/utils/tuplesort.h | 3 +-
3 files changed, 519 insertions(+), 70 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 5d06fa4..94b1f77 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2178,20 +2178,27 @@ show_sort_info(SortState *sortstate, ExplainState *es)
const char *sortMethod;
const char *spaceType;
long spaceUsed;
+ int rowsSortedMem;
- tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+ tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed,
+ &rowsSortedMem);
if (es->format == EXPLAIN_FORMAT_TEXT)
{
appendStringInfoSpaces(es->str, es->indent * 2);
- appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n",
- sortMethod, spaceType, spaceUsed);
+ appendStringInfo(es->str,
+ "Sort Method: %s %s: %ldkB Rows In Memory: %d\n",
+ sortMethod,
+ spaceType,
+ spaceUsed,
+ rowsSortedMem);
}
else
{
ExplainPropertyText("Sort Method", sortMethod, es);
ExplainPropertyLong("Sort Space Used", spaceUsed, es);
ExplainPropertyText("Sort Space Type", spaceType, es);
+ ExplainPropertyInteger("Rows In Memory", rowsSortedMem, es);
}
}
}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d532e87..55b0583 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -8,15 +8,16 @@
* if necessary). It works efficiently for both small and large amounts
* of data. Small amounts are sorted in-memory using qsort(). Large
* amounts are sorted using temporary files and a standard external sort
- * algorithm.
+ * algorithm with numerous special enhancements.
*
* See Knuth, volume 3, for more than you want to know about the external
* sorting algorithm. We divide the input into sorted runs using replacement
* selection, in the form of a priority tree implemented as a heap
- * (essentially his Algorithm 5.2.3H), then merge the runs using polyphase
- * merge, Knuth's Algorithm 5.4.2D. The logical "tapes" used by Algorithm D
- * are implemented by logtape.c, which avoids space wastage by recycling
- * disk space as soon as each block is read from its "tape".
+ * (essentially his Algorithm 5.2.3H -- although that strategy can be
+ * abandoned where it does not appear to help), then merge the runs using
+ * polyphase merge, Knuth's Algorithm 5.4.2D. The logical "tapes" used by
+ * Algorithm D are implemented by logtape.c, which avoids space wastage by
+ * recycling disk space as soon as each block is read from its "tape".
*
* We do not form the initial runs using Knuth's recommended replacement
* selection data structure (Algorithm 5.4.1R), because it uses a fixed
@@ -72,6 +73,15 @@
* to one run per logical tape. The final merge is then performed
* on-the-fly as the caller repeatedly calls tuplesort_getXXX; this
* saves one cycle of writing all the data out to disk and reading it in.
+ * Also, if only one run is spilled to tape so far when
+ * tuplesort_performsort() is reached, and if the caller does not require
+ * random access, then the merge step can take place between still
+ * in-memory tuples, and tuples stored on tape (it does not matter that
+ * there may be a second run in that array -- only that a second one has
+ * spilled). This ensures that spilling to disk only occurs for a number of
+ * tuples approximately equal to the number of tuples read in after
+ * work_mem was reached and it became apparent that an external sort is
+ * required.
*
* Before Postgres 8.2, we always used a seven-tape polyphase merge, on the
* grounds that 7 is the "sweet spot" on the tapes-to-passes curve according
@@ -86,6 +96,23 @@
* we preread from a tape, so as to maintain the locality of access described
* above. Nonetheless, with large workMem we can have many tapes.
*
+ * Before Postgres 9.6, we always used a heap for replacement selection when
+ * building runs. However, Knuth does not consider the influence of memory
+ * access on overall performance, which is a crucial consideration on modern
+ * machines; replacement selection is only really of value where a single
+ * run or two runs can be produced, sometimes avoiding a merge step
+ * entirely. Replacement selection makes this likely when tuples are read
+ * in approximately logical order, even if work_mem is only a small fraction
+ * of the requirement for an internal sort, but large main memory sizes
+ * don't benefit from tiny, incremental spills, even with enormous datasets.
+ * If, having maintained a replacement selection priority queue (heap) for
+ * the first run, it transpires that there will be multiple on-tape runs
+ * anyway, we abandon treating memtuples as a heap, and quicksort and write
+ * in memtuples-sized batches. This gives us most of the advantages of
+ * always quicksorting and batch dumping runs, which can perform much better
+ * than heap sorting and incrementally spilling tuples, without giving up on
+ * replacement selection in cases where it remains compelling.
+ *
*
* Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -160,7 +187,10 @@ bool optimize_bounded_sort = true;
* described above. Accordingly, "tuple" is always used in preference to
* datum1 as the authoritative value for pass-by-reference cases.
*
- * While building initial runs, tupindex holds the tuple's run number. During
+ * While building initial runs, tupindex holds the tuple's run number.
+ * Historically, the run number could meaningfully distinguish many runs, but
+ * currently it only meaningfully distinguishes the first run from any other
+ * run, since replacement selection is abandoned after the first run. During
* merge passes, we re-use it to hold the input tape number that each tuple in
* the heap was read from, or to hold the index of the next tuple pre-read
* from the same tape in the case of pre-read entries. tupindex goes unused
@@ -186,6 +216,7 @@ typedef enum
TSS_BUILDRUNS, /* Loading tuples; writing to tape */
TSS_SORTEDINMEM, /* Sort completed entirely in memory */
TSS_SORTEDONTAPE, /* Sort completed, final run is on tape */
+ TSS_MEMTAPEMERGE, /* Performing memory/tape merge on-the-fly */
TSS_FINALMERGE /* Performing final merge on-the-fly */
} TupSortStatus;
@@ -280,6 +311,13 @@ struct Tuplesortstate
bool growmemtuples; /* memtuples' growth still underway? */
/*
+ * While building initial runs, this indicates if the replacement
+ * selection strategy or simple hybrid sort-merge strategy is in use.
+ * Replacement selection is abandoned after first run.
+ */
+ bool replaceActive;
+
+ /*
* While building initial runs, this is the current output run number
* (starting at 0). Afterwards, it is the number of initial runs we made.
*/
@@ -327,12 +365,22 @@ struct Tuplesortstate
int activeTapes; /* # of active input tapes in merge pass */
/*
+ * These variables are used after tuplesort_performsort() for the
+ * TSS_MEMTAPEMERGE case. This is a special, optimized final on-the-fly
+ * merge pass involving merging the result tape with memtuples that were
+ * quicksorted (but never made it out to a tape).
+ */
+ SortTuple tape_cache; /* cached tape tuple from prior call */
+ bool cached; /* tape_cache holds pending tape tuple */
+ bool just_memtuples; /* merge only fetching from memtuples */
+
+ /*
* These variables are used after completion of sorting to keep track of
* the next tuple to return. (In the tape case, the tape's current read
* position is also critical state.)
*/
int result_tape; /* actual tape number of finished output */
- int current; /* array index (only used if SORTEDINMEM) */
+ int current; /* memtuples array index */
bool eof_reached; /* reached EOF (needed for cursors) */
/* markpos_xxx holds marked position for mark and restore */
@@ -464,12 +512,15 @@ static void inittapes(Tuplesortstate *state);
static void selectnewtape(Tuplesortstate *state);
static void mergeruns(Tuplesortstate *state);
static void mergeonerun(Tuplesortstate *state);
+static void mergememruns(Tuplesortstate *state);
static void beginmerge(Tuplesortstate *state);
static void mergepreread(Tuplesortstate *state);
static void mergeprereadone(Tuplesortstate *state, int srcTape);
static void dumptuples(Tuplesortstate *state, bool alltuples);
+static void dumpbatch(Tuplesortstate *state, bool alltuples);
static void make_bounded_heap(Tuplesortstate *state);
static void sort_bounded_heap(Tuplesortstate *state);
+static void tuplesort_quicksort(Tuplesortstate *state);
static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
int tupleindex, bool checkIndex);
static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
@@ -1486,22 +1537,61 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
/*
* Insert the tuple into the heap, with run number currentRun if
- * it can go into the current run, else run number currentRun+1.
- * The tuple can go into the current run if it is >= the first
- * not-yet-output tuple. (Actually, it could go into the current
- * run if it is >= the most recently output tuple ... but that
- * would require keeping around the tuple we last output, and it's
- * simplest to let writetup free each tuple as soon as it's
- * written.)
+ * it can go into the current run, else run number INT_MAX (some
+ * later run). The tuple can go into the current run if it is
+ * >= the first not-yet-output tuple. (Actually, it could go
+ * into the current run if it is >= the most recently output
+ * tuple ... but that would require keeping around the tuple we
+ * last output, and it's simplest to let writetup free each
+ * tuple as soon as it's written.)
*
- * Note there will always be at least one tuple in the heap at
- * this point; see dumptuples.
+ * Note that this only applies if the currentRun is 0 (prior to
+ * giving up on heapification). There is no meaningful
+ * distinction between any two runs in memory except the first
+ * and second run. When the currentRun is not 0, there is no
+ * guarantee that any tuples are already stored in memory here,
+ * and if there are any they're in no significant order.
*/
- Assert(state->memtupcount > 0);
- if (COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
+ Assert(!state->replaceActive || state->memtupcount > 0);
+ if (state->replaceActive &&
+ COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
+ {
+ /*
+ * Unlike classic replacement selection, which this module was
+ * previously based on, only run 0 is treated as a priority
+ * queue through heapification. The second run (run 1) is
+ * appended indifferently below, and will never be trusted to
+ * maintain the heap invariant beyond simply not getting in
+ * the way of spilling run 0 incrementally. In other words,
+ * second run tuples may be sifted out of the way of first
+ * run tuples; COMPARETUP() will never be called for run
+ * 1 tuples. However, not even HEAPCOMPARE() will be
+ * called for a subsequent run's tuples.
+ */
tuplesort_heap_insert(state, tuple, state->currentRun, true);
+ }
else
- tuplesort_heap_insert(state, tuple, state->currentRun + 1, true);
+ {
+ /*
+ * Note that unlike Knuth, we do not care about the second
+ * run's tuples when loading runs. After the first run is
+ * complete, tuples will not be dumped incrementally at all,
+ * but as long as the first run (run 0) is current it will
+ * be maintained. dumptuples does not trust that the second
+ * or subsequent runs are heapified (beyond merely not
+ * getting in the way of the first, fully heapified run,
+ * which only matters for the second run, run 1). Anything
+ * past the first run will be quicksorted.
+ *
+ * Past the first run, there is no need to differentiate runs
+ * in memory (only the first and second runs will ever be
+ * usefully differentiated). Use a generic INT_MAX run
+ * number (just to be tidy). There should always be room to
+ * store the incoming tuple.
+ */
+ tuple->tupindex = INT_MAX;
+ state->memtuples[state->memtupcount++] = *tuple;
+ }
/*
* If we are over the memory limit, dump tuples till we're under.
@@ -1576,20 +1666,9 @@ tuplesort_performsort(Tuplesortstate *state)
/*
* We were able to accumulate all the tuples within the allowed
- * amount of memory. Just qsort 'em and we're done.
+ * amount of memory. Just quicksort 'em and we're done.
*/
- if (state->memtupcount > 1)
- {
- /* Can we use the single-key sort function? */
- if (state->onlyKey != NULL)
- qsort_ssup(state->memtuples, state->memtupcount,
- state->onlyKey);
- else
- qsort_tuple(state->memtuples,
- state->memtupcount,
- state->comparetup,
- state);
- }
+ tuplesort_quicksort(state);
state->current = 0;
state->eof_reached = false;
state->markpos_offset = 0;
@@ -1616,12 +1695,26 @@ tuplesort_performsort(Tuplesortstate *state)
/*
* Finish tape-based sort. First, flush all tuples remaining in
- * memory out to tape; then merge until we have a single remaining
- * run (or, if !randomAccess, one run per tape). Note that
- * mergeruns sets the correct state->status.
+ * memory out to tape where that's required (when more than one
+ * run's tuples made it to tape, or when the caller required
+ * random access). Then, either merge until we have a single
+ * remaining run on tape, or merge runs in memory by sorting
+ * them into one single in-memory run. Note that
+ * mergeruns/mergememruns sets the correct state->status.
*/
- dumptuples(state, true);
- mergeruns(state);
+ if (state->currentRun > 0 || state->randomAccess)
+ {
+ dumptuples(state, true);
+ mergeruns(state);
+ }
+ else
+ {
+ /*
+ * Only possible for !randomAccess callers, just as with
+ * tape based on-the-fly merge
+ */
+ mergememruns(state);
+ }
state->eof_reached = false;
state->markpos_block = 0L;
state->markpos_offset = 0;
@@ -1640,6 +1733,9 @@ tuplesort_performsort(Tuplesortstate *state)
elog(LOG, "performsort done (except %d-way final merge): %s",
state->activeTapes,
pg_rusage_show(&state->ru_start));
+ else if (state->status == TSS_MEMTAPEMERGE)
+ elog(LOG, "performsort done (except memory/tape final merge): %s",
+ pg_rusage_show(&state->ru_start));
else
elog(LOG, "performsort done: %s",
pg_rusage_show(&state->ru_start));
@@ -1791,6 +1887,118 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
READTUP(state, stup, state->result_tape, tuplen);
return true;
+ case TSS_MEMTAPEMERGE:
+ Assert(forward);
+ /* For now, assume tuple returned from memory */
+ *should_free = false;
+
+ /*
+ * Should be at least one memtuple (work_mem should be roughly
+ * fully consumed)
+ */
+ Assert(state->memtupcount > 0);
+
+ if (state->eof_reached)
+ return false;
+
+ if (state->just_memtuples)
+ goto just_memtuples;
+
+ /*
+ * Merge together quicksorted memtuples array, and sorted tape.
+ *
+ * When this optimization was initially applied, the array was
+ * heapified. Some number of tuples were spilled to disk from the
+ * top of the heap irregularly, and are read from tape here in
+ * fully sorted order. memtuples usually originally contains 2
+ * runs, though, so we merge it with the on-tape run.
+ * (Quicksorting effectively merged the 2 in-memory runs into one
+ * in-memory run already)
+ *
+ * Exhaust the supply of tape tuples first.
+ *
+ * "stup" is always initially set to the current tape tuple if
+ * any remain, which may be cached from previous call, or read
+ * from tape when nothing cached.
+ */
+ if (state->cached)
+ *stup = state->tape_cache;
+ else if ((tuplen = getlen(state, state->result_tape, true)) != 0)
+ READTUP(state, stup, state->result_tape, tuplen);
+ else
+ {
+ /* Supply of tape tuples was just exhausted */
+ state->just_memtuples = true;
+ goto just_memtuples;
+ }
+
+ /*
+ * Kludge: Trigger abbreviated tie-breaker if in-memory tuples
+ * use abbreviation (writing tuples to tape never preserves
+ * abbreviated keys). Do this by assigning in-memory
+ * abbreviated tuple to tape tuple directly.
+ *
+ * It doesn't seem worth generating a new abbreviated key for
+ * the tape tuple, and this approach is simpler than
+ * "unabbreviating" the memtuple tuple from a "common" routine
+ * like this.
+ *
+ * In the future, this routine could offer an API that allows
+ * certain clients (like ordered set aggregate callers) to
+ * cheaply test *inequality* across adjacent pairs of sorted
+ * tuples on the basis of simple abbreviated key binary
+ * inequality. Another advantage of this approach is that it
+ * can still work without reporting to clients that abbreviation
+ * wasn't used. The tape tuples might only be a small minority
+ * of all tuples returned.
+ */
+ if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
+ stup->datum1 = state->memtuples[state->current].datum1;
+
+ /*
+ * Compare current tape tuple to current memtuple.
+ *
+ * Since we always start with at least one memtuple, and since tape
+ * tuples are always returned before equal memtuples, it follows
+ * that there must be at least one memtuple left to return here.
+ */
+ Assert(state->current < state->memtupcount);
+
+ if (COMPARETUP(state, stup, &state->memtuples[state->current]) <= 0)
+ {
+ /*
+ * Tape tuple less than or equal to memtuple array current
+ * position. Return it.
+ */
+ state->cached = false;
+ /* Caller can free tape tuple memory */
+ *should_free = true;
+ }
+ else
+ {
+ /*
+ * Tape tuple greater than memtuple array's current tuple.
+ *
+ * Return current memtuple tuple, and cache tape tuple for
+ * next call. It will be returned on next or subsequent
+ * call.
+ */
+ state->tape_cache = *stup;
+ state->cached = true;
+ *stup = state->memtuples[state->current++];
+ }
+ return true;
+
+just_memtuples:
+ /* Just return memtuples -- merging done */
+ if (state->current < state->memtupcount)
+ {
+ *stup = state->memtuples[state->current++];
+ return true;
+ }
+ state->eof_reached = true;
+ return false;
+
case TSS_FINALMERGE:
Assert(forward);
*should_free = true;
@@ -2000,6 +2208,7 @@ tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples, bool forward)
return false;
case TSS_SORTEDONTAPE:
+ case TSS_MEMTAPEMERGE:
case TSS_FINALMERGE:
/*
@@ -2129,6 +2338,15 @@ inittapes(Tuplesortstate *state)
state->tp_tapenum = (int *) palloc0(maxTapes * sizeof(int));
/*
+ * Give replacement selection a try. There will be a switch to a simple
+ * hybrid sort-merge strategy after the first run (iff there is to be a
+ * second on-tape run).
+ */
+ state->replaceActive = true;
+ state->cached = false;
+ state->just_memtuples = false;
+
+ /*
* Convert the unsorted contents of memtuples[] into a heap. Each tuple is
* marked as belonging to run number zero.
*
@@ -2225,6 +2443,14 @@ mergeruns(Tuplesortstate *state)
* volume is between 1X and 2X workMem), we can just use that tape as the
* finished output, rather than doing a useless merge. (This obvious
* optimization is not in Knuth's algorithm.)
+ *
+ * This should almost be dead code in practice because replacement
+ * selection will not be allowed to continue by tuplesort_performsort().
+ * It is perhaps still possible that certain edge-cases will leave the
+ * final dumptuples() call as a no-op, resulting in only one run
+ * remaining on tape, allowing this optimization to still be used.
+ * Before the role of replacement selection was significantly diminished
+ * to allow quicksorting of runs, this code was more useful.
*/
if (state->currentRun == 1)
{
@@ -2426,6 +2652,68 @@ mergeonerun(Tuplesortstate *state)
}
/*
+ * mergememruns -- merge runs in memory into a new in-memory run.
+ *
+ * This allows tuplesort to avoid dumping many tuples in the common case
+ * where work_mem is less than 2x the amount required for an internal sort
+ * ("quicksort with spillover"). This optimization does not appear in
+ * Knuth's algorithm.
+ *
+ * Merging here actually means quicksorting, without regard to the run
+ * number of each memtuple. Note that this in-memory merge is distinct from
+ * the final on-the-fly merge step that follows. This routine merges what
+ * was originally the tail of the first run with what was originally the
+ * entire second run in advance of the on-the-fly merge step (sometimes,
+ * there will only be one run in memory, but sorting is still required).
+ * The final on-the-fly merge occurs between the new all-in-memory run
+ * created by this routine, and what was originally the first part of the
+ * first run (and is now simply the first run), which is already sorted on
+ * tape.
+ *
+ * The fact that the memtuples array has already been heapified (within the
+ * first run) is no reason to commit to the path of unnecessarily dumping
+ * and heapsorting input tuples. Often, memtuples will be much larger than
+ * the final on-tape run, which is where this optimization is most
+ * effective.
+ */
+static void
+mergememruns(Tuplesortstate *state)
+{
+ Assert(state->replaceActive);
+ Assert(!state->randomAccess);
+
+ /*
+ * It doesn't seem worth being more clever in the cases where there is
+ * no second run pending (i.e. no tuple that didn't belong to the
+ * original/first "currentRun") just to avoid a memory/tape final
+ * on-the-fly merge step, although that might be possible with care.
+ */
+ markrunend(state, state->currentRun);
+
+ /*
+ * The usual path for quicksorting runs (quicksort just before dumping
+ * all tuples) was avoided by caller, so quicksort to merge.
+ *
+ * Note that this may use abbreviated keys, which are no longer
+ * available for the tuples that spilled to tape. This is something
+ * that the final on-the-fly merge step accounts for.
+ */
+ tuplesort_quicksort(state);
+ state->current = 0;
+
+#ifdef TRACE_SORT
+ if (trace_sort)
+ elog(LOG, "finished quicksort of %d tuples to create single in-memory run: %s",
+ state->memtupcount, pg_rusage_show(&state->ru_start));
+#endif
+
+ state->result_tape = state->tp_tapenum[state->destTape];
+ /* Must freeze and rewind the finished output tape */
+ LogicalTapeFreeze(state->tapeset, state->result_tape);
+ state->status = TSS_MEMTAPEMERGE;
+}
+
+/*
* beginmerge - initialize for a merge pass
*
* We decrease the counts of real and dummy runs for each tape, and mark
@@ -2604,21 +2892,23 @@ mergeprereadone(Tuplesortstate *state, int srcTape)
}
/*
- * dumptuples - remove tuples from heap and write to tape
+ * dumptuples - remove tuples from memtuples and write to tape
*
* This is used during initial-run building, but not during merging.
*
- * When alltuples = false, dump only enough tuples to get under the
- * availMem limit (and leave at least one tuple in the heap in any case,
- * since puttuple assumes it always has a tuple to compare to). We also
- * insist there be at least one free slot in the memtuples[] array.
+ * When alltuples = false and replacement selection is still active, dump
+ * only enough tuples to get under the availMem limit (and leave at least
+ * one tuple in memtuples, since puttuple will then assume it is a heap that
+ * has a tuple to compare to). We also insist there be at least one free
+ * slot in the memtuples[] array.
*
- * When alltuples = true, dump everything currently in memory.
+ * When alltuples = true, always batch dump everything currently in memory.
* (This case is only used at end of input data.)
*
- * If we empty the heap, close out the current run and return (this should
- * only happen at end of input data). If we see that the tuple run number
- * at the top of the heap has changed, start a new run.
+ * If, when replacement selection is active, we see that the tuple run
+ * number at the top of the heap has changed, start a new run. This must be
+ * the first run, because replacement selection is subsequently abandoned
+ * for all further runs.
*/
static void
dumptuples(Tuplesortstate *state, bool alltuples)
@@ -2627,21 +2917,39 @@ dumptuples(Tuplesortstate *state, bool alltuples)
(LACKMEM(state) && state->memtupcount > 1) ||
state->memtupcount >= state->memtupsize)
{
+ if (state->replaceActive && !alltuples)
+ {
+ /*
+ * Still holding out for a case favorable to replacement selection;
+ * perhaps there will be a single run in the event of almost sorted
+ * input, or perhaps work_mem hasn't been exceeded by too much, and
+ * a "quicksort with spillover" remains possible.
+ *
+ * Dump the heap's frontmost entry, and sift up to remove it from
+ * the heap.
+ */
+ Assert(state->memtupcount > 1);
+ WRITETUP(state, state->tp_tapenum[state->destTape],
+ &state->memtuples[0]);
+ tuplesort_heap_siftup(state, true);
+ }
+ else
+ {
+ /*
+ * Once committed to quicksorting runs, never incrementally
+ * spill
+ */
+ dumpbatch(state, alltuples);
+ break;
+ }
+
/*
- * Dump the heap's frontmost entry, and sift up to remove it from the
- * heap.
+ * If top run number has changed, we've finished the current run
+ * (this can only be the first run, run 0), and will no longer spill
+ * incrementally.
*/
Assert(state->memtupcount > 0);
- WRITETUP(state, state->tp_tapenum[state->destTape],
- &state->memtuples[0]);
- tuplesort_heap_siftup(state, true);
-
- /*
- * If the heap is empty *or* top run number has changed, we've
- * finished the current run.
- */
- if (state->memtupcount == 0 ||
- state->currentRun != state->memtuples[0].tupindex)
+ if (state->memtuples[0].tupindex != 0)
{
markrunend(state, state->tp_tapenum[state->destTape]);
state->currentRun++;
@@ -2650,24 +2958,94 @@ dumptuples(Tuplesortstate *state, bool alltuples)
#ifdef TRACE_SORT
if (trace_sort)
- elog(LOG, "finished writing%s run %d to tape %d: %s",
- (state->memtupcount == 0) ? " final" : "",
+ elog(LOG, "finished writing heapsorted run %d to tape %d: %s",
state->currentRun, state->destTape,
pg_rusage_show(&state->ru_start));
#endif
/*
- * Done if heap is empty, else prepare for new run.
+ * Heap cannot be empty, so prepare for new run and give up on
+ * replacement selection.
*/
- if (state->memtupcount == 0)
- break;
- Assert(state->currentRun == state->memtuples[0].tupindex);
selectnewtape(state);
+ /* All future runs will only use dumpbatch/quicksort */
+ state->replaceActive = false;
}
}
}
/*
+ * dumpbatch - sort and dump all memtuples, forming one run on tape
+ *
+ * Unlike classic replacement selection sort, second or subsequent runs are
+ * never heapified by this module (although heapification still respects run
+ * number differences between the first and second runs). This helper
+ * handles the case where replacement selection is abandoned, and all tuples
+ * are quicksorted and dumped in memtuples-sized batches. This alternative
+ * strategy is a simple hybrid sort-merge strategy, with quicksorting of
+ * memtuples-sized runs.
+ *
+ * In rare cases, this routine may add to an on-tape run already storing
+ * tuples.
+ */
+static void
+dumpbatch(Tuplesortstate *state, bool alltuples)
+{
+ int memtupwrite;
+ int i;
+
+ Assert(state->status == TSS_BUILDRUNS);
+
+ /* Final call might be unnecessary */
+ if (state->memtupcount == 0)
+ {
+ Assert(alltuples);
+ return;
+ }
+ state->currentRun++;
+
+#ifdef TRACE_SORT
+ if (trace_sort)
+ elog(LOG, "starting quicksort of run %d: %s",
+ state->currentRun, pg_rusage_show(&state->ru_start));
+#endif
+
+ tuplesort_quicksort(state);
+
+#ifdef TRACE_SORT
+ if (trace_sort)
+ elog(LOG, "finished quicksorting run %d: %s",
+ state->currentRun, pg_rusage_show(&state->ru_start));
+#endif
+
+ /*
+ * This should be adapted to perform asynchronous I/O one day, as
+ * dumping in batch represents a good opportunity to overlap I/O
+ * and computation.
+ */
+ memtupwrite = state->memtupcount;
+ for (i = 0; i < memtupwrite; i++)
+ {
+ WRITETUP(state, state->tp_tapenum[state->destTape],
+ &state->memtuples[i]);
+ state->memtupcount--;
+ }
+ markrunend(state, state->tp_tapenum[state->destTape]);
+ state->tp_runs[state->destTape]++;
+ state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
+
+#ifdef TRACE_SORT
+ if (trace_sort)
+ elog(LOG, "finished writing run %d to tape %d: %s",
+ state->currentRun, state->destTape,
+ pg_rusage_show(&state->ru_start));
+#endif
+
+ if (!alltuples)
+ selectnewtape(state);
+}
+
+/*
* tuplesort_rescan - rewind and replay the scan
*/
void
@@ -2777,7 +3155,8 @@ void
tuplesort_get_stats(Tuplesortstate *state,
const char **sortMethod,
const char **spaceType,
- long *spaceUsed)
+ long *spaceUsed,
+ int *rowsSortedMem)
{
/*
* Note: it might seem we should provide both memory and disk usage for a
@@ -2806,15 +3185,23 @@ tuplesort_get_stats(Tuplesortstate *state,
*sortMethod = "top-N heapsort";
else
*sortMethod = "quicksort";
+ *rowsSortedMem = state->memtupcount;
break;
case TSS_SORTEDONTAPE:
*sortMethod = "external sort";
+ *rowsSortedMem = 0;
+ break;
+ case TSS_MEMTAPEMERGE:
+ *sortMethod = "quicksort with spillover";
+ *rowsSortedMem = state->memtupcount;
break;
case TSS_FINALMERGE:
*sortMethod = "external merge";
+ *rowsSortedMem = 0;
break;
default:
*sortMethod = "still in progress";
+ *rowsSortedMem = -1;
break;
}
}
@@ -2825,10 +3212,19 @@ tuplesort_get_stats(Tuplesortstate *state,
*
* Compare two SortTuples. If checkIndex is true, use the tuple index
* as the front of the sort key; otherwise, no.
+ *
+ * Note that for checkIndex callers, the heap invariant is never maintained
+ * beyond the first run, and so there are no COMPARETUP() calls beyond the
+ * first run. It is assumed that checkIndex callers are maintaining the
+ * heap invariant for a replacement selection priority queue, but those
+ * callers do not go on to trust the heap to be fully-heapified past the
+ * first run. Once currentRun isn't the first, memtuples is no longer a
+ * heap at all.
*/
#define HEAPCOMPARE(tup1,tup2) \
- (checkIndex && ((tup1)->tupindex != (tup2)->tupindex) ? \
+ (checkIndex && ((tup1)->tupindex != (tup2)->tupindex || \
+ (tup1)->tupindex != 0) ? \
((tup1)->tupindex) - ((tup2)->tupindex) : \
COMPARETUP(state, tup1, tup2))
@@ -2927,6 +3323,33 @@ sort_bounded_heap(Tuplesortstate *state)
}
/*
+ * Sort all memtuples using quicksort.
+ *
+ * Quicksort is tuplesort's internal sort algorithm. It is also generally
+ * preferred to replacement selection of runs during external sorts, except
+ * where incrementally spilling may be particularly beneficial. Quicksort
+ * will generally be much faster than replacement selection's heapsort
+ * because modern CPUs are usually bottlenecked on memory access, and
+ * quicksort is a cache-oblivious algorithm.
+ */
+static void
+tuplesort_quicksort(Tuplesortstate *state)
+{
+ if (state->memtupcount > 1)
+ {
+ /* Can we use the single-key sort function? */
+ if (state->onlyKey != NULL)
+ qsort_ssup(state->memtuples, state->memtupcount,
+ state->onlyKey);
+ else
+ qsort_tuple(state->memtuples,
+ state->memtupcount,
+ state->comparetup,
+ state);
+ }
+}
+
+/*
* Insert a new tuple into an empty or existing heap, maintaining the
* heap invariant. Caller is responsible for ensuring there's room.
*
@@ -2954,6 +3377,17 @@ tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
memtuples = state->memtuples;
Assert(state->memtupcount < state->memtupsize);
+ /*
+ * Once incremental heap spilling is abandoned, this routine should not be
+ * called when loading runs. memtuples will be an array of tuples in no
+ * significant order, so calling here is inappropriate. Even when
+ * incremental spilling is still in progress, this routine does not handle
+ * the second run's tuples (those are heapified to a limited extent that
+ * they are appended, and thus kept away from those tuples in the first
+ * run).
+ */
+ Assert(!checkIndex || tupleindex == 0);
+
CHECK_FOR_INTERRUPTS();
/*
@@ -2985,6 +3419,13 @@ tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex)
int i,
n;
+ /*
+ * Once incremental heap spilling is abandoned, this routine should not be
+ * called when loading runs. memtuples will be an array of tuples in no
+ * significant order, so calling here is inappropriate.
+ */
+ Assert(!checkIndex || state->currentRun == 0);
+
if (--state->memtupcount <= 0)
return;
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index de6fc56..3679815 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -109,7 +109,8 @@ extern void tuplesort_end(Tuplesortstate *state);
extern void tuplesort_get_stats(Tuplesortstate *state,
const char **sortMethod,
const char **spaceType,
- long *spaceUsed);
+ long *spaceUsed,
+ int *rowsSortedMem);
extern int tuplesort_merge_order(int64 allowedMem);
--
1.9.1
I will sketch a simple implementation of parallel sorting based on the
patch series that may be workable, and requires relatively little
implementation effort compared to other ideas that were raised at
various times:
Hello,
I have only a very superficial understanding of your work,
so apologies if this is off topic or if this was already discussed...
Have you considered performance for cases where multiple CREATE INDEX commands are running in parallel?
One of our typical use cases is large daily tables (50-300 million rows) with up to 6 index creations
that start simultaneously.
Our servers have 40-60 GB RAM and ca. 12 CPUs, and we set maintenance_work_mem to 1-2 GB for this.
If the CREATE INDEX commands themselves start using parallelism, I guess that we might need to review our workflow...
best regards,
Marc Mamin
On Sun, Sep 6, 2015 at 1:51 AM, Marc Mamin <M.Mamin@intershop.de> wrote:
Have you considered performance for cases where multiple CREATE INDEX commands are running in parallel?
One of our typical use cases is large daily tables (50-300 million rows) with up to 6 index creations
that start simultaneously.
Our servers have 40-60 GB RAM and ca. 12 CPUs, and we set maintenance_work_mem to 1-2 GB for this.
If the CREATE INDEX commands themselves start using parallelism, I guess that we might need to review our workflow...
Not particularly. I imagine that that case would be helped a lot here
(probably more than a simpler case involving only one CREATE INDEX),
because each core would require fewer main memory accesses overall.
Maybe you can test it and let us know how it goes.
--
Peter Geoghegan
On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan <pg@heroku.com> wrote:
I'll start a new thread for this, since my external sorting patch has
now evolved well past the original "quicksort with spillover"
idea...although not quite how I anticipated it would. It seems like
I've reached a good point to get some feedback.
Corey Huinker has once again assisted me with this work, by doing some
benchmarking on an AWS instance of his:
32 cores (c3.8xlarge, I suppose)
MemTotal: 251902912 kB
I believe it had one EBS volume.
This testing included 2 data sets:
* A data set that he happens to have that is representative of his
production use-case. Corey had some complaints about the sort
performance of PostgreSQL, particularly prior to 9.5, and I like to
link any particular performance optimization to an improvement in an
actual production workload, if at all possible.
* A tool that I wrote, that works on top of sortbenchmark.org's
"gensort" [1]http://sortbenchmark.org/FAQ-2015.html -- Peter Geoghegan data generation tool. It seems reasonable to me to drive
this work in part with a benchmark devised by Jim Gray. He did after
all receive a Turing award for this contribution to transaction
processing. I'm certainly a fan of his work. A key practical advantage
of that is that it has reasonable guarantees about determinism, making
these results relatively easy to recreate independently.
The modified "gensort" is available from
https://github.com/petergeoghegan/gensort
The python script postgres_load.py, which performs bulk-loading for
Postgres using COPY FREEZE. It ought to be fairly self-documenting:
$:~/gensort$ ./postgres_load.py --help
usage: postgres_load.py [-h] [-w WORKERS] [-m MILLION] [-s] [-l] [-c]
optional arguments:
-h, --help show this help message and exit
-w WORKERS, --workers WORKERS
Number of gensort workers (default: 4)
-m MILLION, --million MILLION
Generate n million tuples (default: 100)
-s, --skew Skew distribution of output keys (default: False)
-l, --logged Use logged PostgreSQL table (default: False)
-c, --collate Use default collation rather than C collation
(default: False)
For this initial report to the list, I'm going to focus on a case
involving 16 billion non-skewed tuples generated using the gensort
tool. I wanted to see how a sort of a ~1TB table (1017GB as reported
by psql, actually) could be improved, as compared to relatively small
volumes of data (in the multiple gigabyte range) that were so improved
by sorts on my laptop, which has enough memory to avoid blocking on
physical I/O much of the time. How the new approach deals with
hundreds of runs that are actually reasonably sized is also of
interest. This server does have a lot of memory, and many CPU cores.
It was kind of underpowered on I/O, though.
The initial load of 16 billion tuples (with a sortkey that is "C"
locale text) took about 10 hours. My tool supports parallel generation
of COPY format files, but serial performance of that stage isn't
especially fast. Further, in order to support COPY FREEZE, and in
order to ensure perfect determinism, the COPY operations occur
serially in a single transaction that creates the table that we
performed a CREATE INDEX on.
Patch, with 3GB maintenance_work_mem:
...
LOG: performsort done (except 411-way final merge): CPU
1017.95s/17615.74u sec elapsed 23910.99 sec
STATEMENT: create index on sort_test (sortkey );
LOG: external sort ended, 54740802 disk blocks used: CPU
2001.81s/31395.96u sec elapsed 41648.05 sec
STATEMENT: create index on sort_test (sortkey );
So just over 11 hours (11:34:08), then. The initial sorting for 411
runs took 06:38:30.99, as you can see.
Master branch:
...
LOG: finished writing run 202 to tape 201: CPU 1224.68s/31060.15u sec
elapsed 34409.16 sec
LOG: finished writing run 203 to tape 202: CPU 1230.48s/31213.55u sec
elapsed 34580.41 sec
LOG: finished writing run 204 to tape 203: CPU 1236.74s/31366.63u sec
elapsed 34750.28 sec
LOG: performsort starting: CPU 1241.70s/31501.61u sec elapsed 34898.63 sec
LOG: finished writing run 205 to tape 204: CPU 1242.19s/31516.52u sec
elapsed 34914.17 sec
LOG: finished writing final run 206 to tape 205: CPU
1243.23s/31564.23u sec elapsed 34963.03 sec
LOG: performsort done (except 206-way final merge): CPU
1243.86s/31570.58u sec elapsed 34974.08 sec
LOG: external sort ended, 54740731 disk blocks used: CPU
2026.98s/48448.13u sec elapsed 55299.24 sec
CREATE INDEX
Time: 55299315.220 ms
So 15:21:39 for master -- the patch is a big improvement over that, but this was still
disappointing given the huge improvements on relatively small cases.
Finished index was fairly large, which can be seen here by working
back from "total relation size":
postgres=# select pg_size_pretty(pg_total_relation_size('sort_test'));
pg_size_pretty
----------------
1487 GB
(1 row)
I think that this is probably due to the relatively slow I/O on this
server, and because the merge step is more of a bottleneck. As we
increase maintenance_work_mem, we're likely to then suffer from the
lack of explicit asynchronous I/O here. It helps, still, but not
dramatically. With maintenance_work_mem = 30GB, the patch is somewhat
faster (no reason to think that this would help master at all, so that
was untested):
...
LOG: starting quicksort of run 40: CPU 1815.99s/19339.80u sec elapsed
24910.38 sec
LOG: finished quicksorting run 40: CPU 1820.09s/19565.94u sec elapsed
25140.69 sec
LOG: finished writing run 40 to tape 39: CPU 1833.76s/19642.11u sec
elapsed 25234.44 sec
LOG: performsort starting: CPU 1849.46s/19803.28u sec elapsed 25499.98 sec
LOG: starting quicksort of run 41: CPU 1849.46s/19803.28u sec elapsed
25499.98 sec
LOG: finished quicksorting run 41: CPU 1852.37s/20000.73u sec elapsed
25700.43 sec
LOG: finished writing run 41 to tape 40: CPU 1864.89s/20069.09u sec
elapsed 25782.93 sec
LOG: performsort done (except 41-way final merge): CPU
1965.43s/20086.28u sec elapsed 25980.80 sec
LOG: external sort ended, 54740909 disk blocks used: CPU
3270.57s/31595.37u sec elapsed 40376.43 sec
CREATE INDEX
Time: 40383174.977 ms
So that takes 11:13:03 in total -- we only managed to shave about 20
minutes off the total time taken, despite a 10x increase in
maintenance_work_mem. Still, at least it gets moderately better, not
worse, which is certainly what I'd expect from the master branch. 60GB
was half way between 3GB and 30GB in terms of performance, so it
doesn't continue to help, but, again, at least things don't get much
worse.
Thoughts on these results:
* I'd really like to know the role of I/O here. Better, low-overhead
instrumentation is required to see when and how we are I/O bound. I've
been doing much of that on a more-or-less ad hoc basis so far, using
iotop. I'm looking into a way to usefully graph the I/O activity over
many hours, to correlate with the trace_sort output that I'll also
show. I'm open to suggestions on the easiest way of doing that. I
haven't used the "perf" tool for instrumenting I/O at all in the past.
* Parallelism would probably help us here *a lot*.
* As I said, I think we suffer from the lack of asynchronous I/O much
more at this scale. Will need to confirm that theory.
* It seems kind of ill-advised to make run size (which is always in
linear proportion to maintenance_work_mem with this new approach to
sorting) larger, because it probably will hurt writing runs more than
it will help in making merging cheaper (perhaps mostly due to the lack
of asynchronous I/O to hide the latency of writes -- Linux might not
do so well at this scale).
* Maybe adding actual I/O bandwidth is the way to go to get a better
picture. I wouldn't be surprised if we were very bottlenecked on I/O
here. Might be worth using many parallel EBS volumes here, for
example.
[1]: http://sortbenchmark.org/FAQ-2015.html
--
Peter Geoghegan
On Fri, Nov 6, 2015 at 8:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
The machine in question still exists, so if you have questions about it,
or commands you'd like me to run to give you insight into the I/O
capabilities of the machine, let me know. I can't guarantee we'll keep the
machine much longer.
On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan <pg@heroku.com> wrote:
Hi Peter,
Your most recent versions of this patch series (not the ones on the
email I am replying to) give a compiler warning:
tuplesort.c: In function 'mergeruns':
tuplesort.c:2741: warning: unused variable 'memNowUsed'
Multi-pass sorts
---------------------
I believe, in general, that we should consider a multi-pass sort to be
a kind of inherently suspect thing these days, in the same way that
checkpoints occurring 5 seconds apart are: not actually abnormal, but
something that we should regard suspiciously. Can you really not
afford enough work_mem to only do one pass?
I don't think it is really about the cost of RAM. What people can't
afford is spending all of their time personally supervising all the
sorts on the system. It is pretty easy for a transient excursion in
workload to make a server swap itself to death and fall over. Not just
the PostgreSQL server, but the entire OS. Since we can't let that
happen, we have to be defensive about work_mem. Yes, we have far more
RAM than we used to. We also have far more things demanding access to
it at the same time.
I agree we don't want to optimize for low memory, but I don't think we
should throw it under the bus, either. Right now we are effectively
saying the CPU-cache problems with the heap start exceeding the larger
run size benefits at 64kb (the smallest allowed setting for work_mem).
While any number we pick is going to be a guess that won't apply to
all hardware, surely we can come up with a guess better than 64kb.
Like, 8 MB, say. If available memory for the sort is 8MB or smaller
and the predicted size anticipates a multipass merge, then we can use
the heap method rather than the quicksort method. Would a rule like
that complicate things much?
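A minimal sketch of the rule I have in mind (the 8MB cutoff and the
predicted_merge_passes() helper here are only placeholders for
illustration, not anything that exists in the patch):

#include <stdbool.h>
#include <stdint.h>

/* assumed helper: merge passes a quicksort-and-merge strategy would need */
extern int predicted_merge_passes(int64_t input_bytes, int64_t avail_mem);

static bool
use_replacement_selection(int64_t avail_mem, int64_t estimated_input_bytes)
{
	/*
	 * Fall back to the heap (replacement selection) only when memory is
	 * tiny *and* quicksorted work_mem-sized runs would force a multipass
	 * merge anyway.
	 */
	return avail_mem <= 8 * 1024 * 1024 &&
		predicted_merge_passes(estimated_input_bytes, avail_mem) > 1;
}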
It doesn't matter to me personally at the moment, because the smallest
work_mem I run on a production system is 24MB. But if for some reason
I had to increase max_connections, or had to worry about plans with
many more possible concurrent work_mem allocations (like some
partitioning), then I might need to rethink that setting downward.
In theory, the answer could be "yes", but it seems highly unlikely.
Not only is very little memory required to avoid a multi-pass merge
step, but as described above the amount required grows very slowly
relative to linear growth in input. I propose to add a
checkpoint_warning style warning (with a checkpoint_warning style GUC
to control it).
I'm skeptical about a warning for this. I think it is rather unlike
checkpointing, because checkpointing is done in a background process,
which greatly limits its visibility, while sorting is a foreground
thing. I know if my sorts are slow, without having to go look in the
log file. If we do have the warning, shouldn't it use a log-level
that gets sent to the front end where the person running the sort can
see it and locally change work_mem? And if we have a GUC, I think it
should be a dial, not a binary. If I have a sort that takes a 2-way
merge and then a final 29-way merge, I don't think that that is worth
reporting. So maybe warning only when the maximum number of runs on a
tape exceeds 2 (rather than 1, which is the current behavior with the
patch) would be the setting I would want to use, if I were to use it
at all.
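And the dial itself could just be an integer GUC checked once the merge
phase is planned; a very rough fragment, with a made-up GUC name and
variable names, only to show the shape of it:

/* hypothetical GUC: 0 disables the warning entirely */
int		multipass_warning_threshold = 2;

	if (multipass_warning_threshold > 0 &&
		merge_passes > multipass_warning_threshold)
		ereport(WARNING,
				(errmsg("external sort required %d merge passes",
						merge_passes),
				 errhint("Consider increasing work_mem.")));

Since WARNING reaches the client by default, that would also address the
point about the person running the sort being able to see it.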
...
This patch continues to have tuplesort determine run size based on the
availability of work_mem only. It does not entirely fix the problem of
having work_mem sizing impact performance in counter-intuitive ways.
In other words, smaller work_mem sizes can still be faster. It does
make that general situation much better, though, because quicksort is
a cache oblivious algorithm. Smaller work_mem sizes are sometimes a
bit faster, but never dramatically faster.
Yes, that is what I found as well. I think the main reason it is
even a small bit slower at large memory is that writing and
sorting are not finely interleaved, as they are with heap selection.
Once you sit down to qsort 3GB of data, you are not going to write any
more tuples until that qsort is entirely done. I didn't do any
testing beyond 3GB of maintenance_work_mem, but I imagine this could
get more important if people used dozens or hundreds of GB.
One idea would be to stop and write out a just-sorted partition
whenever that partition is contiguous to the already-written portion.
If the qsort is tweaked to recurse preferentially into the left
partition first, this would result in tuples being written out at a
pretty steady pace. If the qsort was unbalanced and the left partition
was always the larger of the two, then that approach would have to be
abandoned at some point. But I think there are already defenses
against that, and at worst you would give up and revert to the
sort-them-all then write-them-all behavior.
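Roughly, the tweak I'm imagining looks like this (partition() and
write_run_tuples() are stand-ins for whatever tuplesort would actually
use; this is just a sketch of the recursion shape, not working code from
the patch):

static void
qsort_spill_left(SortTuple *a, int lo, int hi, int *next_to_write)
{
	while (lo < hi)
	{
		/* any standard partition step; a[p] ends up in its final place */
		int		p = partition(a, lo, hi);

		/* sort the left partition completely before the right one */
		qsort_spill_left(a, lo, p - 1, next_to_write);

		/*
		 * [lo, p] is now fully sorted.  If the already-written prefix of
		 * the array reaches into this range, everything up to p is
		 * contiguous with it and can be written out immediately.
		 */
		if (*next_to_write >= lo)
		{
			write_run_tuples(a, *next_to_write, p);
			*next_to_write = p + 1;
		}

		/* continue iteratively into the right partition */
		lo = p + 1;
	}
}

The caller would start with *next_to_write at 0 and flush whatever tail
remains after the top-level call returns; if the left partitions turn out
to be persistently larger, the early writes simply stop keeping up and
you degrade gracefully toward the sort-them-all then write-them-all
behavior anyway.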
Overall this is very nice. Doing some real world index builds of
short text (~20 bytes ascii) identifiers, I could easily get speed ups
of 40% with your patch if I followed the philosophy of "give it as
much maintenance_work_mem as I can afford". If I fine-tuned the
maintenance_work_mem so that it was optimal for each sort method, then
the speed up was quite a bit less, only 22%. But 22% is still very
worthwhile, and who wants to spend their time fine-tuning the memory
use for every index build?
Cheers,
Jeff
Hi Jeff,
On Wed, Nov 18, 2015 at 10:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
tuplesort.c: In function 'mergeruns':
tuplesort.c:2741: warning: unused variable 'memNowUsed'
That was caused by a last-minute change to the multipass warning
message. I forgot to build at -O2, and missed this.
I believe, in general, that we should consider a multi-pass sort to be
a kind of inherently suspect thing these days, in the same way that
checkpoints occurring 5 seconds apart are: not actually abnormal, but
something that we should regard suspiciously. Can you really not
afford enough work_mem to only do one pass?
I don't think it is really about the cost of RAM. What people can't
afford is spending all of their time personally supervising all the
sorts on the system. It is pretty easy for a transient excursion in
workload to make a server swap itself to death and fall over. Not just
the PostgreSQL server, but the entire OS. Since we can't let that
happen, we have to be defensive about work_mem. Yes, we have far more
RAM than we used to. We also have far more things demanding access to
it at the same time.
I agree with you, but I'm not sure that I've been completely clear on
what I mean. Even as the demand on memory has grown, the competitive
advantage of replacement selection in avoiding a multi-pass merge has
diminished far faster. As a DBA, you should simply not allow a
multi-pass merge to happen -- that's the advice that other systems'
documentation gives.
Avoiding a multi-pass merge was always the appeal of replacement
selection, even in the 1970s, but it will rarely if ever make that
critical difference these days.
As I said, as the memory available for sorting increases linearly, the
input size at which a multi-pass merge phase becomes necessary
increases quadratically with my patch. The advantage of replacement
selection is therefore almost irrelevant. That is why, in general,
interest in replacement selection is far far lower today than it was
in the past.
The poor CPU cache characteristics of the heap (priority queue) are
only half the story about why replacement selection is more or less
obsolete these days.
I agree we don't want to optimize for low memory, but I don't think we
should throw it under the bus, either. Right now we are effectively
saying the CPU-cache problems with the heap start exceeding the larger
run size benefits at 64kB (the smallest allowed setting for work_mem).
While any number we pick is going to be a guess that won't apply to
all hardware, surely we can come up with a guess better than 64kB.
Like, 8 MB, say. If available memory for the sort is 8MB or smaller
and the predicted size anticipates a multipass merge, then we can use
the heap method rather than the quicksort method. Would a rule like
that complicate things much?
I'm already using replacement selection for the first run when it is
predicted by my new ad-hoc cost model that we can get away with a
"quicksort with spillover", avoiding almost all I/O. We only
incrementally spill as many tuples as needed right now, but it would
be pretty easy to not quicksort the remaining tuples, but continue to
incrementally spill everything. So no, it wouldn't be too hard to hang
on to the old behavior sometimes, if it looked worthwhile.
In principle, I have no problem with doing that. Through testing, I
cannot see any actual upside, though. Perhaps I just missed something.
Even 8MB is enough to avoid the multipass merge in the event of a
surprisingly high volume of data (my work laptop is elsewhere, so I
don't have my notes on this in front of me, but I figured out the
crossover point for a couple of cases).
In theory, the answer could be "yes", but it seems highly unlikely.
Not only is very little memory required to avoid a multi-pass merge
step, but as described above the amount required grows very slowly
relative to linear growth in input. I propose to add a
checkpoint_warning style warning (with a checkpoint_warning style GUC
to control it).
I'm skeptical about a warning for this.
Other systems expose this explicitly, and, as I said, say in an
unqualified way that a multi-pass merge should be avoided. Maybe the
warning isn't the right way of communicating that message to the DBA
in detail, but I am confident that it ought to be communicated to the
DBA fairly clearly.
One idea would be to stop and write out a just-sorted partition
whenever that partition is contiguous to the already-written portion.
If the qsort is tweaked to recurse preferentially into the left
partition first, this would result in tuples being written out at a
pretty steady pace. If the qsort was unbalanced and the left partition
was always the larger of the two, then that approach would have to be
abandoned at some point. But I think there are already defenses
against that, and at worst you would give up and revert to the
sort-them-all then write-them-all behavior.
Seems kind of invasive.
Overall this is very nice. Doing some real world index builds of
short text (~20 bytes ascii) identifiers, I could easily get speed ups
of 40% with your patch if I followed the philosophy of "give it as
much maintenance_work_mem as I can afford". If I fine-tuned the
maintenance_work_mem so that it was optimal for each sort method, then
the speed up was quite a bit less, only 22%. But 22% is still very
worthwhile, and who wants to spend their time fine-tuning the memory
use for every index build?
Thanks, but I expected better than that. Was it a collated text
column? The C collation will put the patch in a much better light
(more strcoll() calls are needed with this new approach -- it's still
well worth it, but it is a downside that makes collated text not
especially sympathetic). Just sorting on an integer attribute is also
a good sympathetic case, FWIW.
How much time did the sort take in each case? How many runs? How much
time was spent merging? trace_sort output is very interesting here.
--
Peter Geoghegan
On Wed, Nov 18, 2015 at 11:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
Other systems expose this explicitly, and, as I said, say in an
unqualified way that a multi-pass merge should be avoided. Maybe the
warning isn't the right way of communicating that message to the DBA
in detail, but I am confident that it ought to be communicated to the
DBA fairly clearly.
I'm pretty convinced warnings from DML are a categorically bad idea.
In any OLTP load they're effectively fatal errors since they'll fill
up log files or client output or cause other havoc. Or they'll cause
no problem because nothing is reading them. Neither behaviour is
useful.
Perhaps the right thing to do is report a statistic to pg_stats so
DBAs can see how often sorts are in memory, how often they're on disk,
and how often the on disk sort requires n passes. That would put them
in the same category as "sequential scans" for DBAs that expect the
application to only run index-based OLTP queries for example. The
problem with this is that sorts are not tied to a particular relation
and without something to group on the stat will be pretty hard to act
on.
--
greg
On Wed, Nov 18, 2015 at 6:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
In principle, I have no problem with doing that. Through testing, I
cannot see any actual upside, though. Perhaps I just missed something.
Even 8MB is enough to avoid the multipass merge in the event of a
surprisingly high volume of data (my work laptop is elsewhere, so I
don't have my notes on this in front of me, but I figured out the
crossover point for a couple of cases).
I'd be interested in seeing this analysis in some detail.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 18, 2015 at 5:22 PM, Greg Stark <stark@mit.edu> wrote:
On Wed, Nov 18, 2015 at 11:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
Other systems expose this explicitly, and, as I said, say in an
unqualified way that a multi-pass merge should be avoided. Maybe the
warning isn't the right way of communicating that message to the DBA
in detail, but I am confident that it ought to be communicated to the
DBA fairly clearly.
I'm pretty convinced warnings from DML are a categorically bad idea.
In any OLTP load they're effectively fatal errors since they'll fill
up log files or client output or cause other havoc. Or they'll cause
no problem because nothing is reading them. Neither behaviour is
useful.
To be clear, this is a LOG level message, not a WARNING.
I think that if the DBA ever sees the multipass_warning message, he or
she does not have an OLTP workload. If you experience what might be
considered log spam due to multipass_warning, then the log spam is the
least of your problems. Besides, log_temp_files is a very similar
setting (albeit one that is not enabled by default), so I tend to
doubt that your view that that style of log message is categorically
bad is widely shared. Having said that, I'm not especially attached to
the idea of communicating the concern to the DBA using the mechanism
of a checkpoint_warning-style LOG message (multipass_warning).
Yes, I really do mean it when I say that the DBA is not supposed to
see this message, no matter how much or how little memory or data is
involved. There is no nuance intended here; it isn't sensible to allow
a multi-pass sort, just as it isn't sensible to allow checkpoints
every 5 seconds. Both of those things can be thought of as thrashing.
Perhaps the right thing to do is report a statistic to pg_stats so
DBAs can see how often sorts are in memory, how often they're on disk,
and how often the on disk sort requires n passes.
That might be better than what I came up with, but I hesitate to track
more things using the statistics collector in the absence of a clear
consensus to do so. I'd be more worried about the overhead of what you
suggest than the overhead of a LOG message, seen only in the case of
something that's really not supposed to happen.
--
Peter Geoghegan
On 19 November 2015 at 01:22, Greg Stark <stark@mit.edu> wrote:
Perhaps the right thing to do is report a statistic to pg_stats so
DBAs can see how often sorts are in memory, how often they're on disk,
and how often the on disk sort requires n passes. That would put them
in the same category as "sequential scans" for DBAs that expect the
application to only run index-based OLTP queries for example. The
problem with this is that sorts are not tied to a particular relation
and without something to group on the stat will be pretty hard to act
on.
+1
We don't have a message appear when hash joins go weird, and we
definitely don't want anything like that for sorts either.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 19, 2015 at 6:56 PM, Peter Geoghegan <pg@heroku.com> wrote:
Yes, I really do mean it when I say that the DBA is not supposed to
see this message, no matter how much or how little memory or data is
involved. There is no nuance intended here; it isn't sensible to allow
a multi-pass sort, just as it isn't sensible to allow checkpoints
every 5 seconds. Both of those things can be thought of as thrashing.
Hm. So a bit of back-of-envelope calculation. If we want to
buffer at least 1MB for each run -- I think we currently do more
actually -- and say that a 1GB work_mem ought to be enough to run
reasonably (that's per sort after all and there might be multiple
sorts to say nothing of other users on the system). That means we can
merge about 1,000 runs in the final merge. Each run will be about 2GB
currently but 1GB if we quicksort the runs. So the largest table we
can sort in a single pass is 1-2 TB.
If we go above those limits we have the choice of buffering less per
run or doing a whole second pass through the data. I suspect we would
get more horsepower out of buffering less though I'm not sure where
the break-even point is. Certainly if we did random I/O for every I/O
that's much more expensive than a factor of 2 over sequential I/O. We
could probably do the math based on random_page_cost and
seq_page_cost to calculate the minimum amount of buffering
before it's worth doing an extra pass.
So I think you're kind of right and kind of wrong. The vast majority
of use cases are either sub 1TB or are in work environments designed
specifically for data warehouse queries where a user can obtain much
more memory for their queries. However I think it's within the
intended use cases that Postgres should be able to handle a few
terabytes of data on a moderately sized machine in a shared
environment too.
Our current defaults are particularly bad for this though. If you
initdb a new Postgres database today, create a table of even a few
gigabytes, and try to build an index on it, it takes forever. The last
time I did a test I canceled it after it had run for hours, raised
maintenance_work_mem and built the index in a few minutes. The problem
is that if we just raise those limits then people will use more
resources when they don't need it. If it were safer to have those
limits be much higher then we could make the defaults reflect what
people want when they do bigger jobs rather than just what they want
for normal queries or indexes.
I think that if the DBA ever sees the multipass_warning message, he or she does not have an OLTP workload.
Hm, that's pretty convincing. I guess this isn't the usual sort of
warning due to the time it would take to trigger.
--
greg
On Wed, Nov 18, 2015 at 6:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Nov 18, 2015 at 6:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
In principle, I have no problem with doing that. Through testing, I
cannot see any actual upside, though. Perhaps I just missed something.
Even 8MB is enough to avoid the multipass merge in the event of a
surprisingly high volume of data (my work laptop is elsewhere, so I
don't have my notes on this in front of me, but I figured out the
crossover point for a couple of cases).
I'd be interested in seeing this analysis in some detail.
Sure. Jeff mentioned 8MB as a work_mem setting, so let's examine a
case where that's the work_mem setting, and see experimentally where
the crossover point for a multi-pass sort ends up.
If this table is created:
postgres=# create unlogged table bar as select (random() * 1e9)::int4
idx, 'payload xyz'::text payload from generate_series(1, 10100000) i;
SELECT 10100000
Then, on my system, a work_mem setting of 8MB *just about* avoids
seeing the multipass_warning message with this query:
postgres=# select count(distinct idx) from bar ;
count
------------
10,047,433
(1 row)
A work_mem setting of 235MB is just enough to make the query's sort
fully internal.
Let's see how things change with a higher work_mem setting of 16MB. I
mentioned quadratic growth: Having doubled work_mem, let's *quadruple*
the number of tuples, to see where this leaves a 16MB setting WRT a
multi-pass merge:
postgres=# drop table bar ;
DROP TABLE
postgres=# create unlogged table bar as select (random() * 1e9)::int4
idx, 'payload xyz'::text payload from generate_series(1, 10100000 * 4)
i;
SELECT 40400000
Further experiments show that this is the exact point at which the
16MB work_mem setting similarly narrowly avoids a multi-pass warning.
This should be the dominant consideration, because now a fully
internal sort requires 4X the work_mem of my original 16MB work_mem
example table/query.
The quadratic growth in a simple hybrid sort-merge strategy's ability
to avoid a multi-pass merge phase (growth relative to linear increases
in work_mem) can be demonstrated with simple experiments.
--
Peter Geoghegan
On Thu, Nov 19, 2015 at 8:35 PM, Greg Stark <stark@mit.edu> wrote:
Hm. So a bit of back-of-envelope calculation. If we want to
buffer at least 1MB for each run -- I think we currently do more
actually -- and say that a 1GB work_mem ought to be enough to run
reasonably (that's per sort after all and there might be multiple
sorts to say nothing of other users on the system). That means we can
merge about 1,000 runs in the final merge. Each run will be about 2GB
currently but 1GB if we quicksort the runs. So the largest table we
can sort in a single pass is 1-2 TB.
For the sake of pedantry I fact checked myself. We calculate the
number of tapes based on wanting to buffer 32 blocks plus overhead so
about 256kB. So the actual maximum you can handle with 1GB of sort_mem
without multiple merges is on the order 4-8TB.
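Spelled out with those figures (32 blocks of 8kB per tape, 1GB to play with,
and runs of one to two times work_mem):

#include <stdio.h>

int
main(void)
{
    double work_mem = 1024.0;               /* MB */
    double per_tape = 32 * 8.0 / 1024.0;    /* 32 blocks of 8kB = 0.25 MB */
    double ntapes = work_mem / per_tape;    /* ~4096 */

    printf("~%.0f tapes -> %.0f to %.0f TB mergeable in a single pass\n",
           ntapes,
           ntapes * work_mem / (1024.0 * 1024.0),        /* 1GB quicksorted runs */
           ntapes * 2.0 * work_mem / (1024.0 * 1024.0)); /* 2GB replacement selection runs */
    return 0;
}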
--
greg
On Thu, Nov 19, 2015 at 3:43 PM, Peter Geoghegan <pg@heroku.com> wrote:
I'd be interested in seeing this analysis in some detail.
Sure. Jeff mentioned 8MB as a work_mem setting, so let's examine a
case where that's the work_mem setting, and see experimentally where
the crossover point for a multi-pass sort ends up.
If this table is created:
postgres=# create unlogged table bar as select (random() * 1e9)::int4
idx, 'payload xyz'::text payload from generate_series(1, 10100000) i;
SELECT 10100000
Then, on my system, a work_mem setting of 8MB *just about* avoids
seeing the multipass_warning message with this query:
postgres=# select count(distinct idx) from bar ;
count
------------
10,047,433
(1 row)
A work_mem setting of 235MB is just enough to make the query's sort
fully internal.
Let's see how things change with a higher work_mem setting of 16MB. I
mentioned quadratic growth: Having doubled work_mem, let's *quadruple*
the number of tuples, to see where this leaves a 16MB setting WRT a
multi-pass merge:
postgres=# drop table bar ;
DROP TABLE
postgres=# create unlogged table bar as select (random() * 1e9)::int4
idx, 'payload xyz'::text payload from generate_series(1, 10100000 * 4)
i;
SELECT 40400000
Further experiments show that this is the exact point at which the
16MB work_mem setting similarly narrowly avoids a multi-pass warning.
This should be the dominant consideration, because now a fully
internal sort requires 4X the work_mem of my original 16MB work_mem
example table/query.
The quadratic growth in a simple hybrid sort-merge strategy's ability
to avoid a multi-pass merge phase (growth relative to linear increases
in work_mem) can be demonstrated with simple experiments.
OK, so reversing this analysis, with the default work_mem of 4MB, we'd
need a multi-pass merge for more than 235MB/4 = 58MB of data. That is
very, very far from being a can't-happen scenario, and I would not at
all think it would be acceptable to ignore such a case. Even ignoring
the possibility that someone with work_mem = 8MB will try to sort
235MB of data strikes me as out of the question. Those seem like
entirely reasonable things for users to do. Greg's example of someone
with work_mem = 1GB trying to sort 4TB does not seem like a crazy
thing to me. Yeah, in all of those cases you might think that users
should set work_mem higher, but that doesn't mean that they actually
do. Most systems have to set work_mem very conservatively to make
sure they don't start swapping under heavy load.
I think you need to revisit your assumptions here.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 19, 2015 at 12:35 PM, Greg Stark <stark@mit.edu> wrote:
So I think you're kind of right and kind of wrong. The vast majority
of use cases are either sub 1TB or are in work environments designed
specifically for data warehouse queries where a user can obtain much
more memory for their queries. However I think it's within the
intended use cases that Postgres should be able to handle a few
terabytes of data on a moderately sized machine in a shared
environment too.
Maybe I've made this more complicated than it needs to be. The fact is
that my recent 16MB example is still faster than the master branch
when a multiple pass merge is performed (e.g. when work_mem is 15MB,
or even 12MB). More on that later.
Our current defaults are particularly bad for this though. If you
initdb a new Postgres database today, create a table of even a few
gigabytes, and try to build an index on it, it takes forever. The last
time I did a test I canceled it after it had run for hours, raised
maintenance_work_mem and built the index in a few minutes. The problem
is that if we just raise those limits then people will use more
resources when they don't need it.
I think that the bigger problems are:
* There is a harsh discontinuity in the cost function -- performance
suddenly falls off a cliff when a sort must be performed externally.
* Replacement selection is obsolete. It's very slow on machines from
the last 20 years.
If it were safer to have those
limits be much higher then we could make the defaults reflect what
people want when they do bigger jobs rather than just what they want
for normal queries or indexes.
Or better yet, make it so that it doesn't really matter that much,
even while you're still using the same amount of memory as before.
If you're saying that the whole work_mem model isn't a very good one,
then I happen to agree. It would be very nice to have some fancy
admission control feature, but I'd still appreciate a cost model that
dynamically sets work_mem. Such a model would avoid an excessively high
setting in a case where, say, only about half the memory needed for a
10GB sort is available. You should probably have 5 runs sized 2GB,
rather than 2 runs sized 5GB, even if you can afford the memory for the
latter. It would
still make sense to have very high work_mem settings when you can
dynamically set it so high that the sort does complete internally,
though.
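To illustrate the kind of heuristic being described, here is a toy sketch
only -- the 2GB "sweet spot" is just the figure from the example above, and
nothing here is derived from a real cost model:

#include <stdint.h>
#include <stdio.h>

#define MB ((int64_t) 1024 * 1024)
#define GB (1024 * MB)

static int64_t
choose_run_memory(int64_t work_mem, int64_t estimated_input)
{
    int64_t sweet_spot = 2 * GB;    /* hypothetical per-run sweet spot */

    if (estimated_input <= work_mem)
        return work_mem;            /* fully internal sort: use it all */

    /* External sort: several moderate runs beat a couple of huge ones. */
    return (work_mem > sweet_spot) ? sweet_spot : work_mem;
}

int
main(void)
{
    /* 10GB input with 5GB of memory: build 2GB runs, not 5GB runs. */
    printf("use %ld MB for run building\n",
           (long) (choose_run_memory(5 * GB, 10 * GB) / MB));
    return 0;
}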
I think that if the DBA ever sees the multipass_warning message, he or she does not have an OLTP workload.
Hm, that's pretty convincing. I guess this isn't the usual sort of
warning due to the time it would take to trigger.
I would like more opinions on the multipass_warning message. I can
write a patch that creates a new system view, detailing how sorts were
completed, if there is demand.
--
Peter Geoghegan
On Thu, Nov 19, 2015 at 2:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
OK, so reversing this analysis, with the default work_mem of 4MB, we'd
need a multi-pass merge for more than 235MB/4 = 58MB of data. That is
very, very far from being a can't-happen scenario, and I would not at
all think it would be acceptable to ignore such a case.
I think you need to revisit your assumptions here.
Which assumption? Are we talking about multipass_warning, or my patch
series in general? Obviously those are two very different things. As
I've said, we could address the visibility aspect of this differently.
I'm fine with that.
I'll now talk about my patch series in general -- the actual
consequences of failing to get a single-pass merge phase when the master
branch would have done so.
The latter 16MB work_mem example query/table is still faster with a
12MB work_mem than master, even with multiple passes. Quite a bit
faster, in fact: about 37 seconds on master, to about 24.7 seconds
with the patches (same for higher settings short of 16MB).
Now, that's probably slightly unfair on the master branch, because the
patches still have the benefit of the memory pooling during the merge
phase, which is nothing to do with what we're talking about, and
because my laptop still has plenty of ram.
I should point out that there is no evidence that any case has been
regressed, let alone written off entirely or ignored. I looked. I
probably have not been completely exhaustive, and I'd be willing to
believe there is something that I've missed, but it's still quite
possible that there is no downside to any of this.
--
Peter Geoghegan
On Thu, Nov 19, 2015 at 2:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
The latter 16MB work_mem example query/table is still faster with a
12MB work_mem than master, even with multiple passes. Quite a bit
faster, in fact: about 37 seconds on master, to about 24.7 seconds
with the patches (same for higher settings short of 16MB).
I made the same comparison with work_mem sizes of 2MB and 6MB for
master/patch, and the patch *still* came out ahead, often by over 10%.
This was more than fair, though, because sometimes the final
on-the-fly merge for the master branch started at a point at which
the patch series had already completed its sort. (Of course, I don't
believe that any user would ever be well served with such a low
work_mem setting for these queries -- I'm looking for a bad case,
though).
I guess this is a theoretical downside of my approach, that is more
than made up for elsewhere (even leaving aside the final, unrelated
patch in the series, addressing the merge bottleneck directly). So, to
summarize such downsides (downsides of a hybrid sort-merge strategy as
compared to replacement selection):
* As mentioned just now, the fact that there are more runs -- merging
can be slower (although tuples can be returned earlier, which could
also help with CREATE INDEX). This is more of a problem when random
I/O is expensive, and less of a problem when the OS cache buffers
things nicely.
* One run can be created with replacement selection, where a
hybrid sort-merge strategy needs to create and then merge many runs.
When I started work on this patch, I was pretty sure that case would
be noticeably regressed. I was wrong.
* Abbreviated key comparisons are used less because runs are smaller.
This is why sorts of types like numeric are not especially sympathetic
to the patch. Still, we manage to come out well ahead overall.
You can perhaps show the patch to be almost as slow as the master
branch with a very unsympathetic case involving the union of all three.
I couldn't regress a case with integers with just the first two,
though.
--
Peter Geoghegan
On Fri, Nov 20, 2015 at 12:54 AM, Peter Geoghegan <pg@heroku.com> wrote:
* One run can be created with replacement selection, where a
hybrid sort-merge strategy needs to create and then merge many runs.
When I started work on this patch, I was pretty sure that case would
be noticeably regressed. I was wrong.
Hm. Have you tested a nearly-sorted input set around 1.5x the size of
work_mem? That should produce a single run using the heap to generate
runs but generate two runs if, AIUI, you're just filling work_mem,
running quicksort, dumping that run entirely and starting fresh.
I don't mean to say it's representative but if you're looking for a
worst case...
--
greg
On Thu, Nov 19, 2015 at 5:32 PM, Greg Stark <stark@mit.edu> wrote:
Hm. Have you tested a nearly-sorted input set around 1.5x the size of
work_mem? That should produce a single run using the heap to generate
runs but generate two runs if, AIUI, you're just filling work_mem,
running quicksort, dumping that run entirely and starting fresh.
Yes. Actually, even with a random ordering, on average replacement
selection sort will produce runs twice as long as the patch series.
With nearly ordered input, there is no limit to how long runs can be --
you could definitely have cases where *no* merge step is required. We
just return tuples from one long run. And yet, it isn't worth it in
cases that I tested.
Please don't take my word for it -- try yourself.
--
Peter Geoghegan
On Thu, Nov 19, 2015 at 5:42 PM, Peter Geoghegan <pg@heroku.com> wrote:
I would like more opinions on the multipass_warning message. I can
write a patch that creates a new system view, detailing how sorts were
completed, if there is demand.
I think a warning message is a terrible idea, and a system view is a
needless complication. If the patch is as fast or faster than what we
have now in all cases, then we should adopt it (assuming it's also
correct and well-commented and all that other good stuff). If it's
not, then we need to analyze the cases where it's slower and decide
whether they are significant enough to care about.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 19, 2015 at 5:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
I'll now talk about my patch series in general -- the actual
consequences of failing to get a single-pass merge phase when the master
branch would have done so.
That's what I was asking about. It seemed to me that you were saying
we could ignore those cases, which doesn't seem to me to be true.
The latter 16MB work_mem example query/table is still faster with a
12MB work_mem than master, even with multiple passes. Quite a bit
faster, in fact: about 37 seconds on master, to about 24.7 seconds
with the patches (same for higher settings short of 16MB).
Is this because we save enough by quicksorting rather than heapsorting
to cover the cost of the additional merge phase?
If not, then why is it happening like this?
I should point out that there is no evidence that any case has been
regressed, let alone written off entirely or ignored. I looked. I
probably have not been completely exhaustive, and I'd be willing to
believe there is something that I've missed, but it's still quite
possible that there is no downside to any of this.
If that's so, it's excellent news.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Nov 20, 2015 at 12:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Nov 19, 2015 at 5:42 PM, Peter Geoghegan <pg@heroku.com> wrote:
I would like more opinions on the multipass_warning message. I can
write a patch that creates a new system view, detailing how sorts were
completed, if there is demand.
I think a warning message is a terrible idea, and a system view is a
needless complication. If the patch is as fast or faster than what we
have now in all cases, then we should adopt it (assuming it's also
correct and well-commented and all that other good stuff). If it's
not, then we need to analyze the cases where it's slower and decide
whether they are significant enough to care about.
Maybe I was mistaken to link the idea to this patch, but I think it
(or something involving a view) is a good idea. I linked it to the
patch because the patch makes it slightly more important than before.
--
Peter Geoghegan
On Fri, Nov 20, 2015 at 12:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
That's what I was asking about. It seemed to me that you were saying
we could ignore those cases, which doesn't seem to me to be true.
I've been around for long enough to know that there are very few cases
that can be ignored. :-)
The latter 16MB work_mem example query/table is still faster with a
12MB work_mem than master, even with multiple passes. Quite a bit
faster, in fact: about 37 seconds on master, to about 24.7 seconds
with the patches (same for higher settings short of 16MB).
Is this because we save enough by quicksorting rather than heapsorting
to cover the cost of the additional merge phase?
If not, then why is it happening like this?
I think it's because of caching effects alone, but I am not 100% sure
of that. I concede that it might not be enough to make up for the
additional I/O on some systems or platforms. The fact remains,
however, that the patch was faster on the unsympathetic case I ran on
the machine I had available (which has an SSD), and that I really have
not managed to find a case that is regressed after some effort.
I should point out that there is no evidence that any case has been
regressed, let alone written off entirely or ignored. I looked. I
probably have not been completely exhaustive, and I'd be willing to
believe there is something that I've missed, but it's still quite
possible that there is no downside to any of this.
If that's so, it's excellent news.
As I mentioned up-thread, maybe I shouldn't have brought all the
theoretical justifications for killing replacement selection into the
discussion so early. Those observations on replacement selection
(which are not my own original insights) happen to be what spurred
this work. I spent so much time talking about how irrelevant
multi-pass merging was that people imagined that that was severely
regressed, when it really was not. That just happened to be the way I
came at the problem.
The numbers speak for themselves here. I just want to be clear about
the disadvantages of what I propose, even if it's well worth it
overall in most (all?) cases.
--
Peter Geoghegan
On Fri, Nov 20, 2015 at 2:58 PM, Peter Geoghegan <pg@heroku.com> wrote:
The numbers speak for themselves here. I just want to be clear about
the disadvantages of what I propose, even if it's well worth it
overall in most (all?) cases.
There is a paper called "Critical Evaluation of Existing External
Sorting Methods in the Perspective of Modern Hardware":
http://ceur-ws.org/Vol-1343/paper8.pdf
This paper was not especially influential, and I don't agree with
every detail, or at least I don't think that every recommendation
should be adopted in Postgres. Even so, the paper is the best
summary I have seen so far. It clearly explains why there is plenty to
recommend a simple hybrid sort-merge strategy over replacement
selection, despite the fact that replacement selection is faster when
using 1970s hardware.
--
Peter Geoghegan
On Wed, Nov 18, 2015 at 3:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
Overall this is very nice. Doing some real world index builds of
short text (~20 bytes ascii) identifiers, I could easily get speed ups
of 40% with your patch if I followed the philosophy of "give it as
much maintenance_work_mem as I can afford". If I fine-tuned the
maintenance_work_mem so that it was optimal for each sort method, then
the speed up was quite a bit less, only 22%. But 22% is still very
worthwhile, and who wants to spend their time fine-tuning the memory
use for every index build?
Thanks, but I expected better than that.
It also might have been that you used a "quicksort with spillover".
That still uses a heap to some degree, in order to avoid most I/O, but
with a single backend sorting that can often be slower than the
(greatly overhauled) "external merge" sort method (both of these
algorithms are what you'll see in EXPLAIN ANALYZE, which can be a
little confusing because it isn't clear what the distinction is in
some cases).
You might also very occasionally see an "external sort" (this is also
a description from EXPLAIN ANALYZE), which is generally slower (it's a
case where we were unable to do a final on-the-fly merge, either
because random access is requested by the caller, or because multiple
passes were required -- thankfully this doesn't happen most of the
time).
--
Peter Geoghegan
On 20 November 2015 at 22:58, Peter Geoghegan <pg@heroku.com> wrote:
The numbers speak for themselves here. I just want to be clear about
the disadvantages of what I propose, even if it's well worth it
overall in most (all?) cases.
My feeling is that numbers rarely speak for themselves, without LSD. (Which
numbers?)
How are we doing here? Keen to see this work get committed, so we can move
onto parallel sort. What's the summary?
How about we commit it with a sort_algorithm = 'foo' parameter so we can
compare things before release of 9.6?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 24, 2015 at 3:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
My feeling is that numbers rarely speak for themselves, without LSD. (Which
numbers?)
Guffaw.
How are we doing here? Keen to see this work get committed, so we can move
onto parallel sort. What's the summary?
I showed a test case where a CREATE INDEX sort involving 5 runs and a
merge only took about 18% longer than an equivalent fully internal
sort [1] using over 5 times the memory. That's about 2.5X faster than
the 9.5 performance on the same system with the same amount of memory.
Overall, the best cases I saw were the original "quicksort with
spillover" cases [2]/messages/by-id/CAM3SWZTzLT5Y=VY320NznAyz2z_em3us6x=7rXMEUma9Z9yN6Q@mail.gmail.com. They were just under 4X faster. I care about
that less, though, because that will happen way less often, and won't
help with larger sorts that are even more CPU bound.
There is a theoretical possibility that this is slower on systems
where multiple merge passes are required as a consequence of not
having runs as long as possible (due to not using replacement
selection heap). That will happen very infrequently [3], and is very
probably still worth it.
So, the bottom line is: This patch seems very good, is unlikely to
have any notable downside (no case has been shown to be regressed),
but has yet to receive code review. I am working on a new version with
the first two commits consolidated, and better comments, but that will
have the same code, unless I find bugs or am dissatisfied. It mostly
needs thorough code review, and to a lesser extent some more
performance testing.
Parallel sort is very important. Robert, Amit and I had a call about
this earlier today. We're all in agreement that this should be
extended in that direction, and have a rough idea about how it ought
to fit together with the parallelism primitives. Parallel sort in 9.6
could certainly happen -- that's what I'm aiming for. I haven't really
done preliminary research yet; I'll know more in a little while.
How about we commit it with a sort_algorithm = 'foo' parameter so we can
compare things before release of 9.6?
I had a debug GUC (like the existing one to disable top-N heapsorts)
that disabled "quicksort with spillover". That's almost the opposite
of what you're asking for, though, because that makes us never use a
heap. You're asking for me to write a GUC to always use a heap.
That's not a good way of testing this patch, because it's inconvenient
to consider the need to use a heap beyond the first run (something
that now exists solely for the benefit of "quicksort with spillover";
a heap will often never be used even for the first run). Besides, the
merge optimization is a big though independent part of this, and
doesn't make sense to control with the same GUC.
If I haven't gotten this right, we should not commit the patch. If the
patch isn't superior to the existing approach in virtually every way,
then there is no point in making it possible for end-users to disable
with messy GUCs -- it should be reverted.
[1]: /messages/by-id/CAM3SWZRiHaF7jdf923ZZ2qhDJiErqP5uU_+JPuMvUmeD0z9fFA@mail.gmail.com
     Attachment: /messages/by-id/attachment/39660/quicksort_external_test.txt
[2]: /messages/by-id/CAM3SWZTzLT5Y=VY320NznAyz2z_em3us6x=7rXMEUma9Z9yN6Q@mail.gmail.com
[3]: /messages/by-id/CAM3SWZTX5=nHxPpogPirQsH4cR+BpQS6r7Ktax0HMQiNLf-1qA@mail.gmail.com
--
Peter Geoghegan
On 25 November 2015 at 00:33, Peter Geoghegan <pg@heroku.com> wrote:
Parallel sort is very important. Robert, Amit and I had a call about
this earlier today. We're all in agreement that this should be
extended in that direction, and have a rough idea about how it ought
to fit together with the parallelism primitives. Parallel sort in 9.6
could certainly happen -- that's what I'm aiming for. I haven't really
done preliminary research yet; I'll know more in a little while.
Glad to hear it, I was hoping to see that.
How about we commit it with a sort_algorithm = 'foo' parameter so we can
compare things before release of 9.6?I had a debug GUC (like the existing one to disable top-N heapsorts)
that disabled "quicksort with spillover". That's almost the opposite
of what you're asking for, though, because that makes us never use a
heap. You're asking for me to write a GUC to always use a heap.
I'm asking for a parameter to confirm results from various algorithms, so
we can get many eyeballs to confirm your work across its breadth. This is
similar to the original trace_sort parameter which we used to confirm
earlier sort improvements. I trust it will show this is good and can be
removed prior to release of 9.6.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 24, 2015 at 4:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
I had a debug GUC (like the existing one to disable top-N heapsorts)
that disabled "quicksort with spillover". That's almost the opposite
of what you're asking for, though, because that makes us never use a
heap. You're asking for me to write a GUC to always use a heap.
I'm asking for a parameter to confirm results from various algorithms, so we
can get many eyeballs to confirm your work across its breadth. This is
similar to the original trace_sort parameter which we used to confirm
earlier sort improvements. I trust it will show this is good and can be
removed prior to release of 9.6.
My patch updates trace_sort messages. trace_sort doesn't change the
behavior of anything. The only time we've ever done anything like this
was for Top-N heap sorts.
This is significantly more inconvenient than you think. See the
comments in the new dumpbatch() function.
--
Peter Geoghegan
On Wed, Nov 25, 2015 at 12:33 AM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Nov 24, 2015 at 3:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
My feeling is that numbers rarely speak for themselves, without LSD. (Which
numbers?)
Guffaw.
Actually I kind of agree. What I would like to see is a series of
numbers for increasing sizes of sorts plotted against the same series
for the existing algorithm. Specifically with the sort size varying to
significantly more than the physical memory on the machine. For
example on a 16GB machine sorting data ranging from 1GB to 128GB.
There's a lot more information in a series of numbers than individual
numbers. We'll be able to see whether all our pontificating about the
rates of growth of costs of different algorithms or which costs
dominate at which scales are actually borne out in reality. And see
where the break points are where I/O overtakes memory costs. And it'll
be clearer where to look for problematic cases where the new algorithm
might not dominate the old one.
--
greg
On Tue, Nov 24, 2015 at 5:42 PM, Greg Stark <stark@mit.edu> wrote:
Actually I kind of agree. What I would like to see is a series of
numbers for increasing sizes of sorts plotted against the same series
for the existing algorithm. Specifically with the sort size varying to
significantly more than the physical memory on the machine. For
example on a 16GB machine sorting data ranging from 1GB to 128GB.
There already was a test case involving a 1TB/16 billion tuple sort
[1] (well, a 1TB gensort Postgres table [2]). Granted, I don't have a
large number of similar test cases across a variety of scales, but
there are only so many hours in the day. Disappointingly, the results
at that scale were merely good, not great, but there were probably
various flaws in how representative the hardware used was.
There's a lot more information in a series of numbers than individual
numbers. We'll be able to see whether all our pontificating about the
rates of growth of costs of different algorithms or which costs
dominate at which scales are actually borne out in reality.
You yourself said that 1GB is sufficient to get a single-pass merge
phase for a sort of about 4TB - 8TB, so I think the discussion of the
growth in costs tells us plenty about what can happen at the high end.
My approach might help less overall, but it certainly won't falter.
See the 1TB test case -- output from trace_sort is all there.
And see
where the break points are where I/O overtakes memory costs. And it'll
be clearer where to look for problematic cases where the new algorithm
might not dominate the old one.
I/O doesn't really overtake memory cost -- if it does, then it should
be worthwhile to throw more sequential I/O bandwidth at the problem,
which is a realistic, economical solution with a mature implementation
(unlike buying more memory bandwidth). I didn't do that with the 1TB
test case.
If you assume, as cost_sort() does, that it takes N log2(N)
comparisons to sort some tuples, then it breaks down like this:
10 items require 33 comparisons, ratio 3.32192809489
100 items require 664 comparisons, ratio 6.64385618977
1,000 items require 9,965 comparisons, ratio 9.96578428466
1,000,000 items require 19,931,568 comparisons, ratio 19.9315685693
1,000,000,000 items require 29,897,352,853 comparisons, ratio 29.897352854
16,000,000,000 items require 542,357,645,663 comparisons, ratio 33.897352854
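(For anyone who wants to double-check that table, this should reproduce the
figures; they are floor()'d to match the truncation above, and it needs -lm.)

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double sizes[] = {10, 100, 1000, 1e6, 1e9, 16e9};

    for (int i = 0; i < 6; i++)
    {
        double n = sizes[i];

        printf("%.0f items require %.0f comparisons, ratio %.9f\n",
               n, floor(n * log2(n)), log2(n));
    }
    return 0;
}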
The cost of writing out and reading runs should be more or less in
linear proportion to their size, which is a totally different story.
That's the main reason why "quicksort with spillover" is aimed at
relatively small sorts, which we expect more of overall.
I think the big issue is that a non-parallel sort is significantly
under-powered when you go to sort 16 billion tuples. It's probably not
very sensible to do so if you have a choice of parallelizing the sort.
There is no plausible way to do replacement selection in parallel,
since you cannot know ahead of time with any accuracy where to
partition workers, as runs can end up arbitrarily larger than memory
with presorted inputs. That might be the single best argument for what
I propose to do here.
This is what Corey's case showed for the final run with 30GB
maintenance_work_mem:
LOG: starting quicksort of run 40: CPU 1815.99s/19339.80u sec elapsed
24910.38 sec
LOG: finished quicksorting run 40: CPU 1820.09s/19565.94u sec elapsed
25140.69 sec
LOG: finished writing run 40 to tape 39: CPU 1833.76s/19642.11u sec
elapsed 25234.44 sec
(Note that the time taken to copy tuples comprising the final run is
not displayed or accounted for)
This is the second last run, run 40, so it uses the full 30GB of
maintenance_work_mem. We spent 00:01:33.75 writing the run. However,
we spent 00:03:50.31 just sorting the run. That's roughly the same
ratio that I see on my laptop with far smaller runs. I think the
difference isn't wider because the server is quite I/O bound -- but we
could fix that by adding more disks.
[1]: /messages/by-id/CAM3SWZQtdd=Q+EF1xSZaYG1CiOYQJ7sZFcL08GYqChpJtGnKMg@mail.gmail.com
[2]: https://github.com/petergeoghegan/gensort
--
Peter Geoghegan
On Tue, Nov 24, 2015 at 6:31 PM, Peter Geoghegan <pg@heroku.com> wrote:
(Note that the time taken to copy tuples comprising the final run is
not displayed or accounted for)
I mean, comprising the second last run, the run shown, run 40.
--
Peter Geoghegan
On Wed, Nov 25, 2015 at 2:31 AM, Peter Geoghegan <pg@heroku.com> wrote:
There already was a test case involving a 1TB/16 billion tuple sort
[1] (well, a 1TB gensort Postgres table [2]). Granted, I don't have a
large number of similar test cases across a variety of scales, but
there are only so many hours in the day. Disappointingly, the results
at that scale were merely good, not great, but there were probably
various flaws in how representative the hardware used was.
That's precisely why it's valuable to see a whole series of data
points rather than just one. Often when you see the shape of the
curve, especially any breaks or changes in the behaviour that helps
understand the limitations of the model. Perhaps it would be handy to
find a machine with a very small amount of physical memory so you
could run more reasonably sized tests on it. A VM would be fine if you
could be sure the storage layer isn't caching.
In short, I think you're right in theory and I want to make sure
you're right in practice. I'm afraid if we just look at a few data
points we'll miss out on a bug or a factor we didn't anticipate that
could have been addressed.
Just to double check though. My understanding is that your quicksort
algorithm is to fill work_mem with tuples, quicksort them, write out a
run, and repeat. When the inputs are done, read work_mem/runs worth of
tuples from each run into memory and run a merge (using a heap?) like
we do currently. Is that right?
Incidentally one of the reasons abandoning the heap to generate runs
is attractive is that it opens up other sorting algorithms for us.
Instead of quicksort we might be able to plug in a GPU sort for
example.
--
greg
On Wed, Nov 25, 2015 at 4:10 AM, Greg Stark <stark@mit.edu> wrote:
That's precisely why it's valuable to see a whole series of data
points rather than just one. Often when you see the shape of the
curve, especially any breaks or changes in the behaviour that helps
understand the limitations of the model. Perhaps it would be handy to
find a machine with a very small amount of physical memory so you
could run more reasonably sized tests on it. A VM would be fine if you
could be sure the storage layer isn't caching.
I have access to the Power7 system that Robert and others sometimes
use for this stuff. I'll try to come up a variety of tests.
In short, I think you're right in theory and I want to make sure
you're right in practice. I'm afraid if we just look at a few data
points we'll miss out on a bug or a factor we didn't anticipate that
could have been addressed.
I am in favor of being comprehensive.
Just to double check though. My understanding is that your quicksort
algorithm is to fill work_mem with tuples, quicksort them, write out a
run, and repeat. When the inputs are done read work_mem/runs worth of
tuples from each run into memory and run a merge (using a heap?) like
we do currently. Is that right?
Yes, that's basically what I'm doing.
There are basically two extra bits:
* Without changing how merging actually works, I am clever about
allocating memory for the final on-the-fly merge. Allocation is done
once, in one huge batch. Importantly, I exploit locality by having
every "tuple proper" (e.g. IndexTuple) in contiguous memory, in sorted
(tape) order, per tape. This also greatly reduces palloc() overhead
for the final on-the-fly merge step.
* We do something special when we're just over work_mem, to avoid most
I/O -- "quicksort with spillover". This is a nice trick, but it's
certainly way less important than the basic idea of simply always
quicksorting runs. I could easily not do this. This is why the heap
code was not significantly simplified to only cover the merge cases,
though -- this uses essentially the same replacement selection style
heap to incrementally spill to get us enough memory to mostly complete
the sort internally.
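To make the overall shape concrete, here is a toy, self-contained version of
the scheme: fill a memory budget, quicksort it, dump it as a run, repeat,
then merge the runs. It deliberately ignores everything interesting (tapes,
the batch memory trick, the spillover case, the merge heap), so it shows the
control flow only, not what tuplesort.c actually does:

#include <stdio.h>
#include <stdlib.h>

#define BUDGET   4              /* "work_mem": tuples per run */
#define MAXRUNS  16

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;

    return (x > y) - (x < y);
}

int
main(void)
{
    int input[] = {9, 3, 7, 1, 8, 2, 6, 0, 5, 4, 11, 10};
    int ninput = (int) (sizeof(input) / sizeof(input[0]));
    int *runs[MAXRUNS];
    int runlen[MAXRUNS], runpos[MAXRUNS];
    int nruns = 0;

    /* Run building: fill the budget, quicksort it, dump it as a run. */
    for (int i = 0; i < ninput; i += BUDGET)
    {
        int len = (ninput - i < BUDGET) ? ninput - i : BUDGET;

        runs[nruns] = malloc(len * sizeof(int));
        for (int j = 0; j < len; j++)
            runs[nruns][j] = input[i + j];
        qsort(runs[nruns], len, sizeof(int), cmp_int);
        runlen[nruns] = len;
        runpos[nruns] = 0;
        nruns++;
    }

    /* Final merge: repeatedly take the smallest head among all the runs. */
    printf("merged output:");
    for (;;)
    {
        int best = -1;

        for (int r = 0; r < nruns; r++)
            if (runpos[r] < runlen[r] &&
                (best < 0 || runs[r][runpos[r]] < runs[best][runpos[best]]))
                best = r;
        if (best < 0)
            break;                  /* every run is exhausted */
        printf(" %d", runs[best][runpos[best]++]);
    }
    printf("\n");

    for (int r = 0; r < nruns; r++)
        free(runs[r]);
    return 0;
}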
Incidentally one of the reasons abandoning the heap to generate runs
is attractive is that it opens up other sorting algorithms for us.
Instead of quicksort we might be able to plug in a GPU sort for
example.
Yes, it's true that we automatically benefit from optimizations for
the internal sort case now. That's already happening with the patch,
actually -- the "onlyKey" optimization (a more specialized quicksort
routine, used in the single-attribute heap tuple case and the datum case)
is now automatically used. That was where the best 2012 numbers for
SortSupport were seen, so that makes a significant difference. As you
say, something like that could easily happen again.
--
Peter Geoghegan
On Wed, Nov 18, 2015 at 3:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Wed, Nov 18, 2015 at 10:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
I agree we don't want to optimize for low memory, but I don't think we
should throw it under the bus, either. Right now we are effectively
saying the CPU-cache problems with the heap start exceeding the larger
run size benefits at 64kB (the smallest allowed setting for work_mem).
While any number we pick is going to be a guess that won't apply to
all hardware, surely we can come up with a guess better than 64kB.
Like, 8 MB, say. If available memory for the sort is 8MB or smaller
and the predicted size anticipates a multipass merge, then we can use
the heap method rather than the quicksort method. Would a rule like
that complicate things much?
I'm already using replacement selection for the first run when it is
predicted by my new ad-hoc cost model that we can get away with a
"quicksort with spillover", avoiding almost all I/O. We only
incrementally spill as many tuples as needed right now, but it would
be pretty easy to not quicksort the remaining tuples, but continue to
incrementally spill everything. So no, it wouldn't be too hard to hang
on to the old behavior sometimes, if it looked worthwhile.
In principle, I have no problem with doing that. Through testing, I
cannot see any actual upside, though. Perhaps I just missed something.
Even 8MB is enough to avoid the multipass merge in the event of a
surprisingly high volume of data (my work laptop is elsewhere, so I
don't have my notes on this in front of me, but I figured out the
crossover point for a couple of cases).
For me very large sorts (100,000,000 ints) with work_mem below 4MB do
better with unpatched than with your patch series, by about 5%. Not a
big deal, but also if it is easy to keep the old behavior then I think
we should. Yes, it is dumb to do large sorts with work_mem below 4MB,
but if you have canned apps which do a mixture of workloads it is not
so easy to micromanage their work_mem. Especially as there are no
easy tools that let me as the DBA say "if you connect from this IP
address, you get this work_mem".
I didn't collect trace_sort on those ones because of the high volume
it would generate.
In theory, the answer could be "yes", but it seems highly unlikely.
Not only is very little memory required to avoid a multi-pass merge
step, but as described above the amount required grows very slowly
relative to linear growth in input. I propose to add a
checkpoint_warning style warning (with a checkpoint_warning style GUC
to control it).
I'm skeptical about a warning for this.
Other systems expose this explicitly, and, as I said, say in an
unqualified way that a multi-pass merge should be avoided. Maybe the
warning isn't the right way of communicating that message to the DBA
in detail, but I am confident that it ought to be communicated to the
DBA fairly clearly.
I'm thinking about how many other places in the code could justify a
similar type of warning "If you just gave me 15% more memory, this
hash join would be much faster", and what that would make the logs
look like if future work went along with this precedent. If there
were some mechanism to put the warning in a system view counter
instead of the log file, that would be much cleaner. Or a way to
separate the server log file into streams. But since we don't have
those, I guess I can't really object much to the proposed behavior.
One idea would be to stop and write out a just-sorted partition
whenever that partition is contiguous to the already-written portion.
If the qsort is tweaked to recurse preferentially into the left
partition first, this would result in tuples being written out at a
pretty steady pace. If the qsort was unbalanced and the left partition
was always the larger of the two, then that approach would have to be
abandoned at some point. But I think there are already defenses
against that, and at worst you would give up and revert to the
sort-them-all then write-them-all behavior.
Seems kind of invasive.
I agree, but I wonder if it won't become much more important at 30GB
of work_mem. Of course if there is no reason to ever set work_mem
that high, then it wouldn't matter--but there is always a reason to do
so, if you have so much memory to spare. So better than that invasive
work, I guess would be to make sort use less than work_mem if it gets
no benefit from using all of it. Anyway, ideas for future work,
either way.
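For what it's worth, a rough sketch of that incremental write-out idea in
throwaway Python (emit() just stands in for writing a tuple out to tape; a
real version would batch writes and bail out if the recursion became badly
unbalanced, as noted above):

def quicksort_emit_left(a, lo, hi, state, emit):
    # Sort a[lo:hi] in place, recursing into the left partition first and
    # streaming out any prefix that has become contiguous with the portion
    # already written (state[0] is the index of the next element to write).
    if hi - lo <= 1:
        if hi - lo == 1 and state[0] == lo:
            emit(a[lo])
            state[0] = lo + 1
        return
    pivot = a[hi - 1]                  # simple Lomuto partition, for brevity
    i = lo
    for j in range(lo, hi - 1):
        if a[j] <= pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi - 1] = a[hi - 1], a[i]

    quicksort_emit_left(a, lo, i, state, emit)   # smallest keys finish first
    if state[0] == i:                            # left side fully written: pivot is final
        emit(a[i])
        state[0] = i + 1
    quicksort_emit_left(a, i + 1, hi, state, emit)

def sort_and_stream(a, emit):
    state = [0]
    quicksort_emit_left(a, 0, len(a), state, emit)
    for k in range(state[0], len(a)):            # fallback flush, if we ever gave up early
        emit(a[k])

sort_and_stream([5, 3, 8, 1, 9, 2], print)       # prints 1, 2, 3, 5, 8, 9, one per line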
Overall this is very nice. Doing some real world index builds of
short text (~20 bytes ascii) identifiers, I could easily get speed ups
of 40% with your patch if I followed the philosophy of "give it as
much maintenance_work_mem as I can afford". If I fine-tuned the
maintenance_work_mem so that it was optimal for each sort method, then
the speed up was quite a bit less, only 22%. But 22% is still very
worthwhile, and who wants to spend their time fine-tuning the memory
use for every index build?
Thanks, but I expected better than that. Was it a collated text
column? The C collation will put the patch in a much better light
(more strcoll() calls are needed with this new approach -- it's still
well worth it, but it is a downside that makes collated text not
especially sympathetic). Just sorting on an integer attribute is also
a good sympathetic case, FWIW.
It was UTF8 encoded (although all characters were actually ASCII), but
C collated.
I've never seen improvements of 3 fold or more like you saw, under any
conditions, so I wonder if your test machine doesn't have unusually
slow main memory.
How much time did the sort take in each case? How many runs? How much
time was spent merging? trace_sort output is very interesting here.
My largest test, which took my true table and extrapolated it out for
a few years growth, had about 500,000,000 rows.
At 3GB maintenance_work_mem, it took 13 runs patched and 7 runs
unpatched to build the index, with timings of 3168.66 sec and 5713.07
sec.
The final merging is intermixed with whatever other work goes on to
build the actual index files out of the sorted data, so I don't know
exactly what the timing of just the merge part was. But it was
certainly a minority of the time, even if you assume the actual index
build were free. For the patched code, the majority of the time goes
to the quick sorting stages.
When I test each version of the code at its own most efficient
maintenance_work_mem, I get
3007.2 seconds at 1GB for patched and 3836.46 seconds at 64MB for unpatched.
I'm attaching the trace_sort output from the client log for all 4 of
those scenarios. "sort_0005" means all 5 of your patches were
applied, "origin" means none of them were.
Cheers,
Jeff
Attachments:
On Thu, Nov 19, 2015 at 12:35 PM, Greg Stark <stark@mit.edu> wrote:
On Thu, Nov 19, 2015 at 6:56 PM, Peter Geoghegan <pg@heroku.com> wrote:
Yes, I really do mean it when I say that the DBA is not supposed to
see this message, no matter how much or how little memory or data is
involved. There is no nuance intended here; it isn't sensible to allow
a multi-pass sort, just as it isn't sensible to allow checkpoints
every 5 seconds. Both of those things can be thought of as thrashing.
Hm. So a bit of back-of-envelope calculation. If we want to
buffer at least 1MB for each run -- I think we currently do more
actually -- and say that a 1GB work_mem ought to be enough to run
reasonably (that's per sort after all and there might be multiple
sorts to say nothing of other users on the system). That means we can
merge about 1,000 runs in the final merge. Each run will be about 2GB
currently but 1GB if we quicksort the runs. So the largest table we
can sort in a single pass is 1-2 TB.
If we go above those limits we have the choice of buffering less per
run or doing a whole second pass through the data.
If we only go slightly above the limits, it is much more graceful. It
will happily do a 3 way merge followed by a 1023 way final merge (or
something like that) so only 0.3 percent of the data needs a second
pass, not all of it. Of course by the time you get a factor of 2 over
the limit, you are making an entire second pass one way or another.
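Putting rough numbers on the last two paragraphs, using the same
back-of-envelope assumptions (1MB of buffer per run, 1GB of work_mem,
quicksorted runs of about work_mem size -- none of these are the actual
tuplesort.c figures):

MB = 1024 * 1024
GB = 1024 * MB
TB = 1024 * GB

work_mem = 1 * GB
buffer_per_run = 1 * MB
run_size = 1 * GB               # quicksorted runs ~= work_mem; ~2x that with replacement selection

fan_in = work_mem // buffer_per_run            # ~1,000-way final merge
print(fan_in, (fan_in * run_size) // TB)       # single-pass limit: ~1TB

# Slightly exceeding the limit degrades gracefully: with R runs and fan-in F,
# one small intermediate merge of (R - F + 1) runs brings us back under F.
R, F = 1026, 1024
extra = R - F + 1                              # a 3-way merge...
print(extra, round(100.0 * extra / R, 2))      # ...so only ~0.3% of the data is read twice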
Cheers,
Jeff
On Sat, Nov 28, 2015 at 2:04 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
For me very large sorts (100,000,000 ints) with work_mem below 4MB do
better with unpatched than with your patch series, by about 5%. Not a
big deal, but also if it is easy to keep the old behavior then I think
we should. Yes, it is dumb to do large sorts with work_mem below 4MB,
but if you have canned apps which do a mixture of workloads it is not
so easy to micromanage their work_mem. Especially as there are no
easy tools that let me as the DBA say "if you connect from this IP
address, you get this work_mem".
I'm not very concerned about a regression that is only seen when
work_mem is set below the (very conservative) postgresql.conf default
value of 4MB when sorting 100 million integers. Thank you for
characterizing the regression, though -- it's good to have a better
idea of how much of a problem that is in practice.
I can still preserve the old behavior with a GUC, but it isn't
completely trivial, and I don't want to complicate things any further
without a real benefit, which I still don't see. I'm still using a
replacement selection style heap, and I think that there will be
future uses for the heap (e.g. dynamic duplicate removal within
tuplesort), though.
Other systems expose this explicitly, and, as I said, say in an
unqualified way that a multi-pass merge should be avoided. Maybe the
warning isn't the right way of communicating that message to the DBA
in detail, but I am confident that it ought to be communicated to the
DBA fairly clearly.
I was thinking about how many other places in the code could justify a
similar type of warning "If you just gave me 15% more memory, this
hash join would be much faster", and what that would make the logs
look like if future work followed this precedent. If there
were some mechanism to put the warning in a system view counter
instead of the log file, that would be much cleaner. Or a way to
separate the server log file into streams. But since we don't have
those, I guess I can't really object much to the proposed behavior.
I'm going to let this go, actually. Not because I don't think that
avoiding a multi-pass sort is a good goal for DBAs to have, but
because a multi-pass sort doesn't appear to be a point at which
performance tanks these days, with modern block devices. Also, I just
don't have time to push something non-essential that there is
resistance to.
One idea would be to stop and write out a just-sorted partition
whenever that partition is contiguous to the already-written portion.
If the qsort is tweaked to recurse preferentially into the left
partition first, this would result in tuples being written out at a
pretty steady pace. If the qsort was unbalanced and the left partition
was always the larger of the two, then that approach would have to be
abandoned at some point. But I think there are already defenses
against that, and at worst you would give up and revert to the
sort-them-all then write-them-all behavior.
Seems kind of invasive.
I agree, but I wonder if it won't become much more important at 30GB
of work_mem. Of course if there is no reason to ever set work_mem
that high, then it wouldn't matter--but there is always a reason to do
so, if you have so much memory to spare. So better than that invasive
work, I guess would be to make sort use less than work_mem if it gets
no benefit from using all of it. Anyway, ideas for future work,
either way.
I hope to come up with a fairly robust model for automatically sizing
an "effective work_mem" in the context of external sorts. There should
be a heuristic that balances fan-in against other considerations. I
think that doing this with the existing external sort code would be
completely hopeless. This is a problem that is well understood by the
research community, although balancing things well in the context of
PostgreSQL is a little trickier.
I also think it's a little arbitrary that the final on-the-fly merge
step uses a work_mem-ish sized buffer, much like the sorting of runs,
as if there is a good reason to be consistent. Maybe that's fine,
though.
There are advantages to returning tuples earlier in the context of
parallelism, which recommends smaller effective work_mem sizes
(provided they're above a certain threshold). For this reason, having
larger runs may not be a useful goal in general, even without
considering the cost in cache misses paid in pursuit of that goal.
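As a toy version of what such a heuristic might look like -- the 1MB
per-tape buffer and the 8MB floor here are placeholders of mine, not
anything the patch implements:

import math

MB = 1024 * 1024

def effective_work_mem(estimated_input, work_mem, buffer_per_tape=1 * MB, floor=8 * MB):
    # With quicksorted runs of ~M bytes and a merge fan-in of ~M/buffer,
    # a single merge pass covers roughly M*M/buffer bytes of input; invert
    # that to find the smallest M that still avoids a second pass.
    need = int(math.sqrt(estimated_input * buffer_per_tape))
    return min(work_mem, max(floor, need))

# e.g. ~100GB of input stays single-pass with only ~320MB, so there is no
# need to burn a 4GB maintenance_work_mem setting on it:
print(effective_work_mem(100 * 1024 * MB, 4 * 1024 * MB) // MB, "MB")

A real model would obviously also have to weigh fan-in against the
parallelism considerations above.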
Thanks, but I expected better than that. Was it a collated text
column? The C collation will put the patch in a much better light
(more strcoll() calls are needed with this new approach -- it's still
well worth it, but it is a downside that makes collated text not
especially sympathetic). Just sorting on an integer attribute is also
a good sympathetic case, FWIW.
It was UTF8 encoded (although all characters were actually ASCII), but
C collated.
I think that I should have considered that you'd hand-optimized the
work_mem setting for each case in reacting here -- I was at a
conference when I responded. You can show the existing code in a
better light by doing that, as you have, but I think it's all but
irrelevant. It isn't even practical for experts to do that, so the
fact that it is possible is only really a footnote. My choice of
work_mem for my tests tended to be round numbers, like 1GB, because
that was the first thing I thought of.
I've never seen improvements of 3 fold or more like you saw, under any
conditions, so I wonder if your test machine doesn't have unusually
slow main memory.
I think that there is a far simpler explanation. Any time I reported a
figure over ~2.5x, it was for "quicksort with spillover", and with a
temp tablespace on tmpfs to simulate lots of I/O bandwidth (but with
hardly any actual writing to tape -- that's the whole point of that
case). I also think that the heap structure does very badly with low
cardinality sets, which is where the 3.25X - 4X numbers came from. You
haven't tested "quicksort with spillover" here at all, which is fine,
since it is less important. Finally, as I said, I did not give the
master branch the benefit of fine-tuning work_mem (which I think is
fair and representative).
My largest test, which took my true table and extrapolated it out for
a few years growth, had about 500,000,000 rows.
Cool.
At 3GB maintenance_work_mem, it took 13 runs patched and 7 runs
unpatched to build the index, with timings of 3168.66 sec and 5713.07
sec.
The final merging is intermixed with whatever other work goes on to
build the actual index files out of the sorted data, so I don't know
exactly what the timing of just the merge part was. But it was
certainly a minority of the time, even if you assume the actual index
build were free. For the patched code, the majority of the time goes
to the quick sorting stages.
I'm not sure what you mean here. I agree that the work of (say)
inserting leaf tuples as part of an index build is kind of the same
cost as the merge step itself, or doesn't vary markedly between the
CREATE INDEX case and other cases (where there is some analogous
processing of final sorted output).
I would generally expect that the merge phase takes significantly less time
than sorting runs, regardless of how we sort runs, unless parallelism
is involved, where merging could dominate. The master branch has a
faster merge step, at least proportionally, because it has larger
runs.
When I test each version of the code at its own most efficient
maintenance_work_mem, I get
3007.2 seconds at 1GB for patched and 3836.46 seconds at 64MB for unpatched.
As I said, it seems a little bit unfair to hand-tune work_mem or
maintenance_work_mem like that. Who can afford to do that? I think you
agree that it's untenable to have DBAs allocate work_mem differently
for cases where an internal sort or external sort is expected;
workloads are just far too complicated and changeable.
I'm attaching the trace_sort output from the client log for all 4 of
those scenarios. "sort_0005" means all 5 of your patches were
applied, "origin" means none of them were.
Thanks for looking at this. This is very helpful. It looks like the
server you used here had fairly decent disks, and that we tended to be
CPU bound more often than not. That's a useful testing ground.
Consider run #7 (of 13 total) with 3GB maintenance_work_mem, for
example (this run was picked at random):
...
LOG: finished writing run 6 to tape 5: CPU 35.13s/1028.44u sec
elapsed 1080.43 sec
LOG: starting quicksort of run 7: CPU 38.15s/1051.68u sec elapsed 1108.19 sec
LOG: finished quicksorting run 7: CPU 38.16s/1228.09u sec elapsed 1284.87 sec
LOG: finished writing run 7 to tape 6: CPU 40.21s/1235.36u sec
elapsed 1295.19 sec
LOG: starting quicksort of run 8: CPU 42.73s/1257.59u sec elapsed 1321.09 sec
...
So there was 27.76 seconds spent copying tuples into local memory
ahead of the quicksort, 2 minutes 56.68 seconds spent actually
quicksorting, and a trifling 10.32 seconds actually writing the run! I
bet that the quicksort really didn't use up too much memory bandwidth
on the system as a whole, since abbreviated keys are used with a cache
oblivious internal sorting algorithm.
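Those figures are just deltas between the elapsed timestamps in the quoted
trace_sort lines:

finished_writing_run_6 = 1080.43   # "finished writing run 6 to tape 5"
started_quicksort_run_7 = 1108.19  # copying into local memory ends here
finished_quicksort_run_7 = 1284.87
finished_writing_run_7 = 1295.19

print(round(started_quicksort_run_7 - finished_writing_run_6, 2))    # 27.76s copying
print(round(finished_quicksort_run_7 - started_quicksort_run_7, 2))  # 176.68s quicksorting
print(round(finished_writing_run_7 - finished_quicksort_run_7, 2))   # 10.32s writing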
This suggests that this case would benefit rather a lot from parallel
workers doing this for each run at the same time (once my code is
adapted to do that, of course). This is something I'm currently
researching. I think that (roughly speaking) each core on this system
is likely slower than the cores on a 4-core consumer desktop/laptop,
which is very normal, particularly with x86_64 systems. That also
makes it more representative than my previous tests.
--
Peter Geoghegan
On Sat, Nov 28, 2015 at 4:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
So there was 27.76 seconds spent copying tuples into local memory
ahead of the quicksort, 2 minutes 56.68 seconds spent actually
quicksorting, and a trifling 10.32 seconds actually writing the run! I
bet that the quicksort really didn't use up too much memory bandwidth
on the system as a whole, since abbreviated keys are used with a cache
oblivious internal sorting algorithm.
Uh, actually, that isn't so:
LOG: begin index sort: unique = f, workMem = 1048576, randomAccess = f
LOG: bttext_abbrev: abbrev_distinct after 160: 1.000489
(key_distinct: 40.802210, norm_abbrev_card: 0.006253, prop_card:
0.200000)
LOG: bttext_abbrev: aborted abbreviation at 160 (abbrev_distinct:
1.000489, key_distinct: 40.802210, prop_card: 0.200000)
Abbreviation is aborted in all cases that you tested. Arguably this
should happen significantly less frequently with the "C" locale,
possibly almost never, but it makes this case less than representative
of most people's workloads. I think that at least the first several
hundred leading attribute tuples are duplicates.
BTW, roughly what does this CREATE INDEX look like? Is it a composite
index, for example?
It would also be nice to see pg_stats entries for each column being
indexed. Data distributions are certainly of interest here.
Thanks
--
Peter Geoghegan
On Sun, Nov 29, 2015 at 2:01 AM, Peter Geoghegan <pg@heroku.com> wrote:
I think that at least the first several
hundred leading attribute tuples are duplicates.
I mean duplicate abbreviated keys. There are 40 distinct keys overall
in the first 160 tuples, which is why abbreviation is aborted -- this
can be seen from the trace_sort output, of course.
--
Peter Geoghegan
On Sat, Nov 28, 2015 at 02:04:16PM -0800, Jeff Janes wrote:
On Wed, Nov 18, 2015 at 3:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Wed, Nov 18, 2015 at 10:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
I agree we don't want to optimize for low memory, but I don't think we
should throw it under the bus, either. Right now we are effectively
saying the CPU-cache problems with the heap start exceeding the larger
run size benefits at 64kb (the smallest allowed setting for work_mem).
While any number we pick is going to be a guess that won't apply to
all hardware, surely we can come up with a guess better than 64kb.
Like, 8 MB, say. If available memory for the sort is 8MB or smaller
and the predicted size anticipates a multipass merge, then we can use
the heap method rather than the quicksort method. Would a rule like
that complicate things much?
I'm already using replacement selection for the first run when it is
predicted by my new ad-hoc cost model that we can get away with a
"quicksort with spillover", avoiding almost all I/O. We only
incrementally spill as many tuples as needed right now, but it would
be pretty easy to not quicksort the remaining tuples, but continue to
incrementally spill everything. So no, it wouldn't be too hard to hang
on to the old behavior sometimes, if it looked worthwhile.
In principle, I have no problem with doing that. Through testing, I
cannot see any actual upside, though. Perhaps I just missed something.
Even 8MB is enough to avoid the multipass merge in the event of a
surprisingly high volume of data (my work laptop is elsewhere, so I
don't have my notes on this in front of me, but I figured out the
crossover point for a couple of cases).
For me very large sorts (100,000,000 ints) with work_mem below 4MB do
better with unpatched than with your patch series, by about 5%. Not a
big deal, but also if it is easy to keep the old behavior then I think
we should. Yes, it is dumb to do large sorts with work_mem below 4MB,
but if you have canned apps which do a mixture of workloads it is not
so easy to micromanage their work_mem. Especially as there are no
easy tools that let me as the DBA say "if you connect from this IP
address, you get this work_mem".
That's certainly doable with pgbouncer, for example. What would you
have in mind for the more general capability? It seems to me that
bloating up pg_hba.conf would be undesirable, but maybe I'm picturing
this as bigger than it actually needs to be.
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Sat, Nov 28, 2015 at 4:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Sat, Nov 28, 2015 at 2:04 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
...
The final merging is intermixed with whatever other work goes on to
build the actual index files out of the sorted data, so I don't know
exactly what the timing of just the merge part was. But it was
certainly a minority of the time, even if you assume the actual index
build were free. For the patched code, the majority of the time goes
to the quick sorting stages.
I'm not sure what you mean here.
I had no point to make here; I was just trying to answer one of your
questions about how much time was spent merging. I don't know, because
it is interleaved with and not separately instrumented from the index
build.
I would generally expect that the merge phase takes significantly less time
than sorting runs, regardless of how we sort runs, unless parallelism
is involved, where merging could dominate. The master branch has a
faster merge step, at least proportionally, because it has larger
runs.
When I test each version of the code at its own most efficient
maintenance_work_mem, I get
3007.2 seconds at 1GB for patched and 3836.46 seconds at 64MB for unpatched.
As I said, it seems a little bit unfair to hand-tune work_mem or
maintenance_work_mem like that. Who can afford to do that? I think you
agree that it's untenable to have DBAs allocate work_mem differently
for cases where an internal sort or external sort is expected;
workloads are just far too complicated and changeable.
Right, I agree with all that. But I think it is important to know
where the benefits come from. It looks like about half comes from
being more robust to overly-large memory usage, and half from absolute
improvements which you get at each implementation's own best setting.
Also, if someone had previously restricted work_mem (or more likely
maintenance_work_mem) simply to avoid the large memory penalty, they
need to know to revisit that decision. Although they still don't get
any actual benefit from using too much memory, just a reduced penalty.
I'm kind of curious as to why the optimal for the patched code appears
at 1GB and not lower. If I get a chance to rebuild the test, I will
look into that more.
I'm attaching the trace_sort output from the client log for all 4 of
those scenarios. "sort_0005" means all 5 of your patches were
applied, "origin" means none of them were.Thanks for looking at this. This is very helpful. It looks like the
server you used here had fairly decent disks, and that we tended to be
CPU bound more often than not. That's a useful testing ground.
It has a Perc H710 RAID controller with 15,000 RPM drives, but it is
also a virtualized system that has other stuff going on. The disks
are definitely better than your average household computer, but I
don't think they are anything special as far as real database hardware
goes. It is hard to saturate the disks for sequential reads. It will
be interesting to see what parallel builds can do.
What would be next in reviewing the patches? Digging into the C-level
implementation?
Cheers,
Jeff
On Sun, Nov 29, 2015 at 8:02 PM, David Fetter <david@fetter.org> wrote:
For me very large sorts (100,000,000 ints) with work_mem below 4MB do
better with unpatched than with your patch series, by about 5%. Not a
big deal, but also if it is easy to keep the old behavior then I think
we should. Yes, it is dumb to do large sorts with work_mem below 4MB,
but if you have canned apps which do a mixture of workloads it is not
so easy to micromanage their work_mem. Especially as there are no
easy tools that let me as the DBA say "if you connect from this IP
address, you get this work_mem".
That's certainly doable with pgbouncer, for example.
I had not considered that. How would you do it with pgbouncer? The
thing I can think of would be to put it in server_reset_query, which
doesn't seem correct.
What would you
have in mind for the more general capability? It seems to me that
bloating up pg_hba.conf would be undesirable, but maybe I'm picturing
this as bigger than it actually needs to be.
I would envision something like "ALTER ROLE set ..." only for
application_name and IP address instead of ROLE. I have no idea how I
would implement that; it is just how I would like to use it as the end
user.
Cheers,
Jeff
On Mon, Nov 30, 2015 at 9:51 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
As I said, it seems a little bit unfair to hand-tune work_mem or
maintenance_work_mem like that. Who can afford to do that? I think you
agree that it's untenable to have DBAs allocate work_mem differently
for cases where an internal sort or external sort is expected;
workloads are just far too complicated and changeable.
Right, I agree with all that. But I think it is important to know
where the benefits come from. It looks like about half comes from
being more robust to overly-large memory usage, and half from absolute
improvements which you get at each implementation's own best setting.
Also, if someone had previously restricted work_mem (or more likely
maintenance_work_mem) simply to avoid the large memory penalty, they
need to know to revisit that decision. Although they still don't get
any actual benefit from using too much memory, just a reduced penalty.
Well, to be clear, they do get a benefit with much larger memory
sizes. It's just that the benefit does not continue indefinitely. I
agree with this assessment, though.
I'm kind of curious as to why the optimal for the patched code appears
at 1GB and not lower. If I get a chance to rebuild the test, I will
look into that more.
I think that the availability of abbreviated keys (or something that
allows most comparisons made by quicksort/the heap to be resolved at
the SortTuple level) could make a big difference for things like this.
Bear in mind that the merge phase has better cache characteristics
when many attributes must be compared, and not mostly just leading
attributes. Alphasort [1] merges in-memory runs (built with quicksort)
to create on-disk runs for this reason. (I tried that, and it didn't
help -- maybe I get that benefit from merging on-disk runs, since
modern machines have so much more memory than in 1994).
It has a Perc H710 RAID controller with 15,000 RPM drives, but it is
also a virtualized system that has other stuff going on. The disks
are definitely better than your average household computer, but I
don't think they are anything special as far as real database hardware
goes.
What I meant was that it's better than my laptop. :-)
What would be next in reviewing the patches? Digging into the C-level
implementation?
Yes, certainly, but let me post a revised version first. I have
improved the comments, and performed some consolidation of commits.
Also, I am going to get a bunch of test results from the POWER7
system. I think I might see more benefits with higher
maintenance_work_mem settings than you saw, primarily because my case
can mostly just use abbreviated keys during the quicksort operations.
Also, I find it very very useful that while (for example) your 3GB
test case was slower than your 1GB test case, it was only 5% slower. I
have a lot of hope that we can have a cost model for sizing an
effective maintenance_work_mem for this reason -- the consequences of
being wrong are really not that severe. It's unfortunate that we
currently waste so much memory by blindly adhering to
work_mem/maintenance_work_mem. This matters a lot more when we have
parallel sort.
[1]: http://www.cs.berkeley.edu/~rxin/db-papers/alphasort.pdf
--
Peter Geoghegan
On Mon, Nov 30, 2015 at 12:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
I'm kind of curious as to why the optimal for the patched code appears
at 1GB and not lower. If I get a chance to rebuild the test, I will
look into that more.
I think that the availability of abbreviated keys (or something that
allows most comparisons made by quicksort/the heap to be resolved at
the SortTuple level) could make a big difference for things like this.
Using the Hydra POWER7 server [1] + the gensort benchmark [2], which
uses the C collation, and has abbreviated keys that have lots of
entropy, I see benefits with higher and higher maintenance_work_mem
settings.
I will present a variety of cases, which seemed like something Greg
Stark is particularly interested in. On the whole, I am quite pleased
with how things are shown to be improved in a variety of different
scenarios.
Looking at CREATE INDEX build times on an (unlogged) gensort table
with 50 million, 100 million, 250 million, and 500 million tuples,
with maintenance_work_mem settings of 512MB, 1GB, 10GB, and 15GB,
there are sustained improvements as more memory is made available. I'm
not saying that that would be the case with low cardinality leading
attribute tuples -- probably not -- but it seems pretty nice that this
case can sustain improvements as more memory is made available. The
server used here has reasonably good disks (Robert goes into this in
his blogpost), but nothing spectacular.
This is what a 500 million tuple gensort table looks like:
postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+-----------+-------+-------+-------+-------------
public | sort_test | table | pg | 32 GB |
(1 row)
Results:
50 million tuple table (best of 3):
------------------------------------------
512MB: (8-way final merge) external sort ended, 171058 disk blocks
used: CPU 4.11s/79.30u sec elapsed 83.60 sec
1GB: (4-way final merge) external sort ended, 171063 disk blocks used:
CPU 4.29s/71.34u sec elapsed 75.69 sec
10GB: N/A
15GB: N/A
1GB (master branch): (3-way final merge) external sort ended, 171064
disk blocks used: CPU 6.19s/163.00u sec elapsed 170.84 sec
100 million tuple table (best of 3):
--------------------------------------------
512MB: (16-way final merge) external sort ended, 342114 disk blocks
used: CPU 8.61s/177.77u sec elapsed 187.03 sec
1GB: (8-way final merge) external sort ended, 342124 disk blocks used:
CPU 8.07s/165.15u sec elapsed 173.70 sec
10GB: N/A
15GB: N/A
1GB (master branch): (5-way final merge) external sort ended, 342129
disk blocks used: CPU 11.68s/358.17u sec elapsed 376.41 sec
250 million tuple table (best of 3):
--------------------------------------------
512MB: (39-way final merge) external sort ended, 855284 disk blocks
used: CPU 19.96s/486.57u sec elapsed 507.89 sec
1GB: (20-way final merge) external sort ended, 855306 disk blocks
used: CPU 22.63s/475.33u sec elapsed 499.09 sec
10GB: (2-way final merge) external sort ended, 855326 disk blocks
used: CPU 21.99s/341.34u sec elapsed 366.15 sec
15GB: (2-way final merge) external sort ended, 855326 disk blocks
used: CPU 23.23s/322.18u sec elapsed 346.97 sec
1GB (master branch): (11-way final merge) external sort ended, 855315
disk blocks used: CPU 30.56s/973.00u sec elapsed 1015.63 sec
500 million tuple table (best of 3):
--------------------------------------------
512MB: (77-way final merge) external sort ended, 1710566 disk blocks
used: CPU 45.70s/1016.70u sec elapsed 1069.02 sec
1GB: (39-way final merge) external sort ended, 1710613 disk blocks
used: CPU 44.34s/1013.26u sec elapsed 1067.16 sec
10GB: (4-way final merge) external sort ended, 1710649 disk blocks
used: CPU 46.46s/772.97u sec elapsed 841.35 sec
15GB: (3-way final merge) external sort ended, 1710652 disk blocks
used: CPU 51.55s/729.88u sec elapsed 809.68 sec
1GB (master branch): (20-way final merge) external sort ended, 1710632
disk blocks used: CPU 69.35s/2013.21u sec elapsed 2113.82 sec
I attached a detailed account of these benchmarks, for those that
really want to see the nitty-gritty. This includes a 1GB case for
patch without memory prefetching (which is not described in this
message).
[1]: http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-ibm.html
[2]: https://github.com/petergeoghegan/gensort
--
Peter Geoghegan
Attachments:
Hm. Here is a log-log chart of those results (sorry for html mail). I'm not
really sure if log-log is the right tool to use for an O(n log n) curve
though.
I think the take-away is that this is outside the domain where any
interesting break points occur. Maybe run more tests on the low end to find
where the tapesort can generate a single tape and avoid the merge and see
where the discontinuity is with quicksort for the various work_mem sizes.
And can you calculate an estimate of the domain where multiple
passes would be needed for this table at these work_mem sizes? Is it
feasible to test around there?
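Using the same 1MB-per-run buffering assumption from earlier in the thread
(actual tuplesort buffering and tape limits differ), a back-of-envelope
estimate looks like this:

import math

MB = 1024 * 1024
GB = 1024 * MB

table_bytes = 32 * GB           # the 500 million tuple gensort table above

def multi_pass(work_mem, buffer_per_tape=1 * MB):
    runs = math.ceil(table_bytes / work_mem)        # quicksorted runs of ~work_mem
    fan_in = max(2, work_mem // buffer_per_tape)    # width of the final merge
    return runs > fan_in

for wm_mb in (32, 64, 128, 256, 512, 1024):
    print(wm_mb, "MB:", "multi-pass" if multi_pass(wm_mb * MB) else "single pass")

# crossover ~= sqrt(32GB * 1MB) ~= 180MB, so only settings well below the
# ones tested above would risk a second pass for this table.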
[image: Inline image 1]
--
greg
Attachments:
image.png