Sorting Improvements for 8.4

Started by Simon Riggs (51 messages)
#1Simon Riggs
simon@2ndquadrant.com

Just wanted to review a few thoughts and ideas around improving external
sorts, as recently encouraged to do by Jim Nasby.

Current issues/opportunities are these:

ISSUES

a) Memory is always in short supply, so using what we have more
effectively is going to be welcome.

b) Heap sort has a reasonably strong anti-memory effect, meaning that
there is an optimum amount of memory for any sort. This shows itself
with the CPU time increasing during run forming, making this stage of
the sort CPU bound.

c) Many sorts are performed prior to aggregation. It might be possible
to aggregate prior to writing to disk, as a way of reducing the overall
I/O cost. Benefit would occur when the total CPU cost was the same no matter
when aggregation occurred; that would not apply in all cases, so we
would need to sense when benefit was possible.

d) Generally reducing the I/O cost of sorting may help the merging
stages of a sort.

SOLUTIONS

The ideas that Greg Stark, Jim Nasby, Heikki and myself have discussed
to date were the following:

1. Sort I/O Compression
2. Aggregation during Sort
3. Memory Pools
4. Dynamic Heap Management
5. Dynamic Run Handling

I've added (5) to the list as well, which hasn't yet been discussed.

1. SORT I/O COMPRESSION

This idea is not dead yet; it just needs a full set of tests to confirm
that there is benefit in all cases. If there's not benefit in all cases,
we may be able to work out which cases those are, so we know when to use
it.

2. AGGREGATION DURING SORT

Many sorts are preliminary steps before aggregation. Aggregation during
run forming would potentially reduce the size of the heap and the number
of comparisons. For many types of aggregate this would not theoretically
increase the number of ops, since sum(), avg(), min() and max() are all
commutative with respect to their inputs. We would probably need to add
another option to aggregate functions to indicate that the aggregate can
be calculated in this way, since some aggregates might rely on the
current behaviour of receiving all their inputs at once, in sorted
order. (Windowed aggregates are unlikely to be this way.)
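
As a rough illustration of where the saving would come from, a minimal
sketch (invented names, not backend code) of folding duplicate keys into a
partial aggregate state just before a sorted run is spilled:

#include <stddef.h>

/* Hypothetical partial-aggregate tuple: key plus transition state, enough
 * for sum()/avg()-style aggregates. */
typedef struct
{
    int     key;
    double  sum;
    long    count;
} AggTuple;

/*
 * Fold adjacent duplicates into one partial state.  Assumes the in-memory
 * batch has already been sorted on "key".  Returns the new batch length,
 * which is what actually gets written to the run.
 */
static size_t
combine_before_spill(AggTuple *batch, size_t n)
{
    size_t  out = 0;

    if (n == 0)
        return 0;
    for (size_t i = 1; i < n; i++)
    {
        if (batch[i].key == batch[out].key)
        {
            /* transition step: absorb the duplicate */
            batch[out].sum += batch[i].sum;
            batch[out].count += batch[i].count;
        }
        else
            batch[++out] = batch[i];
    }
    return out + 1;
}

Each spilled run then carries at most one entry per key, so the merge
combines partial states rather than raw rows.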

3. MEMORY POOLS

Solving a) could be done by sensible management and allocation of
resources. Discussed before, so not rehashed here.

4. DYNAMIC HEAP MANAGEMENT

The size of the active heap required to produce the fewest number of
runs varies as the sort progresses. For example, sorting an already
sorted input needs a trivial heap size.

Larger heap sizes simply avoid forming more runs, which is not
necessarily a bad thing. More runs only become bad things when we go
beyond our ability to perform a single final merge (see Dynamic Run
Handling below).

Smaller heap sizes reduce the number of comparisons required and improve
L2+ cache efficiency. Those two things are the cause of the anti-memory
effect.

Because of b), optimising the size of the heap could potentially be a
good thing. This can make a considerable difference for nearly sorted
data (measurements required...).

When we have M amount of memory available to us, we don't start by using
it all. We start with m memory and only increase up to M if required.
Runs are built with memory set at m. If a tuple arrives that would force
the formation of a new run we assess

i) do we care if another run is formed? Use our knowledge of the likely
amount of data coming our way, compared with number of runs formed so
far and see if we really care. If we don't care, allow the new run to be
formed and carry on with just heap size of m. (see Dynamic Run Handling
later).

ii) if we do care about number of runs, then allow the heap to grow by
increments up to the full size of M. Increments would be at least x2 and
possibly x4. That way we always have work space to rearrange the heap.
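
A minimal sketch of that decision point (the names are invented; this is
not tuplesort.c code) might be:

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sort state: m grows towards M only when more runs would hurt. */
typedef struct
{
    size_t  heap_mem;        /* m: memory currently devoted to the heap */
    size_t  max_mem;         /* M: the work_mem ceiling */
    int     runs_so_far;
    int     runs_mergeable;  /* rough "do we care?" proxy: how many runs we
                              * expect to be able to merge in one final pass */
} SortState;

/*
 * Called when the next tuple cannot join the current run.  Returns true if
 * the heap was grown (retry placing the tuple), false if we should simply
 * start another run at the current heap size.
 */
static bool
should_grow_heap(SortState *st)
{
    /* (i) another run is harmless: stay small and keep the CPU benefits */
    if (st->runs_so_far + 1 <= st->runs_mergeable)
        return false;

    /* (ii) we care: grow by at least 2x, capped at M */
    if (st->heap_mem < st->max_mem)
    {
        st->heap_mem = (st->heap_mem * 2 <= st->max_mem)
                       ? st->heap_mem * 2
                       : st->max_mem;
        return true;
    }
    return false;            /* already at M: a new run is unavoidable */
}

The hard part this glosses over is exactly the one noted below: physically
rearranging the heap into the newly granted space.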

All of this dances too cleverly around the exact technique and potential
costs of rearranging the heap. That is not to be ignored and is the next
task in evaluating and accepting/dismissing this potential technique.

In combination with memory pooling this technique might also allow
memory to be better distributed to other users.

5. DYNAMIC RUN HANDLING (in Final Merge)

Another way of addressing a) is to simply make better use of memory
itself. Let's look at that in more detail:

Number of runs that can be merged at once is currently fixed, based upon
available memory. This has the underlying assumption that all runs will
be concurrently active during final merging, which may not always be
true.

If we have random data then almost all runs will overlap with all other
runs, i.e. the min and max values are sufficiently wide that the runs do
all overlap. In many cases, data arrives in somewhat sorted order, e.g.
financial data is fairly regular with some late payers but not many, and
those trail off with a fairly tight decay. In the somewhat sorted case
we find that the actual overlap is less than total, so there are many
later runs that don't overlap the earlier ones. In the best case we
would find run 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4
overlap.

This is also the point where I suggest breaking away from Knuth
completely. All of the main algorithms described by Knuth are tape
sorts. A run is written to a particular tape and then stays there until
"moved" to another tape. That means we have to get super-clever about
how runs should be written and formed (see Knuth). If we realise that
the runs aren't fixed to particular tapes - they are all just independent
runs - we can radically rethink sorting. There is no need to implement
Cascade Sort, but we do need to rethink merging from the ground up. (All
of which is a relief, because Knuth et al are definitely smarter than
me, but I've got disks and lots of memory and those guys had tapes.).

If we track the min and max values for each run, when run building is
finished we will be able to build a merging plan that allows us to be
smart about the runs we should bring together. We start with the run
with the lowest min value, as well as all runs that overlap that run.
When that run is exhausted we move to the next lowest and at that point
start merging all runs that overlap that one.

This then means we may be able to begin final merging with more runs
than the current cut-off. It's possible that we could merge an infinite
number of runs in final merge with fixed memory. If we *do* need to
merge we can work out which runs should be our best pre-merge
candidates, based upon how big they are and which other runs they
overlap. (That's much better than being forced to merge tapes 2, 7 and
17 because some bizarre math says so (see Knuth).)
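
To make the merge plan idea a little more concrete, here is a minimal
sketch (purely illustrative; none of these names exist in the backend) of
the check we could run over the per-run bounds once run building finishes.
It computes the largest number of runs whose key ranges cover any single
point, which is how many runs would have to be open at once; if that fits
within our buffers, a single final merge is enough no matter how many runs
were written:

/* Hypothetical per-run key bounds recorded while each run was written. */
typedef struct
{
    long    min_key;
    long    max_key;
} RunBounds;

/*
 * Returns the largest number of runs whose [min_key, max_key] ranges
 * overlap at any single point.  The maximum always occurs at some run's
 * min_key, so checking those points is sufficient.
 */
static int
max_concurrent_runs(const RunBounds *runs, int nruns)
{
    int     peak = 0;

    for (int i = 0; i < nruns; i++)
    {
        int     open = 0;

        for (int j = 0; j < nruns; j++)
            if (runs[j].min_key <= runs[i].min_key &&
                runs[i].min_key <= runs[j].max_key)
                open++;
        if (open > peak)
            peak = open;
    }
    return peak;
}

If the result is larger than the number of runs we can buffer, the bounds
table also tells us which runs cause the congestion, so pre-merge
candidates can be picked by size and overlap rather than by tape number.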

Anyway, claiming to have found a better way than Knuth makes me feel a
little nervous, so some searching questions on this are very welcome.

Interestingly, if we combine this technique with dynamic heap management
we may be able to allow a very large number of efficiently written runs
to form without it causing any merging.

mac_man recently noted the possibility that some runs don't overlap at
all and so can be merged for free. That's true, though it doesn't actually
change the basic idea here, which is building a merge plan after runs
have been formed, with an eye on minimizing and potentially eliminating
the merge phase.

There are probably some typos or thinkos above, so go easy on me, Greg!
They aren't there because I want to skim over anything.

I'm not likely to get a chance to do all of this in the near future, so
documenting it now should help others to carry things forward.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#2Sam Mason
sam@samason.me.uk
In reply to: Simon Riggs (#1)
Re: Sorting Improvements for 8.4

On Tue, Nov 27, 2007 at 06:03:46PM +0000, Simon Riggs wrote:

Just wanted to review a few thoughts and ideas around improving external
sorts, as recently encouraged to do by Jim Nasby.

Is there any way of PG knowing that having an index on a subset of the
sorted columns is sometimes a win? For example, if we have:

CREATE TABLE foo (
    a INTEGER NOT NULL PRIMARY KEY,
    b INTEGER NOT NULL,
    c INTEGER
);

and we request:

SELECT * FROM foo ORDER BY a,b LIMIT 10;

then it may be a win to do smaller sorts for each value of "a", rather
than one big sort after all the data has been pulled out. Obviously,
it would depend on the distribution of "a", large numbers of distinct
values for "a" being good, and a small number being bad.

I think this would help in a number of other situations as well, but
that's just the most obvious case.

Sam

#3Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#1)
Re: Sorting Improvements for 8.4

On Tue, 2007-11-27 at 18:03 +0000, Simon Riggs wrote:

5. DYNAMIC RUN HANDLING (in Final Merge)

Another way of addressing a) is to simply make better use of memory
itself. Let's look at that in more detail:

Number of runs that can be merged at once is currently fixed, based upon
available memory. This has the underlying assumption that all runs will
be concurrently active during final merging, which may not always be
true.

If we have random data then almost all runs will overlap with all other
runs, i.e. the min and max values are sufficiently wide that the runs do
all overlap. In many cases, data arrives in somewhat sorted order, e.g.
financial data is fairly regular with some late payers but not many, and
those trail off with a fairly tight decay. In the somewhat sorted case
we find that the actual overlap is less than total, so there are many
later runs that don't overlap the earlier ones. In the best case we
would find run 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4
overlap.

I have spoken with Len Shapiro, a professor at Portland State
University, regarding sorting before.

He suggests that PostgreSQL should implement forecasting, which is
similar to what you're describing. Forecasting does not require that
entire runs are disjoint, it works by tracking the maximum values from
the last block read from every run. This allows you to know which run
you will need more blocks from the soonest.

I'm still looking into the problem to understand it better, but the
algorithm is in Knuth Vol 3.

I can look at it in more detail, but have you already looked into this
idea? Is there a reason we don't do this currently?

Regards,
Jeff Davis

#4Simon Riggs
simon@2ndquadrant.com
In reply to: Jeff Davis (#3)
Re: Sorting Improvements for 8.4

On Fri, 2007-11-30 at 12:07 -0800, Jeff Davis wrote:

On Tue, 2007-11-27 at 18:03 +0000, Simon Riggs wrote:

5. DYNAMIC RUN HANDLING (in Final Merge)

Another way of addressing a) is to simply make better use of memory
itself. Let's look at that in more detail:

Number of runs that can be merged at once is currently fixed, based upon
available memory. This has the underlying assumption that all runs will
be concurrently active during final merging, which may not always be
true.

If we have random data then almost all runs will overlap with all other
runs, i.e. the min and max values are sufficiently wide that the runs do
all overlap. In many cases, data arrives in somewhat sorted order, e.g.
financial data is fairly regular with some late payers but not many, and
those trail off with a fairly tight decay. In the somewhat sorted case
we find that the actual overlap is less than total, so there are many
later runs that don't overlap the earlier ones. In the best case we
would find run 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4
overlap.

I have spoken with Len Shapiro, a professor at Portland State
University, regarding sorting before.

He suggests that PostgreSQL should implement forecasting, which is
similar to what you're describing. Forecasting does not require that
entire runs are disjoint, it works by tracking the maximum values from
the last block read from every run. This allows you to know which run
you will need more blocks from the soonest.

I'm still looking into the problem to understand it better, but the
algorithm is in Knuth Vol 3.

I can look at it in more detail, but have you already looked into this
idea? Is there a reason we don't do this currently?

Interesting, I hadn't read that part.

Knuth's Algorithm F covers how to do a P-way merge using 2P + 2 buffers.
My ideas cover how to do a P-way merge when you don't have enough memory
for that many buffers.

The current sort code makes two assumptions, amongst others

1. minimizing number of runs is always worth it
2. there is a single fixed maximum size of P, depending upon memory

I'm challenging both of those. Only runs that overlap need to be merged
simultaneously, so if the runs aren't overlapping then it's OK to allow
more runs to be formed. If it's OK to allow more runs, then reducing heap
size to allow better CPU efficiency is possible.

So Algorithm F is somewhat orthogonal to what I've proposed, though
maybe still interesting. What we do now is fairly close, but you should
look at the code in tuplesort.c and logtape.c to see how well it
matches. That might lead to an increase in the limit of the number of
concurrent runs mergeable at any one time.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#5Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#4)
Re: Sorting Improvements for 8.4

On Mon, 2007-12-03 at 11:51 +0000, Simon Riggs wrote:

So Algorithm F is somewhat orthogonal to what I've proposed, though
maybe still interesting. What we do now is fairly close, but you should
look at the code in tuplesort.c and logtape.c to see how well it
matches. That might lead to an increase in the limit of the number of
concurrent runs mergeable at any one time.

tuplesort.c:

* When merging runs, we use a heap containing just the frontmost tuple from
* each source run; we repeatedly output the smallest tuple and insert the
* next tuple from its source tape (if any). When the heap empties, the merge
* is complete. The basic merge algorithm thus needs very little memory ---
* only M tuples for an M-way merge, and M is constrained to a small number.
* However, we can still make good use of our full workMem allocation by
* pre-reading additional tuples from each source tape. Without prereading,
* our access pattern to the temporary file would be very erratic; on average
* we'd read one block from each of M source tapes during the same time that
* we're writing M blocks to the output tape, so there is no sequentiality of
* access at all, defeating the read-ahead methods used by most Unix kernels.
* Worse, the output tape gets written into a very random sequence of blocks
* of the temp file, ensuring that things will be even worse when it comes
* time to read that tape. A straightforward merge pass thus ends up doing a
* lot of waiting for disk seeks. We can improve matters by prereading from
* each source tape sequentially, loading about workMem/M bytes from each tape
* in turn. Then we run the merge algorithm, writing but not reading until
* one of the preloaded tuple series runs out. Then we switch back to preread
* mode, fill memory again, and repeat. This approach helps to localize both
* read and write accesses.

The idea of prefetching, as I understand it, is that we don't blindly
preread workMem/M bytes from each of M tapes; instead we predict
which tapes we will need tuples from next through forecasting.

If I understand correctly, we just keep track of the maximum value of
the last block read from each run, and then always read from the run in
which the last block read has the lowest maximum.

It seems as if this would allow a variable number of runs to be merged
at once, but if the data really *is* random, we'd want it to degrade
gracefully to something resembling the current implementation.

I'm being somewhat vague here because I haven't taken the time to
really understand it. If you think this idea has potential I will look
into it in more detail.

Regards,
Jeff Davis

#6Simon Riggs
simon@2ndquadrant.com
In reply to: Jeff Davis (#5)
Re: Sorting Improvements for 8.4

On Mon, 2007-12-03 at 10:32 -0800, Jeff Davis wrote:

If I understand correctly, we just keep track of the maximum value of
the last block read from each run, and then always read from the run in
which the last block read has the lowest maximum.

Yep, sounds like Algorithm F

It seems as if this would allow a variable number of runs to be merged
at once, but if the data really *is* random, we'd want it to degrade
gracefully to something resembling the current implementation.

If we also keep track of the endpoints of runs that we haven't yet read
from, then yes that would link my ideas with Algorithm F, so we just
have a single implementation. (F++ ?)

Probably easiest to store the endpoint tuples directly, with some sane
limits for when we have very large tuples.

You'll still need to do run-level forecasting as I had proposed to tell
whether you need to do any intermediate merging prior to the final
merge. So the two sets of ideas can't be brought together completely.

I'm being somewhat vague here because I haven't taken the time to
really understand it. If you think this idea has potential I will look
into it in more detail.

Yes, F++ sounds like it will use memory more effectively than we do
currently. That's likely to improve performance when the number of runs
approaches the limit for the size of work_mem. So this will improve
external sorts with too small memory allocations, but it won't do
anything about sorts with too large a memory allocation. That's probably
the order of importance for tackling sort performance, so that's good.

Probably best to test with
- 1M - 4M work_mem, so we see the full benefit of any improvements in
memory utilisation in a typical context
- number of runs is nearly at limit for memory
- total sort is very large, so we see real I/O issues starkly

You'll need to instrument things carefully so you can tell how many runs
are being merged at any one time and how that affects elapsed time/row.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#7Gregory Stark
stark@enterprisedb.com
In reply to: Simon Riggs (#6)
Re: Sorting Improvements for 8.4

"Simon Riggs" <simon@2ndquadrant.com> writes:

On Mon, 2007-12-03 at 10:32 -0800, Jeff Davis wrote:

If I understand correctly, we just keep track of the maximum value of
the last block read from each run, and then always read from the run in
which the last block read has the lowest maximum.

So it sounds like the use case where this is the biggest win would be a
situation where you have presorted input which has been sliced up. So for
example sorting by "zip code" in a table which was clustered by city. The
alphabetic order of the cities isn't correlated to the results but all the zip
codes for a city are in a contiguous block somewhere in the output.

In such a case after doing a single pass we would have a bunch of tapes each
of which corresponded to a single city and was able to completely reorder the
zip codes in that city to be ordered. So the desired results would be, for
example, all the tuples from tape 17 (NYC) followed by all the tuples from
tape 3 (Buffalo) followed by all the tuples from tape 1 (Albuquerque), etc.

We currently preread an equal amount from each tape and then would empty all
the preread tuples from tape 17, refill them, preread them again, repeat until
tape 17 is empty, then move on to tape 3. All the tuples except those on
the currently active tape are completely idle.

I think the way to do what you're proposing is to preread one tuple from each
tape, then when one preread bunch is emptied refill it with twice as many and
repeat. In this case you would end up with nearly all of workmem full of
tuples from NYC until you're done with NYC. That would increase the prereading
block size by a factor of 20 in this case.

So the question is just how many seeks are we doing during sorting. If we're
doing 0.1% seeks and 99.9% sequential i/o then eliminating the 0.1% entirely
(which we can't do) isn't going to speed up seeking all that much. If we're
doing 20% seeks and can get that down to 10% it might be worthwhile.

I'm not sure where the idea of keeping the current bounds of the input tapes
comes into it. We only preread when we run out of tuples anyways and then we
don't really have a choice about which tape we want to preread from. And it's
a good thing too since maintaining such a list of bounds and finding the
lowest or highest would mean maintaining a second heap which would basically
double the cpu cost of sorting.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

#8Simon Riggs
simon@2ndquadrant.com
In reply to: Gregory Stark (#7)
Re: Sorting Improvements for 8.4

On Mon, 2007-12-03 at 20:40 +0000, Gregory Stark wrote:

I think the way to do what you're proposing is...

Don't understand that. Algorithm F covers that already doesn't it?

So the question is just how many seeks are we doing during sorting. If we're
doing 0.1% seeks and 99.9% sequential i/o then eliminating the 0.1% entirely
(which we can't do) isn't going to speed up seeking all that much. If we're
doing 20% seeks and can get that down to 10% it might be worthwhile.

The buffer size at max tapes is an optimum - a trade off between
avoiding intermediate merging and merging efficiently. Freeing more
memory is definitely going to help in the case of low work_mem and lots
of runs.

You're right that there is a limit to the benefit you can get. I wrote a
patch in 2005/6 to optimise the memory usage when there were few runs
and lots of memory. I still think there's value in that.

I'm not sure where the idea of keeping the current bounds of the input tapes
comes into it. We only preread when we run out of tuples anyways and then we
don't really have a choice about which tape we want to preread from.

You have to decide whether to perform intermediate merges or whether you
can do everything at the final merge. Otherwise you can't merge more
runs than you have buffers for, since at some point you'd freeze up and
not be able to read any more input.

And it's a good thing too since maintaining such a list of bounds and finding the
lowest or highest would mean maintaining a second heap which would basically
double the cpu cost of sorting.

I think you're not understanding me.

You only need to record the lowest or highest when a run
completes/starts. When all runs have been written we then have a table
of the highest and lowest values for each run. We then scan that to see
whether we can perform merging in one pass, or if not what kind of
intermediate merging is required. We keep the merge plan in memory and
then follow it. So it's probably a very small % of total sort cost, though
it might save you doing intermediate merges with huge costs.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#9Gregory Stark
stark@enterprisedb.com
In reply to: Simon Riggs (#8)
Re: Sorting Improvements for 8.4

"Simon Riggs" <simon@2ndquadrant.com> writes:

The buffer size at max tapes is an optimum - a trade off between
avoiding intermediate merging and merging efficiently. Freeing more
memory is definitely going to help in the case of low work_mem and lots
of runs.

I can't follow these abstract arguments. That's why I tried to spell out a
concrete example.

I think you're not understanding me.

You only need to record the lowest or highest when a run
completes/starts. When all runs have been written we then have a table
of the highest and lowest values for each run. We then scan that to see
whether we can perform merging in one pass, or if not what kind of
intermediate merging is required. We keep the merge plan in memory and
then follow it. So probably very small % of total sort cost, though
might save you doing intermediate merges with huge costs.

Ok, that's a very different concept than what I was thinking.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning

#10Jeff Davis
pgsql@j-davis.com
In reply to: Gregory Stark (#7)
Re: Sorting Improvements for 8.4

On Mon, 2007-12-03 at 20:40 +0000, Gregory Stark wrote:

So the question is just how many seeks are we doing during sorting. If we're
doing 0.1% seeks and 99.9% sequential i/o then eliminating the 1% entirely
(which we can't do) isn't going to speed up seeking all that much. If we're
doing 20% seeks and can get that down to 10% it might be worthwhile.

It's not just about eliminating seeks, it's about being able to merge
more runs at one time.

If you are merging 10 runs at once, and only two of those runs overlap
while the rest contain much greater values, you might be spending 99% of
the time in sequential I/O.

But the point is, we're wasting the memory holding those other 8 runs
(80% of the memory you're using), so we really could be merging a lot
more than 10 runs at once. This might eliminate stages from the merge
process.

My point is just that "how many seeks are we doing" is not the only
question. We could be doing 99% sequential I/O and still make huge wins.

In reality, of course, the runs aren't going to be disjoint completely,
but they may be partially disjoint. That's where forecasting comes in:
you preread from the tapes you will actually need tuples from soonest.

Regards,
Jeff Davis

#11Jeff Davis
pgsql@j-davis.com
In reply to: Gregory Stark (#7)
Re: Sorting Improvements for 8.4

On Mon, 2007-12-03 at 20:40 +0000, Gregory Stark wrote:

I'm not sure where the idea of keeping the current bounds of the input tapes
comes into it. We only preread when we run out of tuples anyways and then we
don't really have a choice about which tape we want to preread from. And it's
a good thing too since maintaining such a list of bounds and finding the
lowest or highest would mean maintaining a second heap which would basically
double the cpu cost of sorting.

You're only keeping track of the maximum value for each run, which
should be cheap to track. The only time it changes is when you're
reading more data from that run, in which case it increases.

The tradeoff that's happening right now is: we want to merge many runs
at once because it reduces the number of merge phases, but the problem
is that it increases the seeking because we read one block from one run,
then one block from another run, etc., especially if the input is
random.

If we reduce the number of runs, then we can preread more efficiently.
See:

tuplesort.c:

* as sorted runs, we can eliminate any repeated I/O at all. In the current
* code we determine the number of tapes M on the basis of workMem: we want
* workMem/M to be large enough that we read a fair amount of data each time
* we preread from a tape, so as to maintain the locality of access described
* above. Nonetheless, with large workMem we can have many tapes.

So, for workMem/M to be "large enough", M has to be small enough. But a
small M means we have to do more merge phases, which is expensive.

Forecasting improves this trade-off. Forecasting no longer _blindly_
prereads from each tape; it uses information that it already has (the
max value of the last block read from each run) to determine the runs
from which we need tuples the soonest.

Then, it prereads the _correct_ data.
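
For what it's worth, the decision itself is tiny; a sketch (with invented
names, not the actual tuplesort.c structures) might look like:

#include <stdbool.h>

/* Hypothetical per-run read state kept by the merge. */
typedef struct
{
    long    last_block_max;   /* largest key in the most recently read block */
    bool    exhausted;        /* no blocks left on disk for this run */
} RunReadState;

/*
 * Forecasting: among runs that still have data on disk, refill the one
 * whose last-read block had the smallest maximum key, since the merge will
 * drain that run's preread tuples soonest.  Returns -1 when every run is
 * fully read.
 */
static int
choose_run_to_preread(const RunReadState *runs, int nruns)
{
    int     best = -1;

    for (int i = 0; i < nruns; i++)
    {
        if (runs[i].exhausted)
            continue;
        if (best < 0 || runs[i].last_block_max < runs[best].last_block_max)
            best = i;
    }
    return best;
}

The refill size could then lean towards that run instead of a flat
workMem/M split, though how far to lean is exactly the kind of thing that
needs measuring.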

Regards,
Jeff Davis

#12Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Simon Riggs (#1)
Re: Sorting Improvements for 8.4

Has anyone looked into sorting algorithms that could use
more than one CPU or core at a time?

Benchmarks I see [1][2] suggest that sorting is an area that
improves greatly with multiple processors and even with
multi-threading on a single-core processor.

"For 1-processor and 2-threads (1p2t), the algorithm sorts
the relation about 48% faster than the single-threaded version
with a speedup of 31% during the quicksort and 58% during the
mergesort. The dual-processor (2p2t) version provides an even
faster total speedup of 86% over the single-threaded version
with a speedup of 60% during the quicksort and 100% during
the merge sort."
[from the acm paper on link 2 below]

PS: Yeah, I know multi-threading is a hot-button on these
lists; but sorting seems a relatively isolated part of the code
and I'd wonder if it'd be isolate-able enough that multiple
CPUs could be used there.

[1]: http://www.cs.cmu.edu/~damon2005/damonpdf/4%20best%20paper%20-%20multithreaded%20architectures%20and%20the%20sort%20benchmark.pdf
[2]: http://delivery.acm.org/10.1145/1120000/1114254/DaMoN_103.pdf?key1=1114254&key2=5713023711&coll=&dl=ACM&CFID=15151515&CFTOKEN=6184618

#13Dimitri Fontaine
dfontaine@hi-media.com
In reply to: Ron Mayer (#12)
Re: Sorting Improvements for 8.4

Hi,

On Tuesday 18 December 2007, Ron Mayer wrote:

Has anyone looked into sorting algorithms that could use
more than one CPU or core at a time?

[...]

PS: Yeah, I know multi-threading is a hot-button on these
lists; but sorting seems a relatively isolated part of the code
and I'd wonder if it'd be isolate-able enough that multiple
CPUs could be used there.

And before the objections about multi-threading implementation and portability
concerns arise, what about using a coroutine BSD-licenced portable
implementation such as Protothreads to let backend code use several CPUs at a
time?
http://www.sics.se/~adam/pt/

With such a tool, would it be possible to think about producer/consumer
parallel execution for sorting, aggregate nodes or other parts of the
executor?

Hope this helps, regards,
--
dim

#14Simon Riggs
simon@2ndquadrant.com
In reply to: Ron Mayer (#12)
Re: Sorting Improvements for 8.4

On Mon, 2007-12-17 at 16:34 -0800, Ron Mayer wrote:

PS: Yeah, I know multi-threading is a hot-button on these
lists; but sorting seems a relatively isolated part of the code
and I'd wonder if it'd be isolate-able enough that multiple
CPUs could be used there.

I'm not sure multi-threading is the issue you think it is. Threads are, but
only for architectural reasons. Using multiple processes to complete a
task seems very sensible to me.

Yeh, sorting is isolated enough to try out some of those ideas on. I was
unaware of the work on finding medians, so that's a good way of dividing
the workloads for parallelism.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#15Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#14)
Re: Sorting Improvements for 8.4

On Tue, 2007-12-18 at 09:31 +0000, Simon Riggs wrote:

On Mon, 2007-12-17 at 16:34 -0800, Ron Mayer wrote:

PS: Yeah, I know multi-threading is a hot-button on these
lists; but sorting seems a relatively isolated part of the code
and I'd wonder if it'd be isolate-able enough that multiple
CPUs could be used there.

I'm not sure multi-threading is the issue you think it is. Threads are, but
only for architectural reasons. Using multiple processes to complete a
task seems very sensible to me.

My first thought would be that we would need a new executor node (e.g.
"ParallelSort") that would only be chosen when the cost of the sort is
large enough to outweigh other factors (such as process creation time,
dividing available work_mem, and any necessary IPC).

It seems to me the simplest way to do it would be to allow each sub
process to allocate work_mem/P where P is the degree of parallelization.
However, that somewhat works against our schemes for dynamic run
handling and forecasting, and may lead to more random reading from disk.
Any other scheme I can think of would involve more IPC, which might make
the idea just too complex.

Regards,
Jeff Davis

#16Mark Mielke
mark@mark.mielke.cc
In reply to: Jeff Davis (#15)
Re: Sorting Improvements for 8.4

Jeff Davis wrote:

My first thought would be that we would need a new executor node (e.g.
"ParallelSort") that would only be chosen when the cost of the sort is
large enough to outweigh other factors (such as process creation time,
dividing available work_mem, and any necessary IPC).

It seems to me the simplest way to do it would be to allow each sub
process to allocate work_mem/P where P is the degree of parallelization.
However, that somewhat works against our schemes for dynamic run
handling and forecasting, and may lead to more random reading from disk.
Any other scheme I can think of would involve more IPC, which might make
the idea just too complex.

I am curious - what algorithms exist to efficiently do a parallel sort?
Do you mean if sorting 1 million items, it is possible to separate this
into 2 sets of 500 thousand each, execute them in separate threads
(with task administration and synchronization overhead), combine the
results, and complete the task in significantly less time than doing it
in one thread? I am skeptical that this is possible, and suspect that
the overall efficiency of the system would go down even if the
throughput of a single execution increases.

Or do you mean being able to perform parts of the query plan fully in
parallel? If this, then one would need a lot more than ParallelSort...

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

#17Michał Zaborowski
michal.zaborowski@gmail.com
In reply to: Mark Mielke (#16)
Re: Sorting Improvements for 8.4

2007/12/19, Mark Mielke <mark@mark.mielke.cc>:

Jeff Davis wrote:

My first thought would be that we would need a new executor node (e.g.
"ParallelSort") that would only be chosen when the cost of the sort is
large enough to outweigh other factors (such as process creation time,
dividing available work_mem, and any necessary IPC).

It seems to me the simplest way to do it would be to allow each sub
process to allocate work_mem/P where P is the degree of parallelization.
However, that somewhat works against our schemes for dynamic run
handling and forecasting, and may lead to more random reading from disk.
Any other scheme I can think of would involve more IPC, which might make
the idea just too complex.

I am curious - what algorithms exist to efficiently do a parallel sort?
Do you mean if sorting 1 million items, it is possible to separate this
into 2 sets of 500 thousand each, execute them in separate threads
(with task administration and synchronization overhead), combine the
results, and complete the task in significantly less time than doing it
in one thread? I am skeptical that this is possible, and suspect that
the overall efficiency of the system would go down even if the
throughput of a single execution increases.

OK - we want to sort a table with quicksort and we want to do it with N threads.
Every thread gets a begin and end index into the table; the first step starts
at 0 and ends at count - 1. A single step: find a pivot (median) value and move
lower values before it and bigger values after it. In the normal case we use a
recursive call, so the same procedure is called for those two parts. With
threads we can instead put the index ranges on a side list and use a pool of
threads to pick work up from the list. The table itself can be shared; only
access to the side list of indices has to be serialized.
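
To make that concrete, a rough pthreads sketch of the scheme (all names
invented; it busy-waits rather than blocking, purely to keep the sketch
short):

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RANGES 1024

typedef struct { size_t lo, hi; } Range;        /* half-open [lo, hi) */

static int             *data;                   /* array being sorted */
static Range            worklist[MAX_RANGES];
static int              ntasks;                 /* ranges waiting in the list */
static int              active;                 /* ranges currently being worked */
static pthread_mutex_t  lock = PTHREAD_MUTEX_INITIALIZER;

static size_t
partition(size_t lo, size_t hi)
{
    int     pivot = data[hi - 1];
    size_t  store = lo;

    for (size_t i = lo; i + 1 < hi; i++)
        if (data[i] < pivot)
        {
            int tmp = data[i]; data[i] = data[store]; data[store] = tmp;
            store++;
        }
    data[hi - 1] = data[store];
    data[store] = pivot;
    return store;
}

static void
sort_range(size_t lo, size_t hi)
{
    while (hi - lo > 1)
    {
        size_t  p = partition(lo, hi);

        /* offer the right half to other threads if there is room for it */
        pthread_mutex_lock(&lock);
        if (ntasks < MAX_RANGES && hi - (p + 1) > 1)
        {
            worklist[ntasks++] = (Range){ p + 1, hi };
            pthread_mutex_unlock(&lock);
        }
        else
        {
            pthread_mutex_unlock(&lock);
            sort_range(p + 1, hi);              /* no takers: do it ourselves */
        }
        hi = p;                                 /* keep working on the left half */
    }
}

static void *
worker(void *arg)
{
    (void) arg;
    for (;;)
    {
        Range   r;
        bool    have = false;

        pthread_mutex_lock(&lock);
        if (ntasks > 0)
        {
            r = worklist[--ntasks];
            active++;
            have = true;
        }
        else if (active == 0)
        {
            pthread_mutex_unlock(&lock);
            return NULL;        /* nothing queued, nobody producing: done */
        }
        pthread_mutex_unlock(&lock);

        if (have)
        {
            sort_range(r.lo, r.hi);
            pthread_mutex_lock(&lock);
            active--;           /* this range and its kept halves are sorted */
            pthread_mutex_unlock(&lock);
        }
        /* else: spin until work appears; a real version would block */
    }
}

To drive it, seed the worklist with the whole range [0, count) and start N
worker threads; when the list is empty and no thread still holds a range,
the table is sorted.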

Or do you mean being able to perform parts of the query plan fully in
parallel? If this, then one would need a lot more than ParallelSort...

Nice to have, but mostly for data warehouses. In other cases - e.g. a backend
for the Internet - there are many requests and every processor/core is already
kept nicely busy.

--
Regards,
Michał Zaborowski (TeXXaS)

#18Andreas Joseph Krogh
andreak@officenet.no
In reply to: Dimitri Fontaine (#13)
Re: Sorting Improvements for 8.4

On Tuesday 18 December 2007 10:03:25 Dimitri Fontaine wrote:

Hi,

On Tuesday 18 December 2007, Ron Mayer wrote:

Has anyone looked into sorting algorithms that could use
more than one CPU or core at a time?

[...]

PS: Yeah, I know multi-threading is a hot-button on these
lists; but sorting seems a relatively isolated part of the code
and I'd wonder if it'd be isolate-able enough that multiple
CPUs could be used there.

And before the objections about multi-threading implementation and portability
concerns arise, what about using a coroutine BSD-licenced portable
implementation such as Protothreads to let backend code use several CPUs at
a time?
http://www.sics.se/~adam/pt/

With such a tool, would it be possible to think about producer/consumer
parallel execution for sorting, aggregate nodes or other parts of the
executor?

Hope this helps, regards,

And remember: users don't care about portability issues, they care about
performance. If multi-threading is a way to speed up sorting considerably, it
should, IMHO, be considered seriously.

--
Andreas Joseph Krogh

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andreas Joseph Krogh (#18)
Re: Sorting Improvements for 8.4

Andreas Joseph Krogh <andreak@officenet.no> writes:

And remember: users don't care about portability issues, they care about
performance.

Nonsense. If Postgres stops working on their machine, they'll care.

regards, tom lane

#20Mark Mielke
mark@mark.mielke.cc
In reply to: Michał Zaborowski (#17)
Re: Sorting Improvements for 8.4

Michał Zaborowski wrote:

OK - we want to sort a table with quicksort and we want to do it with N threads.
Every thread gets a begin and end index into the table; the first step starts
at 0 and ends at count - 1. A single step: find a pivot (median) value and move
lower values before it and bigger values after it. In the normal case we use a
recursive call, so the same procedure is called for those two parts. With
threads we can instead put the index ranges on a side list and use a pool of
threads to pick work up from the list. The table itself can be shared; only
access to the side list of indices has to be serialized.

Stupid question #2: Is it well recognized that the CPU is the bottleneck
in the PostgreSQL sorting mechanism? Or might it be memory bandwidth and
I/O?

It would seem to me that any sort worth parallelizing (given the
administrative and synchronization overhead) must have data larger than
the L2 cache. If larger than the L2 cache, it runs at real memory speed.
If at real memory speed, wouldn't one CPU, without hardware
synchronization, be able to fill the memory read/write pipe? If we
'divide and conquer' to parallelize, wouldn't the values written from one
thread often (1/N of them) need to be read from another thread,
requiring hardware data synchronization?

I see the wikipedia.org page describes how easy it is to parallelize
quick sort, and scale performance linearly with the number of
processors, but I don't see references to back this claim.
At least some of these steps seem difficult or impractical to
parallelize. For example, the initial partition reorder that moves items
lower than the pivot to the left, and items higher than the pivot to the
right, would not be easy to parallelize using an in-place re-order. It
needs to move one partition down before it can 'divide and conquer'.
They say no synchronization is required, but I think they are missing
the hardware synchronization required (especially in the innermost
loops, where the thread's task becomes shorter and starts to fit in
L1/L2). They say linear, but then talk about a 'new thread being
created'. New thread creation has a cost, and if reduced to using a
thread pool, then synchronization *is* required.

It sounds like a 'nice in theory' idea. :-) Which doesn't mean it is
wrong...

I am curious enough to write a test...

Or do you mean being able to perform parts of the query plan fully in
parallel? If this, then one would need a lot more than ParallelSort..

Nice to have, but mostly for data warehouses. In other cases - e.g. a backend
for the Internet - there are many requests and every processor/core is already
kept nicely busy.

I'm a fan of the 'each plan item is a task that is assigned to the
pool, with each CPU grabbing tasks from the pool' approach. Another 'nice in
theory' idea (used by DB2?). As it is, though, I think PostgreSQL
planning is heavily designed to maximize performance on a single CPU,
and single queries would not easily scale to multiple CPUs. (Perhaps
hashing could be done on another CPU, or as you describe above, sorting)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

#21Jeff Davis
pgsql@j-davis.com
In reply to: Mark Mielke (#20)
Re: Sorting Improvements for 8.4

On Wed, 2007-12-19 at 12:08 -0500, Mark Mielke wrote:

Stupid question #2: Is it well recognized that the CPU is the
bottleneck in the PostgreSQL sorting mechanism? Or might it be memory
bandwidth and I/O?

I think it depends a lot on several factors. It's probably a different
bottleneck for integers versus localized text, and depends on the
available memory and I/O characteristics.

It would seem to me that any sort worth parallelizing (given the
administrative and synchronization overhead) must have data larger than
the L2 cache. If larger than the L2 cache, it runs at real memory speed.
If at real memory speed, wouldn't one CPU, without hardware
synchronization, be able to fill the memory read/write pipe?

We do an external merge sort, which involves merging M runs together.
You seem to be implying that we can generate the output run at disk
speed, and therefore the CPU speed is unimportant.

I suspect that the comparison costs are enough that the above statement
isn't true in all cases, particularly in the case of localized text.

Also, there is probably a lot of memory copying going on, and that
probably destroys a lot of the effectiveness of L2 caching. When L2
caching is ineffective, the CPU spends a lot of time just waiting on
memory. In that case, it's better to have P threads of execution all
waiting on memory operations in parallel.

This would explain why "1p2t" would outperform a "1p1t" in Ron's
reference above.

These are just my first thoughts, however. There is a lot of existing
research out there that we can look into, and also a lot of tests that
we can run before jumping into this.

I think parallel sorting can be looked into separately from the other
sorting improvements.

Regards,
Jeff Davis

#22Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Mark Mielke (#16)
Re: Sorting Improvements for 8.4

Mark Mielke wrote:

I am curious - what algorithms exist to efficiently do a parallel sort?
Do you mean if sorting 1 million items, it is possible to separate this
into 2 sets of 500 thousand each, execute them in separate threads
(with task administration and synchronization overhead) , combine the
results, and complete the task in significantly less time than doing it
in one thread? I am skeptical that this is possible...

The link in the beginning of the thread points to articles
that seem to describe one such algorithm, along with benchmarks.
(http://tinyurl.com/3bvu4u, http://tinyurl.com/32wg2m)
The improvements were pretty consistent across set sizes ranging
from very small sets (hundreds) to quite large ones (hundreds of K).

Interestingly, even multi-threading helped a lot.

"Our tests correlate well with previous research that showed
Intel’s implementation of SMT (Hyper-Threading) to be
adept at hiding this latency [6, 20, 12]. Table 4 shows that by
having two threads access memory at the same time, performance
improved over 80% when compared to the singlethreaded version."

It uses both quicksort phases and merge phases; for the merge phase,
using 2 CPUs (no hyperthreading) apparently gave more than a 2X speed
improvement, apparently because it could overlap memory access with
CPU work more.

Or do you mean being able to perform parts of the query plan fully in
parallel? If this, then one would need a lot more than ParallelSort...

I wouldn't recommend that - it seems like a Hard Problem.

My guess is that the best way to use multiple threads in one backend
would be to find specific algorithms like sorting that would be
easier to isolate.

#23Mark Mielke
mark@mark.mielke.cc
In reply to: Ron Mayer (#22)
Re: Sorting Improvements for 8.4

Ron Mayer wrote:

The link in the beginning of the thread points to articles
that seem to describe one such algorithm; along with benchmarks.
(http://tinyurl.com/3bvu4u, http://tinyurl.com/32wg2m)
The improvements were pretty consistent from set sizes ranging
from very small sets (hundreds) to quite large ones (hundreds of K).

Interestingly, even multi-threading helped a lot.

"Our tests correlate well with previous research that showed
Intel’s implementation of SMT (Hyper-Threading) to be
adept at hiding this latency [6, 20, 12]. Table 4 shows that by
having two threads access memory at the same time, performance
improved over 80% when compared to the singlethreaded version."

It uses both quicksort phases and merge phases; for the merge phase,
using 2 CPUs (no hyperthreading) apparently gave more than a 2X speed
improvement, apparently because it could overlap memory access with
CPU work more.

Good points. I had forgotten about DDR and DDR2 having high throughput
at the cost of high latency. Somewhere in there, having the largest
possible number of memory requests in the queue would allow the hardware
to eliminate this high-latency effect.

Or do you mean being able to perform parts of the query plan fully in
parallel? If this, then one would need a lot more than ParallelSort...

I wouldn't recommend that - it seems like a Hard Problem.

My guess is that the best way to use multiple threads in one backend
would be to find specific algorithms like sorting that would be
easier to isolate.

Also a good point. :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

#24Mark Mielke
mark@mark.mielke.cc
In reply to: Jeff Davis (#21)
Re: Sorting Improvements for 8.4

Jeff Davis wrote:

On Wed, 2007-12-19 at 12:08 -0500, Mark Mielke wrote:

Stupid question #2: Is it well recognized that the CPU is the
bottleneck in the PostgreSQL sorting mechanism? Or might it be memory
bandwidth and I/O?

I think it depends a lot on several factors. It's probably a different
bottleneck for integers versus localized text, and depends on the
available memory and I/O characteristics.

Makes sense.

It would seem to me that any sort worth parallelizing (given the
administrative and synchronization overhead) must have data larger than
the L2 cache. If larger than the L2 cache, it runs at real memory speed.
If at real memory speed, wouldn't one CPU, without hardware
synchronization, be able to fill the memory read/write pipe?

We do an external merge sort, which involves merging M runs together.
You seem to be implying that we can generate the output run at disk
speed, and therefore the CPU speed is unimportant.

Correct. Or, alternatively, you could achieve the same effect using
asynchronous I/O or read-ahead.

I suspect that the comparison costs are enough that the above statement
isn't true in all cases, particularly in the case of localized text.

That sounds possible, but I still suspect that disk reads will be much
slower than localized text comparison. Perhaps I am overestimating the
performance of the comparison function?

Also, there is probably a lot of memory copying going on, and that
probably destroys a lot of the effectiveness of L2 caching. When L2
caching is ineffective, the CPU spends a lot of time just waiting on
memory. In that case, it's better to have P threads of execution all
waiting on memory operations in parallel.

I didn't consider the high throughput / high latency effect. This could
be true if the CPU prefetch isn't effective enough.

This would explain why "1p2t" would outperform a "1p1t" in Ron's
reference above.

These are just my first thoughts, however. There is a lot of existing
research out there that we can look into, and also a lot of tests that
we can run before jumping into this.

I think parallel sorting can be looked into separately from the other
sorting improvements.

Yep - I started to read up on it. It still sounds like it's a hard-ish
problem (to achieve near N times speedup for N CPU cores without
degrading performance for existing loads), but that doesn't mean
impossible. :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

#25Dann Corbit
DCorbit@connx.com
In reply to: Mark Mielke (#24)
Re: Sorting Improvements for 8.4

As long as sorting improvements are being considered, may I suggest an
experiment that uses a very simple model?

Assuming that you have K subfiles created by the initial sorting pass,
insert the top record of each file into a priority queue.

Then, emit records from the queue until the priority queue is empty.

Now, there will be the objection that we will be jumping willy-nilly all
over the disk because of reading one record at a time, but (depending on
how it is implemented) generally several records are buffered during a
read.

So (as a gentle suggestion) I suggest testing the model. It works great
for a single CPU or multiple CPU system for the work that *I* do. I
have no idea if it will be a benefit for PostgreSQL or not, but it
should be a very simple matter to try it. As long as someone is doing
the work right now, it would be a good time to give it a go.

I am not very familiar with PostgreSQL internals, but I would be willing
to give a hand with it (not really sure how much time I can guarantee,
though, since I would be doing it on my free time).

#26Jeff Davis
pgsql@j-davis.com
In reply to: Mark Mielke (#24)
Re: Sorting Improvements for 8.4

On Wed, 2007-12-19 at 15:51 -0500, Mark Mielke wrote:

That sounds possible, but I still suspect that disk reads will be much
slower than localized text comparison. Perhaps I am overestimating the
performance of the comparison function?

I think this simple test will change your perceptions:

Do an initdb with --locale="en_US.UTF-8" and start postgres.

test=> create table sorter(t text, b bytea, f float); CREATE TABLE
test=> insert into sorter select r AS rt, r::text::bytea AS rb, r AS rf
from (select random() as r from generate_series(1,1000000)) a;
INSERT 0 1000000
test=> select pg_size_pretty(pg_total_relation_size('sorter'));
pg_size_pretty
----------------
70 MB
(1 row)

test=> explain analyze select * from sorter order by t;
test=> explain analyze select * from sorter order by b;
test=> explain analyze select * from sorter order by f;

On my machine this table fits easily in memory (so there aren't any disk
reads at all). Sorting takes 7 seconds for floats, 9 seconds for binary
data, and 20 seconds for localized text. That's much longer than it
would take to read that data from disk, since it's only 70MB (which
takes a fraction of a second on my machine).

I think this disproves your hypothesis that sorting happens at disk
speed.

Yep - I started to read up on it. It still sounds like it's a hard-ish
problem (to achieve near N times speedup for N CPU cores without
degrading performance for existing loads), but that doesn't mean
impossible. :-)

You don't even need multiple cores to achieve a speedup, according to
Ron's reference.

Regards,
Jeff Davis

#27Jeff Davis
pgsql@j-davis.com
In reply to: Dann Corbit (#25)
Re: Sorting Improvements for 8.4

On Wed, 2007-12-19 at 14:41 -0800, Dann Corbit wrote:

As long as sorting improvements are being considered, may I suggest an
experiment that uses a very simple model?

Assuming that you have K subfiles created by the initial sorting pass,
insert the top record of each file into a priority queue.

Then, emit records from the queue until the priority queue is empty.

What is the principal difference between that idea and our existing sort
algorithm?

There's a good explanation in the comment at the top of tuplesort.c.

Regards,
Jeff Davis

#28Dann Corbit
DCorbit@connx.com
In reply to: Jeff Davis (#27)
Re: Sorting Improvements for 8.4

-----Original Message-----
From: Jeff Davis [mailto:pgsql@j-davis.com]
Sent: Wednesday, December 19, 2007 3:10 PM
To: Dann Corbit
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Sorting Improvements for 8.4

On Wed, 2007-12-19 at 14:41 -0800, Dann Corbit wrote:

As long as sorting improvements are being considered, may I suggest an
experiment that uses a very simple model?

Assuming that you have K subfiles created by the initial sorting pass,
insert the top record of each file into a priority queue.

Then, emit records from the queue until the priority queue is empty.

What is the principal difference between that idea and our existing sort
algorithm?

There's a good explanation in the comment at the top of tuplesort.c.

According to the comments, PostgreSQL uses replacement selection.
Replacement selection is a wonderful thing because it creates runs that
are twice as long as normal due to the snowplow effect. See (for
instance):
http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/69/27216/01209012.pdf

Then, the merge routine will have half as many runs to merge together.

So if (for instance) you would create 1024 subfiles without replacement
selection, then replacement selection will create only 512. That saves
one merge pass.

The algorithm that I am suggesting will take exactly one pass to merge
all of the files.

It works like this...

Imagine an array of pointers to the subfiles:
[*subfile][*subfile]...[*subfile]

Step 0:
We sort the array by a comparison operator that examines the top element
of each subfile. So now the array is ordered such that the record with
the smallest key is in array slot 0.

Step 1:
We remove the first record from the subfile in array slot 0. Now, the
priority of the first element *may* have changed. So if it is no longer
smaller than the subfile immediately to the right, we do a binary
insertion to put this subfile in its new location, moving the contents
of array slot[1] to array slot 0 if it is needed.

Step 2:
Is the entire list of subfiles empty? If yes, then terminate, if no
then go to Step 1.

Like I said, it is ultra-simple and it sorts the entire contents of all
subfiles to the output with a single pass.
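
For illustration only, here is a compact sketch of that loop with each
subfile modelled as an in-memory cursor (in the real case they would be
buffered temp files; every name is made up, and the linear re-insertion
stands in for the binary insertion of Step 1):

#include <stdlib.h>
#include <string.h>

/* Hypothetical subfile cursor: a sorted sequence of keys plus a position. */
typedef struct
{
    const long *keys;
    size_t      pos;
    size_t      len;
} Subfile;

static long
top(const Subfile *sf)
{
    return sf->keys[sf->pos];
}

/* Merge all subfiles into "out" (assumed large enough) in a single pass. */
static size_t
merge_subfiles(Subfile *files, size_t nfiles, long *out)
{
    Subfile   **slots = malloc(nfiles * sizeof(Subfile *)); /* checks omitted */
    size_t      live = 0;
    size_t      nout = 0;

    /* Step 0: order the slot array by each subfile's first key */
    for (size_t i = 0; i < nfiles; i++)
        if (files[i].len > 0)
            slots[live++] = &files[i];
    for (size_t i = 1; i < live; i++)
    {
        Subfile    *s = slots[i];
        size_t      j = i;

        while (j > 0 && top(slots[j - 1]) > top(s))
        {
            slots[j] = slots[j - 1];
            j--;
        }
        slots[j] = s;
    }

    while (live > 0)
    {
        Subfile    *first = slots[0];

        /* Step 1: emit the smallest key and advance its subfile */
        out[nout++] = top(first);
        first->pos++;

        if (first->pos >= first->len)
        {
            /* subfile exhausted: drop its slot entirely */
            live--;
            if (live > 0)
                memmove(&slots[0], &slots[1], live * sizeof(Subfile *));
        }
        else
        {
            /* re-insert slot 0 at its new position among the others */
            size_t  j = 0;

            while (j + 1 < live && top(slots[j + 1]) < top(first))
            {
                slots[j] = slots[j + 1];
                j++;
            }
            slots[j] = first;
        }
        /* Step 2: loop until every subfile is empty */
    }
    free(slots);
    return nout;
}

With K subfiles this touches each record exactly once after run formation,
which is the whole point of the exercise.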

Consider the way that current replacement selection works. The actual
asymptotic behavior of replacement selection is a terrible O(n^2). But
because we save one full merge pass, it is usually worth it anyway,
since memory access is much faster than disk. And if we only have a few
subfiles, the savings will be large. In the case of a priority queue
merge, we only have one single merge pass no matter how many subfiles
there are.

#29Dann Corbit
DCorbit@connx.com
In reply to: Dann Corbit (#28)
Re: Sorting Improvements for 8.4

P.S.
A beautiful paper on replacement selection is found here:
http://students.fim.uni-passau.de/~fickensc/Proseminar/Proseminar.pdf

#30Jeff Davis
pgsql@j-davis.com
In reply to: Dann Corbit (#28)
Re: Sorting Improvements for 8.4

On Wed, 2007-12-19 at 15:19 -0800, Dann Corbit wrote:

The algorithm that I am suggesting will take exactly one pass to merge
all of the files.

From tuplesort.c:

"In the current code we determine the number of tapes M on the basis of
workMem: we want workMem/M to be large enough that we read a fair amount
of data each time we preread from a tape, so as to maintain the locality
of access described above. Nonetheless, with large workMem we can have
many tapes."

It seems like you are just choosing M to be equal to the number of
initial runs, whereas the current code takes into account the cost of
having workMem/M too small.

We do want to increase the number of runs that can be merged at once;
that's what dynamic run handling and forecasting are all about. But we
want to avoid unnecessary seeking, also.

Regards,
Jeff Davis

#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Mielke (#24)
Re: Sorting Improvements for 8.4

Mark Mielke <mark@mark.mielke.cc> writes:

Jeff Davis wrote:

Also, there is probably a lot of memory copying going on, and that
probably destroys a lot of the effectiveness of L2 caching. When L2
caching is ineffective, the CPU spends a lot of time just waiting on
memory. In that case, it's better to have P threads of execution all
waiting on memory operations in parallel.

I didn't consider the high throughput / high latency effect. This could
be true if the CPU prefetch isn't effective enough.

Note that if this is the argument, then there's a ceiling on the speedup
you can expect to get: it's just the extent of mismatch between the CPU
and memory speeds. I can believe that suitable test cases would show
2X improvement for 2 threads, but it doesn't follow that you will get
10X improvement with 10 threads, or even 4X with 4.

regards, tom lane

#32Gregory Stark
stark@enterprisedb.com
In reply to: Jeff Davis (#26)
Re: Sorting Improvements for 8.4

"Jeff Davis" <pgsql@j-davis.com> writes:

test=> explain analyze select * from sorter order by t;
test=> explain analyze select * from sorter order by b;
test=> explain analyze select * from sorter order by f;

On my machine this table fits easily in memory (so there aren't any disk
reads at all). Sorting takes 7 seconds for floats, 9 seconds for binary
data, and 20 seconds for localized text. That's much longer than it
would take to read that data from disk, since it's only 70MB (which
takes a fraction of a second on my machine).

I think this disproves your hypothesis that sorting happens at disk
speed.

I suspect most of that is spent just copying the data around. Which would not
be helped by having multiple threads doing the copying -- and in fact might be
exacerbated if it required an extra copy to consolidate all the data in the
end.

How long does a simple "explain analyze select * from sorter" take?

And assuming you're doing disk sorts (in disk cache) you're doing quite a lot
of copying to temporary files (in disk cache) and then back to memory.

Note that speeding up a query from 20s to 5s isn't terribly useful. If it's
OLTP you can't be using all your cores for each user anyways. And if it's DSS
20s isn't a problem.

Where parallel processing like this becomes attractive is when you're running
a 2 hour query on a machine sequentially running scheduled batch jobs which
can be sped up to 30 minutes. But in that case you're almost certainly being
limited by your disk bandwidth, not your cpu speed.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!

#33Dann Corbit
DCorbit@connx.com
In reply to: Gregory Stark (#32)
Re: Sorting Improvements for 8.4

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
owner@postgresql.org] On Behalf Of Gregory Stark
Sent: Wednesday, December 19, 2007 5:26 PM
To: Jeff Davis
Cc: Mark Mielke; Michał Zaborowski; Simon Riggs; Ron Mayer; pgsql-
hackers@postgresql.org
Subject: Re: [HACKERS] Sorting Improvements for 8.4

"Jeff Davis" <pgsql@j-davis.com> writes:

test=> explain analyze select * from sorter order by t;
test=> explain analyze select * from sorter order by b;
test=> explain analyze select * from sorter order by f;

On my machine this table fits easily in memory (so there aren't any disk
reads at all). Sorting takes 7 seconds for floats, 9 seconds for binary
data, and 20 seconds for localized text. That's much longer than it
would take to read that data from disk, since it's only 70MB (which
takes a fraction of a second on my machine).

I think this disproves your hypothesis that sorting happens at disk
speed.

I suspect most of that is spent just copying the data around. Which would not
be helped by having multiple threads doing the copying -- and in fact might be
exacerbated if it required an extra copy to consolidate all the data in the
end.

Benchmarking a single system will really only explain that system.
Someone may have a disk farm with 2GB/Sec throughput:
http://www.sql-server-performance.com/articles/per/system_storage_configuration_p1.aspx

But such a configuration is very unlikely.

Someone may have 10Gb/s NICs, but those, too, are rare.

So for any benchmark, we will really just end up with a number for that system.

Typically, disk is the bottleneck.
I found this on the net somewhere, but it's quite a useful table for capacity planning (to find the weak link in the chain using back-of-the-envelope calculations):

Interface Width Frequency Bytes/Sec Bits/Sec
4-way interleaved PC1600 (DDR200) SDRAM 4 x 64bits 100 MHz DDR 6.4 GB/s 51 Gbps
Opteron HyperTransport memory bus 128bits 200 MHz DDR 6.4 GB/s 51 Gbps
Pentium 4 "800 MHz" FSB 64bits 200 MHz QDR 6.4 GB/s 51 Gbps
PC2 6400 (DDR-II 800) SDRAM 64bits 400 MHz DDR 6.4 GB/s 51 Gbps
PC2 5300 (DDR-II 667) SDRAM 64bits 333 MHz DDR 5.3 GB/s 43 Gbps
Pentium 4 "533 MHz" FSB 64bits 133 MHz QDR 4.3 GB/s 34 Gbps
PC2 4300 (DDR-II 533) SDRAM 64bits 266 MHz DDR 4.3 GB/s 34 Gbps
2-channel PC1066 RDRAM 2 x 16bits 533 MHz DDR 4.3 GB/s 34 Gbps
PCI-X 533 64bits 533 MHz 4.3 GB/s 34 Gbps
PCI-Express x16 serial/16lanes 2.5 GHz 4 GB/s 32 Gbps
Pentium 4 "400 MHz" FSB 64bits 100 MHz QDR 3.2 GB/s 25.6 Gbps
2-channel PC800 RDRAM 2 x 16bits 400 MHz DDR 3.2 GB/s 25.6 Gbps
2-way interleaved PC1600 (DDR200) SDRAM 2 x 64bits 100 MHz DDR 3.2 GB/s 25.6 Gbps
PC2 3200 (DDR-II 400) SDRAM 64bits 200 MHz DDR 3.2 GB/s 25.6 Gbps
PC3200 (DDR400) SDRAM 64bits 200 MHz DDR 3.2 GB/s 25.6 Gbps
PC2700 (DDR333) SDRAM 64bits 167 MHz DDR 2.7 GB/s 21 Gbps
PC2100 (DDR266) SDRAM 64bits 133 MHz DDR 2.1 GB/s 17 Gbps
AGP 8x 32bits 533 MHz 2.1 GB/s 17 Gbps
PCI-X 266 64bits 266 MHz 2.1 GB/s 17 Gbps
PCI-Express x8 serial/8lanes 2.5 GHz 2 GB/s 16 Gbps
EV6 bus (Athlon/Duron FSB) 64bits 100 MHz DDR 1.6 GB/s 13 Gbps
PC1600 (DDR200) SDRAM 64bits 100 MHz DDR 1.6 GB/s 13 Gbps
PC800 RDRAM 16bits 400 MHz DDR 1.6 GB/s 13 Gbps
PC150 SDRAM 64bits 150 MHz 1.3 GB/s 10.2 Gbps
10 gigabit ethernet serial 10 GHz 1.25 GB/s 10 Gbps
OC-192 serial 9.953 GHz 1.24 GB/s 9.953 Gbps
133 MHz FSB 64bits 133 MHz 1.06 GB/s 8.5 Gbps
PC133 SDRAM 64bits 133 MHz 1.06 GB/s 8.5 Gbps
AGP 4x 32bits 266 MHz 1.06 GB/s 8.5 Gbps
PCI-X 64bits 133 MHz 1.06 GB/s 8.5 Gbps
PCI-Express x4 serial/4lanes 2.5 GHz 1 GB/s 8 Gbps
100 MHz FSB 64bits 100 MHz 800 MB/s 6.4 Gbps
PC100 SDRAM 64bits 100 MHz 800 MB/s 6.4 Gbps
PC66 SDRAM 64bits 66 MHz 533 MB/s 4.3 Gbps
fast/wide PCI 64bits 66 MHz 533 MB/s 4.3 Gbps
AGP 2x 32bits 133 MHz 533 MB/s 4.3 Gbps
single-link DVI 12bits 165 MHz DDR 495 MB/s 3.96 Gbps
Ultra-320 SCSI 16bits 160 MHz 320 MB/s 2.6 Gbps
OC-48 network serial 2.488 GHz 311 MB/s 2.488 Gbps
AGP 32bits 66 MHz 266 MB/s 2.1 Gbps
PCI-Express x1 serial 2.5 GHz 250 MB/s 2 Gbps
Serial ATA/1500 disk serial 1.5 GHz 187 MB/s 1.5 Gbps
Ultra-160 SCSI 16bits 80 MHz 160 MB/s 1.3 Gbps
OC-24 network serial 1.244 GHz 155 MB/s 1.244 Gbps
PCI 32bits 33 MHz 133 MB/s 1.06 Gbps
ATA/133 disk 8bits 66 MHz DDR 133 MB/s 1.06 Gbps
gigabit ethernet serial 1 GHz 125 MB/s 1 Gbps
ATA/100 disk 8bits 50 MHz DDR 100 MB/s 800 Mbps
IEEE 1394b serial 800 MHz 100 MB/s 800 Mbps
Ultra-2 Wide SCSI 16bits 40 MHz 80 MB/s 640 Mbps
OC-12 network serial 622.08 MHz 77.7 MB/s 622.08 Mbps
ATA/66 disk 8bits 33 MHz DDR 66 MB/s 533 Mbps
USB-2 serial 480 MHz 60 MB/s 480 Mbps
IEEE 1394 serial 400 MHz 50 MB/s 400 Mbps
Ultra Wide SCSI 16bits 20 MHz 40 MB/s 320 Mbps
ATA/33 disk 8bits 16.6 MHz DDR 33 MB/s 266 Mbps
Fast Wide SCSI 16bits 10 MHz 20 MB/s 160 Mbps
OC-3 network serial 155.52 MHz 19.4 MB/s 155.52 Mbps
100baseT ethernet serial 100 MHz 12.5 MB/s 100 Mbps
OC-1 network serial 51.84 MHz 6.5 MB/s 51.84 Mbps
T-3 network serial 45 MHz 5.6 MB/s 44.736 Mbps
USB serial 12 MHz 1.5 MB/s 12 Mbps
10baseT ethernet serial 10 MHz 1.25 MB/s 10 Mbps
IrDA-2 serial 4 MHz 500 KB/s 4 Mbps
T-1 network serial 1.5 MHz 193 KB/s 1.544 Mbps

How long does a simple "explain analyze select * from sorter" take?

And assuming you're doing disk sorts (in disk cache) you're doing quite a lot
of copying to temporary files (in disk cache) and then back to memory.

Note that speeding up a query from 20s to 5s isn't terribly useful. If it's
OLTP you can't be using all your cores for each user anyways. And if it's DSS
20s isn't a problem.

Unless (of course) there are 20,000 users doing queries that would take 20 seconds each but now take 5 (when run single-user). They will still have a bit of a wait.

Where parallel processing like this becomes attractive is when you're running
a 2 hour query on a machine sequentially running scheduled batch jobs which
can be sped up to 30 minutes. But in that case you're almost certainly being
limited by your disk bandwidth, not your cpu speed.

A linear speedup of 2 or more is always worthwhile[*]. Since sorting (e.g. for 'group by' and 'order by') and sort-merge joins are a major database task, I guess that a linear speedup by a factor of 2 might make database operations as a whole 10% faster or so {OK, it's a SWAG}. I guess it would look good on the benchmarks, if nothing else.

[*] unless it is already fast enough. If, at peak load, a query takes 1 ms, then making the query take 0.5 ms is not going to win you any medals, especially if the improvement costs $10,000.

#34Greg Smith
gsmith@gregsmith.com
In reply to: Dann Corbit (#33)
Re: Sorting Improvements for 8.4

On Wed, 19 Dec 2007, Dann Corbit wrote:

Benchmarking a single system will really only explain that system.
Someone may have a disk farm with 2GB/Sec throughput
But such a configuration is very unlikely.

If you believe comments like those at
http://www.c0t0d0s0.org/archives/1792-Do-it-yourself-X4500.html it's
possible to hit >2GB/s total to the 48 disks in one of the Sun X4500
servers, which start at $24K. May be unlikely to you, but I was reading
there after I set one up last night, and that's a boring standard
configuration for some Sun and Greenplum customers.

Also, that's today--by the time 8.4 is mainstream, high-end machines will
be even faster. Wanna make a bet on how much disk throughput will be
available as SSD disks go mainstream in the next two years?

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#35Gregory Stark
stark@enterprisedb.com
In reply to: Dann Corbit (#33)
Re: Sorting Improvements for 8.4

"Dann Corbit" <DCorbit@connx.com> writes:

Note that speeding up a query from 20s to 5s isn't terribly useful. If it's
OLTP you can't be using all your cores for each user anyways. And if it's
DSS 20s isn't a problem.

Unless (of course) there are 20,000 users doing the queries that would take 20
seconds but now they take 5 (when run single-user). They will still have a bit
of a wait, of course.

I'm not exactly following. If you have 20,000 users then you're probably using
all the processors already. If you process them one by one on 4 cores in 5s
then you'll get the same throughput as if you ran them four at a time on 1
core each in 20s.

Where parallel processing like this becomes attractive is when you're
running a 2 hour query on a machine sequentially running scheduled batch
jobs which can be sped up to 30 minutes. But in that case you're almost
certainly being limited by your disk bandwidth, not your cpu speed.

A linear speedup of 2 or more is always worthwhile[*]. Since sorting (e.g.
for 'group by' and 'order by') and sort-merge joins are a major database
task, I guess that a linear speedup by a factor of 2 might make database
operations as a whole 10% faster or so {OK, it's a SWAG}. I guess it would
look good on the benchmarks, if nothing else.

Except note that you're not getting this linear speedup for free. To get a
linear speedup of 2x you'll be using more than 2x the cpu resources. If there
is nothing else contending for that resource (such as the scenario I described
where you're running a single large batch query on a system and want to use
all available resources to run it as fast as possible), then you'll get a 2x
speedup.

But if there is more than one query running on the system then you're not
actually gaining anything. Each query will run faster but you won't be able to
run as many simultaneously without having them slow back down. And the
overhead of parallelizing the query will be a net loss.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

#36Gregory Stark
stark@enterprisedb.com
In reply to: Greg Smith (#34)
Re: Sorting Improvements for 8.4

"Greg Smith" <gsmith@gregsmith.com> writes:

On Wed, 19 Dec 2007, Dann Corbit wrote:

Benchmarking a single system will really only explain that system.
Someone may have a disk farm with 2GB/Sec throughput
But such a configuration is very unlikely.

If you believe comments like those at
http://www.c0t0d0s0.org/archives/1792-Do-it-yourself-X4500.html it's possible
to hit >2GB/s total to the 48 disks in one of the Sun X4500 servers, which
start at $24K. May be unlikely to you, but I was reading there after I set one
up last night, and that's a boring standard configuration for some Sun and
Greenplum customers.

Surely such machines have kickass memory backplanes too though? How could it
ever be reasonable to have an i/o controller with more bandwidth than your
memory?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

#37Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Tom Lane (#31)
Re: Sorting Improvements for 8.4

Tom Lane wrote:

...I can believe that suitable test cases would show
2X improvement for 2 threads,

One other thing I found interesting is that their test case
showed a near 2X improvement for hyperthreading, whereas I haven't
heard of many other ways to get hyperthreading to show improvements
for PostgreSQL.

but it doesn't follow that you will get
10X improvement with 10 threads, or even 4X with 4.

Yeah - unless those 10 cores have additional I/O to
memory compared to a 1-core system (which I'd hope
would be the case, or else I'd expect many apps would
run into memory bottlenecks on such systems, no?).

#38Greg Smith
gsmith@gregsmith.com
In reply to: Gregory Stark (#36)
Re: Sorting Improvements for 8.4

On Thu, 20 Dec 2007, Gregory Stark wrote:

Surely such machines have kickass memory backplanes too though? How could it
ever be reasonable to have an i/o controller with more bandwidth than your
memory?

Dann had the right general numbers here--max of 6.4GB/s between processors
and you might coax an aggregate of double that out of the DDR RAM with 2
4-way interleaved banks of memory. Let's call it 12GB/s theoretical max.
If the theoretical max of the disks is 2GB/s, that's only a 6:1 headroom,
and with a decent cache rate it's not outrageous to imagine you could
bottleneck on memory with some things before you run out of disk
throughput.

Right now I think a lot of the disk bottlenecks are seek-limited more than
anything, but SSD will knock that one out for apps that are more about
throughput than maximum storage. I could already switch to SSD usefully
today for some of what I do that's in that category; it's just a bit too
expensive to do yet. Soon, though.

Just trying to usefully estimate where the edge of that back-of-the-envelope
calculation should be.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#39Martijn van Oosterhout
kleptog@svana.org
In reply to: Ron Mayer (#37)
Re: Sorting Improvements for 8.4

On Wed, Dec 19, 2007 at 07:17:21PM -0800, Ron Mayer wrote:

but it doesn't follow that you will get
10X improvement with 10 threads, or even 4X with 4.

Yeah - unless those 10 cores have additional I/O to
memory compared to a 1-core system (which I'd hope
would be the case, or else I'd expect many apps would
run into memory bottlenecks on such systems, no?).

I don't suppose you saw the document from Ulrich Drepper, "What Every
Programmer Should Know About Memory"? It's a fact that most machines
with multiple cores have less L2 cache per core than single-core
machines. And having multiple conduits to main memory isn't that common
at all. So having more threads sometimes *decreases* performance
because you cut the L2 cache and memory bandwidth in half.

The document is very useful for getting tips about how to work out
optimal thread/memory/datasize ratios.

The way around this is a NUMA architecture, but that's a whole
other ball of wax.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy

#40Greg Smith
gsmith@gregsmith.com
In reply to: Martijn van Oosterhout (#39)
Re: Sorting Improvements for 8.4

On Thu, 20 Dec 2007, Martijn van Oosterhout wrote:

The way around this is a NUMA architecture, but that's a whole
other ball of wax.

Quick note for those reading Ulrich's paper: he refers in a couple of
places to Intel's upcoming CSI approach to NUMA. This has now been
renamed QuickPath, and it looks like it will be late 2008 before that even
makes it to Itanium processors.

The fact that AMD has a good NUMA implementation in their Opteron lines
while Intel's Xeon processors do not is one area where AMD still has a
clear competitive lead. But you need a memory-bandwidth-starved
application before that matters more than the fact that the current
Xeons are faster in general.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#41Brian Hurt
bhurt@janestcapital.com
In reply to: Dann Corbit (#25)
Re: Sorting Improvements for 8.4

While we're blue skying things, I've had an idea for a sorting algorithm
kicking around for a couple of years that might be interesting. It's a
variation on heapsort to make it significantly more block-friendly. I
have no idea if the idea would work, or how well it'd work, but it might
be worthwhile kicking around.

Now, the core idea of heapsort is that the array is put into heap order -
basically, that a[i] >= a[2i+1] and a[i] >= a[2i+2] (doing the 0-based
array version here). The problem is that, assuming that the length of a
is larger than memory, then a[2i+1] is likely going to be on a different
page or block than a[i]. That means every time you have to bubble down
a new element, you end up reading O(log N) blocks - this is *per element*.

The variation is to instead work with blocks, so you have a block of
entries b[i], and you change the definition of heap order, so that
min(b[i]) >= max(b[2i+1]) and min(b[i]) >= max(b[2i+2]). Also, during
bubble down, you need to be careful to only change the minimum value of
one of the two child blocks b[2i+1] and b[2i+2]. Other than that, the
algorithm works as normal. The advantage of doing it this way is that
while each bubble down still takes O(log N) blocks being touched, you
get an entire block worth of results for your effort. Make your blocks
large enough (say, 1/4 the size of workmem) and you greatly reduce N,
the number of blocks you have to deal with, and get much better I/O
(when you're reading, you're reading megabytes at a shot).

Now, there are boatloads of complexities I'm glossing over here. This
is more of a sketch of the idea. But it's something to consider.
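
To make the ordering rule concrete, here's a partial C sketch of the
block-level heap invariant described above (hypothetical types; not
working code):

/* Each heap node is a whole block of sorted keys, not a single key. */
typedef struct
{
    int *keys;      /* keys within the block, kept sorted ascending */
    int  nkeys;
} Block;

static int block_min(const Block *b) { return b->keys[0]; }
static int block_max(const Block *b) { return b->keys[b->nkeys - 1]; }

/* True if block i satisfies min(b[i]) >= max(b[2i+1]) and
 * min(b[i]) >= max(b[2i+2]) against whichever children exist. */
static int
block_heap_ok(const Block *b, int nblocks, int i)
{
    int l = 2 * i + 1;
    int r = 2 * i + 2;

    if (l < nblocks && block_min(&b[i]) < block_max(&b[l]))
        return 0;
    if (r < nblocks && block_min(&b[i]) < block_max(&b[r]))
        return 0;
    return 1;
}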

Brian

#42Dann Corbit
DCorbit@connx.com
In reply to: Brian Hurt (#41)
Re: Sorting Improvements for 8.4

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
owner@postgresql.org] On Behalf Of Brian Hurt
Sent: Thursday, December 20, 2007 6:42 AM
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Sorting Improvements for 8.4

While we're blue skying things, I've had an idea for a sorting algorithm
kicking around for a couple of years that might be interesting. It's a
variation on heapsort to make it significantly more block-friendly. I
have no idea if the idea would work, or how well it'd work, but it might
be worthwhile kicking around.

Now, the core idea of heapsort is that the array is put into heap order -
basically, that a[i] >= a[2i+1] and a[i] >= a[2i+2] (doing the 0-based
array version here). The problem is that, assuming that the length of a
is larger than memory, then a[2i+1] is likely going to be on a different
page or block than a[i]. That means every time you have to bubble down
a new element, you end up reading O(log N) blocks - this is *per element*.

The variation is to instead work with blocks, so you have a block of
entries b[i], and you change the definition of heap order, so that
min(b[i]) >= max(b[2i+1]) and min(b[i]) >= max(b[2i+2]). Also, during
bubble down, you need to be careful to only change the minimum value of
one of the two child blocks b[2i+1] and b[2i+2]. Other than that, the
algorithm works as normal. The advantage of doing it this way is that
while each bubble down still takes O(log N) blocks being touched, you
get an entire block worth of results for your effort. Make your blocks
large enough (say, 1/4 the size of workmem) and you greatly reduce N,
the number of blocks you have to deal with, and get much better I/O
(when you're reading, you're reading megabytes at a shot).

Now, there are boatloads of complexities I'm glossing over here. This
is more of a sketch of the idea. But it's something to consider.

It's an interesting idea to work with a "heap of heaps" where you try to
keep each heap page-sized. It reminds me of the B+ tree, where you
collect a whole bunch of nodes into a single page.

I don't know if you have examined weak-heaps, but there are some
interesting results for weak-heap approaches. As you know, heapsort
variants do not degenerate to O(N^2).

On this link:
http://www.jea.acm.org/2002/EdelkampHeapsort/

I highly recommend all the goodies he has embedded (papers, source,
etc.)

#43Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Gregory Stark (#32)
Re: Sorting Improvements for 8.4

Gregory Stark wrote:

Note that speeding up a query from 20s to 5s isn't terribly useful.

I disagree totally with that.

That is the difference between no chance of someone waiting for a web
page to load and a good chance they'd wait. And 2s vs 0.5s is the
difference between a web site that feels responsive and one that doesn't.

If it's OLTP you can't be using all your cores for each user anyways.

Even so, I'd much rather keep each response time lower. If web page
requests are coming in at 1 a second, it's much nicer to respond to
each of them in 1 second than in 4 seconds -- even if the overall
throughput is identical.

#44Mark Mielke
mark@mark.mielke.cc
In reply to: Jeff Davis (#26)
Re: Sorting Improvements for 8.4

Jeff Davis wrote:

On Wed, 2007-12-19 at 15:51 -0500, Mark Mielke wrote:

That sounds possible, but I still feel myself suspecting that disk
reads will be much slower than localized text comparison. Perhaps I am
overestimating the performance of the comparison function?

I think this simple test will change your perceptions:

Yes - I received the same results (although my PostgreSQL doesn't have a
built-in cast ::text::bytea... :-) )

On my machine this table fits easily in memory (so there aren't any disk
reads at all). Sorting takes 7 seconds for floats, 9 seconds for binary
data, and 20 seconds for localized text. That's much longer than it
would take to read that data from disk, since it's only 70MB (which
takes a fraction of a second on my machine).

Might this mean that PostgreSQL performs too many copy operations? :-)

I think this disproves your hypothesis that sorting happens at disk
speed.

Yes.

Yep - I started to read up on it. It still sounds like it's a hard-ish
problem (to achieve near N times speedup for N CPU cores without
degrading performance for existing loads), but that doesn't mean
impossible. :-)

You don't even need multiple cores to achieve a speedup, according to
Ron's reference.

I think Ron's reference actually said that you don't need full cores to
achieve a speedup. It spoke of Intel's HT system. A single CPU with a
single execution pipeline is not going to function better with multiple
threads unless the single-thread case is written wrong. Multiple threads
are always an overall loss without hardware support. The thinking on this
is that multiple threads can sometimes lead to cleaner designs, which are
sometimes more naturally written to perform well. In my experience,
the opposite is usually true.

But, if you do have HT, and the algorithm can be modified to take
advantage of it for an overall increase in speed - great.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

#45Jeff Davis
pgsql@j-davis.com
In reply to: Gregory Stark (#32)
Re: Sorting Improvements for 8.4

On Thu, 2007-12-20 at 01:26 +0000, Gregory Stark wrote:

I suspect most of that is spent just copying the data around. Which would not
be helped by having multiple threads doing the copying -- and in fact might be
exacerbated if it required an extra copy to consolidate all the data in the
end.

The theory is that it could be helped by multiple threads, because of
the memory latency.

How long does a simple "explain analyze select * from sorter" take?

2 seconds, but the table is already in cache I'm sure (since it's so
small).

Note that speeding up a query from 20s to 5s isn't terribly useful. If it's
OLTP you can't be using all your cores for each user anyways. And if it's DSS
20s isn't a problem.

I'm not pushing for parallel sort, I'm just brainstorming. I think Ron's
idea has merit, but I realize it also has limitations.

Where parallel processing like this becomes attractive is when you're running
a 2 hour query on a machine sequentially running scheduled batch jobs which
can be sped up to 30 minutes. But in that case you're almost certainly being
limited by your disk bandwidth, not your cpu speed.

Are you sure that's always the case? My test seemed to indicate that
sorting took longer than it would to read the file from disk.

Regards,
Jeff Davis

#46Mark Mielke
mark@mark.mielke.cc
In reply to: Jeff Davis (#45)
Re: Sorting Improvements for 8.4

Jeff Davis wrote:

Where parallel processing like this becomes attractive is when you're running
a 2 hour query on a machine sequentially running scheduled batch jobs which
can be sped up to 30 minutes. But in that case you're almost certainly being
limited by your disk bandwidth, not your cpu speed.

Are you sure that's always the case? My test seemed to indicate that
sorting took longer than it would to read the file from disk.

It's probably not a relevant scenario either, as this discussion has
only been about improving the performance of the sort, and I suspect
there are very few database loads with performance characteristics
completely defined by the efficiency of the sort algorithm? :-)

So far I am getting:

1) Sort is slower than many people expect. (Jeff's test case
emphasizes this well)
2) White papers exist that document theoretical, simulated, and in
some cases actual execution where parallel sort can be beneficial.
3) White papers exist that document how parallel sort is difficult
to get right, and that characteristics of machines in use today prevent
full utilization.
4) PostgreSQL is not designed to spread a single query across
multiple execution units (whether CPUs, cores, or HT).

It's interesting discussion for me thus far.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

#47Brian Hurt
bhurt@janestcapital.com
In reply to: Brian Hurt (#41)
Re: Sorting Improvements for 8.4

Brian Hurt wrote:

While we're blue skying things, I've had an idea for a sorting
algorithm kicking around for a couple of years that might be
interesting. It's a variation on heapsort to make it significantly
more block-friendly. I have no idea if the idea would work, or how
well it'd work, but it might be worthwhile kicking around.

Now, the core idea of heapsort is that the array is put into heap
order- basically, that a[i] >= a[2i+1] and a[i] >= a[2i+2] (doing the
0-based array version here). The problem is that, assuming that the
length of a is larger than memory, then a[2i+1] is likely going to be
on a different page or block than a[i]. That means every time you
have to bubble down a new element, you end up reading O(log N) blocks-
this is *per element*.

The variation is to instead work with blocks, so you have a block of
entries b[i], and you change the definition of heap order, so that
min(b[i]) >= max(b[2i+1]) and min(b[i]) >= max(b[2i+2]). Also, during
bubble down, you need to be careful to only change the minimum value
of one of the two child blocks b[2i+1] and b[2i+2]. Other than that,
the algorithm works as normal. The advantage of doing it this way is
that while each bubble down still takes O(log N) blocks being touched,
you get an entire block worth of results for your effort. Make your
blocks large enough (say, 1/4 the size of workmem) and you greatly
reduce N, the number of blocks you have to deal with, and get much
better I/O (when you're reading, you're reading megabytes at a shot).

Now, there are boatloads of complexities I'm glossing over here. This
is more of a sketch of the idea. But it's something to consider.

Following up to myself (my apologies), but there are three advantages
to this proposal that I've since thought of:

1) The two child blocks b[2i+1] and b[2i+2] - the one with the larger
minimum element is the one we might replace. In other words, if
min(b[2i+1]) > min(b[2i+2]) and min(b[i]) < min(b[2i+1]), then we know
we're going to want the blocks b[4i+3] and b[4i+4] before we're done
with blocks b[2i+1] and b[2i+2]. The point here is that this would work
wonders with the posix_fadvise/asyncio ideas kicking around. It'd be
easy for the code to keep 2 large writes and 2 large reads going pretty
constantly (a rough sketch of such a prefetch call follows this list).

2) There is some easy parallelization available. I'm not sure how much
worth this is, but the bubble down code is fairly easy to parallelize.
If we have two bubble-downs going on in parallel, once they go down
different branches (one thread goes to block b[2i+1] while the other
goes to b[2i+2]) they no longer interact. Blocks near the root of the
heap would be contended over, and multiple threads mean smaller blocks
to keep the total memory footprint the same. Personally, I think the
asyncio idea above is more likely to be worthwhile.

3) It's possible to perform the sort lazily. You have the initial O(N)
pass over the list, but then each block is only O(log N) cost. If it's
likely that only the first part of the result is needed, then much of
the work can be avoided.
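
Purely to illustrate point (1), here is a rough sketch of hinting the
kernel about the two child blocks with posix_fadvise; the file layout and
names are assumptions for illustration, not existing code:

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>

/* Once we know which children of block i we will want next, ask the kernel
 * to start reading them.  Assumes a sequential layout in the temp file,
 * with block k starting at offset k * block_bytes. */
static void
prefetch_child_blocks(int fd, long block_bytes, long i)
{
    (void) posix_fadvise(fd, (2 * i + 1) * block_bytes, block_bytes,
                         POSIX_FADV_WILLNEED);
    (void) posix_fadvise(fd, (2 * i + 2) * block_bytes, block_bytes,
                         POSIX_FADV_WILLNEED);
}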

Brian

#48Gregory Stark
stark@enterprisedb.com
In reply to: Brian Hurt (#47)
Re: Sorting Improvements for 8.4

"Brian Hurt" <bhurt@janestcapital.com> writes:

3) It's possible to perform the sort lazily. You have the initial O(N) pass
over the list, but then each block is only O(log N) cost. If it's likely that
only the first part of the result is needed, then much of the work can be
avoided.

Now that's a *fascinating* idea. I'm having trouble coming up with a really
killer use case for it since the bounded heap sort takes care of many cases
where it would seem to apply. But it seems really promising.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

#49Joris Dobbelsteen
Joris@familiedobbelsteen.nl
In reply to: Simon Riggs (#1)
Re: Sorting Improvements for 8.4

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ron Mayer
Sent: Wednesday, 19 December 2007 19:26
To: Mark Mielke; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Sorting Improvements for 8.4

Or do you mean being able to perform parts of the query plan

fully in

parallel? If this, then one would need a lot more than

ParallelSort...

I wouldn't recommend that - it seems like a Hard Problem.

My guess is that the best way to use multiple threads in one
backend would be to find specific algorithms like sorting that
would be easier to isolate.

To give my view on this problem: if I look at a competing
(commercial) database product, they added some operations called
"parallelize" and "combine". Basically they split the data across several
threads at one point and combine them again later. This is what you
are also implementing for "parallelsort", but as a single step in the
query execution.

In my opinion your starting point is too narrow and specific, especially
since a fairly simple generalization is possible. Instead, the issue
becomes the spill-to-disk code that needs to operate in parallel (which
needs to be tackled sooner or later anyways).

If you can change the sort into three steps: parallelize, sort (multiple
parallel instances) and combine (merge) you still have the same base
case. However I believe such a thing is much easier to extend to more
operations.

Furthermore, it seems that cache is considered a major problem,
especially the varying sizes. Wouldn't a cache-oblivious algorithm, like
<http://erikdemaine.org/papers/BRICS2002/> or
<http://etd.uwaterloo.ca/etd/afarzan2004.pdf>, be a good starting point
for refinements of the sort algorithm itself?
I believe you can get more consistent performance across different
cache sizes, but it might be slower than a well-tuned quicksort.

Just my EUR 0,02...

- Joris

#50James Mansion
james@mansionfamily.plus.com
In reply to: Ron Mayer (#22)
Re: Sorting Improvements for 8.4

Ron Mayer wrote:

Or do you mean being able to perform parts of the query plan fully in
parallel? If this, then one would need a lot more than ParallelSort...

I wouldn't recommend that - it seems like a Hard Problem.

Isn't it the case that the implicit unions from processing partitioned
data provide a more-or-less-ideal opportunity here?

I certainly have sympathy for parallelising expensive queries to bring
the best response time down, even if the average under full load goes up
slightly, since any implied locks (including pinning of read-ahead pages)
will be released sooner.

And when load is light, users who are online get more of the hardware
they paid for.
James

#51Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#1)
Re: Sorting Improvements for 8.4

Added to TODO:

* Consider being smarter about memory and external files used during
sorts

http://archives.postgresql.org/pgsql-hackers/2007-11/msg01101.php
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00045.php

---------------------------------------------------------------------------

Simon Riggs wrote:

Just wanted to review a few thoughts and ideas around improving external
sorts, as recently encouraged to do by Jim Nasby.

Current issues/opportunities are these:

ISSUES

a) Memory is always in short supply, so using what we have more
effectively is going to be welcome.

b) Heap sort has a reasonably strong anti-memory effect, meaning that
there is an optimum amount of memory for any sort. This shows itself
with the CPU time increasing during run forming, making this stage of
the sort CPU bound.

c) Many sorts are performed prior to aggregation. It might be possible
to aggregate prior to writing to disk, as a way of reducing the overall
I/O cost. Benefit would occur when the total CPU cost was same no matter
when aggregation occurred; that would not apply in all cases, so we
would need to sense when benefit was possible.

d) Generally reducing the I/O cost of sorting may help the merging
stages of a sort.

SOLUTIONS

The ideas that Greg Stark, Jim Nasby, Heikki and myself have discussed
to date were the following:

1. Sort I/O Compression
2. Aggregation during Sort
3. Memory Pools
4. Dynamic Heap Management
5. Dynamic Run Handling

I've added (5) to the list as well, which hasn't yet been discussed.

1. SORT I/O COMPRESSION

This idea is not dead yet, it just needs a full set of tests to confirm
that there is benefit in all cases. If there's not benefit in all cases,
we may be able to work out which cases those are, so we know when to use
it.

2. AGGREGATION DURING SORT

Many sorts are preliminary steps before aggregation. Aggregation during
run forming would potentially reduce size of heap and reduce number of
comparisons. For many types of aggregate this would not theoretically
increase the number of ops since sum(), avg(), min(), max() are all
commutative according to their inputs. We would probably need to add
another option to Aggregate Functions to indicate the possibility of
calculating the aggregate in this way, since some aggregates might rely
on the current situation that they expect all their inputs at once in
sorted order. (Windowed aggregates are unlikely to be this way).

3. MEMORY POOLS

Solving a) could be done by sensible management and allocation of
resources. Discussed before, so not rehashed here.

4. DYNAMIC HEAP MANAGEMENT

The size of the active heap required to produce the fewest number of
runs varies as the sort progresses. For example, sorting an already
sorted input needs a trivial heap size.

Larger heap sizes simply avoid forming more runs, which is not
necessarily a bad thing. More runs only become bad things when we go
beyond our ability to perform a single final merge (see Dynamic Run
Handling below).

Smaller heap sizes reduce the number of comparisons required, plus
increase the L2+ cache efficiencies. Those two things are the cause of
the anti-memory effect.

Because of b), optimising the size of the heap could potentially be a
good thing. This can make a considerable difference for nearly sorted
data (measurements required...).

When we have M amount of memory available to us, we don't start by using
it all. We start with m memory and only increase up to M if required.
Runs are built with memory set at m. If a tuple arrives that would force
the formation of a new run we assess

i) do we care if another run is formed? Use our knowledge of the likely
amount of data coming our way, compared with number of runs formed so
far and see if we really care. If we don't care, allow the new run to be
formed and carry on with just heap size of m. (see Dynamic Run Handling
later).

ii) if we do care about number of runs, then allow the heap to grow by
increments up to the full size of M. Increments would be at least x2 and
possibly x4. That way we always have work space to rearrange the heap.

All of this dances too cleverly around the exact technique and potential
costs of rearranging the heap. That is not to be ignored and is the next
task in evaluating and accepting/dismissing this potential technique.

In combination with memory pooling this technique might also allow
memory to be better distributed to other users.

5. DYNAMIC RUN HANDLING (in Final Merge)

Another way of addressing a) is to simply make better use of memory
itself. Let's look at that in more detail:

Number of runs that can be merged at once is currently fixed, based upon
available memory. This has the underlying assumption that all runs will
be concurrently active during final merging, which may not always be
true.

If we have random data then almost all runs will overlap with all other
runs, i.e. the min and max values are sufficiently wide that the runs do
all overlap. In many cases, data arrives in somewhat sorted order, e.g.
financial data is fairly regular with some late payers but not many, and
those trail off with a fairly tight decay. In the somewhat sorted case
we find that the actual overlap is less than total, so there are many
later runs that don't overlap the earlier ones. In the best case we
would find run 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4
overlap.

This is also the point where I suggest breaking away from Knuth
completely. All of the main algorithms described by Knuth are tape
sorts. A run is written to a particular tape and then stays there until
"moved" to another tape. That means we have to get super-clever about
how runs should be written and formed (see Knuth). If we realise that
the runs aren't fixed to particular tapes they are all just independent
runs, we can radically rethink sorting. There is no need to implement
Cascade Sort, but we do need to rethink merging from the ground up. (All
of which is a relief, because Knuth et al are definitely smarter than
me, but I've got disks and lots of memory and those guys had tapes.).

If we track the min and max values for each run, when run building is
finished we will be able to build a merging plan that allows us to be
smart about the runs we should bring together. We start with the run
with the lowest min value, as well as all runs that overlap that run.
When that run is exhausted we move to the next lowest and at that point
start merging all runs that overlap that one.

This then means we may be able to begin final merging with more runs
than the current cut-off. It's possible that we could merge an infinite
number of runs in final merge with fixed memory. If we *do* need to
merge we can work out which runs should be our best pre-merge
candidates, based upon how big they are and which other runs they
overlap. (That's much better than being forced to merge tapes 2, 7 and
17 because some bizarre math says so (see Knuth).)

Anyway, claiming to have found a better way than Knuth makes me feel a
little nervous, so some searching questions on this are very welcome.

Interestingly, if we combine this technique with dynamic heap management
we may be able to allow a very large number of efficiently written runs
to form without it causing any merging.

mac_man recently noted the possibility that some runs don't overlap at
all and so can be merged for free. That's true, though doesn't actually
improve the basic idea here which is building a merge plan after runs
have been formed, with an eye on minimizing and potentially elimination
the merge phase.

There's probably some typos or thinkos above, so go easy on me Greg!
They aren't there because I want to skim over anything.

I'm not likely to get a chance to do all of this in the near future, so
documenting it now should help others to carry things forward.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +