Sorting Improvements for 8.4

Started by Simon Riggs · over 18 years ago · 51 messages · pgsql-hackers
#1 Simon Riggs
simon@2ndQuadrant.com

Just wanted to review a few thoughts and ideas around improving external
sorts, as recently encouraged to do by Jim Nasby.

Current issues/opportunities are these:

ISSUES

a) Memory is always in short supply, so using what we have more
effectively is going to be welcome.

b) Heap sort has a reasonably strong anti-memory effect, meaning that
there is an optimum amount of memory for any sort: beyond it, more
memory makes things slower. This shows itself as CPU time increasing
during run forming, making that stage of the sort CPU bound.

c) Many sorts are performed prior to aggregation. It might be possible
to aggregate prior to writing to disk, as a way of reducing the overall
I/O cost. Benefit would occur when the total CPU cost was the same no
matter when aggregation occurred; that would not apply in all cases, so
we would need to sense when benefit was possible.

d) Generally reducing the I/O cost of sorting may help the merging
stages of a sort.

SOLUTIONS

The ideas that Greg Stark, Jim Nasby, Heikki and myself have discussed
to date were the following:

1. Sort I/O Compression
2. Aggregation during Sort
3. Memory Pools
4. Dynamic Heap Management
5. Dynamic Run Handling

I've added (5) to the list as well, which hasn't yet been discussed.

1. SORT I/O COMPRESSION

This idea is not dead yet; it just needs a full set of tests to confirm
that there is benefit in all cases. If there isn't benefit in all cases,
we may be able to work out which cases those are, so we know when to use
it.
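As a toy sketch of the trade being tested here (all names and the spill
format are hypothetical; the real spill path lives in logtape.c and is
block-oriented, not serialized Python objects), compression swaps CPU for
I/O on each run written to disk:

```python
import io
import pickle
import zlib

def write_run(tuples, f, level=1):
    # Spill one sorted run, compressing the serialized block before it
    # hits disk. A fast level (1) spends little CPU; whether the I/O
    # saved pays for it is exactly what needs testing case by case.
    raw = pickle.dumps(sorted(tuples))
    packed = zlib.compress(raw, level)
    f.write(len(packed).to_bytes(8, "big"))
    f.write(packed)
    return len(raw) - len(packed)       # bytes of I/O avoided

def read_run(f):
    n = int.from_bytes(f.read(8), "big")
    return pickle.loads(zlib.decompress(f.read(n)))
```

On high-entropy keys the saving can go negative, which is why the benefit
has to be established per-case rather than assumed.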

2. AGGREGATION DURING SORT

Many sorts are preliminary steps before aggregation. Aggregation during
run forming would potentially reduce the size of the heap and the number
of comparisons. For many types of aggregate this would not theoretically
increase the number of ops, since sum(), avg(), min() and max() are all
insensitive to the order of their inputs. We would probably need to add
another option to aggregate functions to indicate the possibility of
calculating the aggregate in this way, since some aggregates might rely
on the current situation that they receive all their inputs at once, in
sorted order. (Windowed aggregates are unlikely to be this way.)
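A sketch of the run-forming side (hypothetical names; `combine` stands in
for the aggregate's transition function): equal keys are combined before
the run is spilled, so the run shrinks to at most one entry per key.

```python
def form_aggregated_run(pairs, combine):
    # Aggregate while building the in-memory run. For sum(), min(),
    # max(), count() the transition function is order-insensitive, so
    # the final answer is unchanged; avg() would carry a (sum, count)
    # pair as its state instead of a single value.
    acc = {}
    for key, val in pairs:
        acc[key] = combine(acc[key], val) if key in acc else val
    return sorted(acc.items())      # one sorted, pre-aggregated run
```

The I/O saving is largest when there are few distinct keys relative to
input rows; sensing that ratio is the "when benefit was possible" test
from issue c).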

3. MEMORY POOLS

Solving a) could be done by sensible management and allocation of
resources. Discussed before, so not rehashed here.

4. DYNAMIC HEAP MANAGEMENT

The size of the active heap required to produce the fewest number of
runs varies as the sort progresses. For example, sorting an already
sorted input needs a trivial heap size.

Larger heap sizes simply avoid forming more runs, which is not
necessarily a bad thing. More runs only become bad things when we go
beyond our ability to perform a single final merge (see Dynamic Run
Handling below).

Smaller heap sizes reduce the number of comparisons required, plus
increase the L2+ cache efficiencies. Those two things are the cause of
the anti-memory effect.
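Run forming here is replacement selection: each input tuple costs roughly
log2(heap size) comparisons, which is where both effects come from. A toy
model (hypothetical names, plain integers standing in for tuples):

```python
import heapq
from itertools import islice

def replacement_selection(inp, heap_size):
    # Form sorted runs with a heap of fixed size. Each tuple costs
    # O(log heap_size) comparisons: a bigger heap means more
    # comparisons per tuple (and worse cache locality) in exchange
    # for fewer, longer runs. Entries are tagged with a run number so
    # values too small to join the current run defer to the next one.
    it = iter(inp)
    heap = [(0, v) for v in islice(it, heap_size)]
    heapq.heapify(heap)
    runs, out, cur = [], [], 0
    while heap:
        run, v = heapq.heappop(heap)
        if run != cur:              # current run exhausted; start next
            runs.append(out)
            out, cur = [], run
        out.append(v)
        nxt = next(it, None)        # assumes input contains no None
        if nxt is not None:
            # A value below the last output can't extend this run.
            heapq.heappush(heap, (cur if nxt >= v else cur + 1, nxt))
    runs.append(out)
    return runs
```

Already-sorted input yields a single run regardless of heap size, which
is the "trivial heap size" observation above; random input yields runs
about twice the heap size.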

Because of b), optimising the size of the heap could potentially be a
good thing. This can make a considerable difference for nearly sorted
data (measurements required...).

When we have M amount of memory available to us, we don't start by using
it all. We start with m memory and only increase up to M if required.
Runs are built with memory set at m. If a tuple arrives that would force
the formation of a new run we assess

i) do we care if another run is formed? Use our knowledge of the likely
amount of data coming our way, compared with number of runs formed so
far and see if we really care. If we don't care, allow the new run to be
formed and carry on with just heap size of m. (see Dynamic Run Handling
later).

ii) if we do care about number of runs, then allow the heap to grow by
increments up to the full size of M. Increments would be at least x2 and
possibly x4. That way we always have work space to rearrange the heap.
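The decision in (i)/(ii) can be sketched as follows (hypothetical names;
`run_budget` would come from our estimate of the data still to arrive
versus what a single final merge can absorb):

```python
def heap_size_on_new_run(cur, max_mem, runs_so_far, run_budget):
    # (i) if another run is harmless, keep the small heap m and let
    #     the run form;
    # (ii) otherwise grow by at least x2 (so there is always work
    #      space to rearrange the heap), capped at M.
    if runs_so_far < run_budget:
        return cur
    return min(cur * 2, max_mem)
```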

All of this dances too cleverly around the exact technique and potential
costs of rearranging the heap. That is not to be ignored and is the next
task in evaluating and accepting/dismissing this potential technique.

In combination with memory pooling this technique might also allow
memory to be better distributed to other users.

5. DYNAMIC RUN HANDLING (in Final Merge)

Another way of addressing a) is to simply make better use of memory
itself. Let's look at that in more detail:

Number of runs that can be merged at once is currently fixed, based upon
available memory. This has the underlying assumption that all runs will
be concurrently active during final merging, which may not always be
true.

If we have random data then almost all runs will overlap with all other
runs, i.e. the min and max values are sufficiently wide that the runs do
all overlap. In many cases, data arrives in somewhat sorted order, e.g.
financial data is fairly regular with some late payers but not many, and
those trail off with a fairly tight decay. In the somewhat sorted case
we find that the actual overlap is less than total, so there are many
later runs that don't overlap the earlier ones. In the best case we
would find run 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4
overlap.

This is also the point where I suggest breaking away from Knuth
completely. All of the main algorithms described by Knuth are tape
sorts. A run is written to a particular tape and then stays there until
"moved" to another tape. That means we have to get super-clever about
how runs should be written and formed (see Knuth). If we realise that
the runs aren't fixed to particular tapes they are all just independent
runs, we can radically rethink sorting. There is no need to implement
Cascade Sort, but we do need to rethink merging from the ground up. (All
of which is a relief, because Knuth et al are definitely smarter than me,
but I've got disks and lots of memory and those guys had tapes.)

If we track the min and max values for each run, when run building is
finished we will be able to build a merging plan that allows us to be
smart about the runs we should bring together. We start with the run
with the lowest min value, as well as all runs that overlap that run.
When that run is exhausted we move to the next lowest and at that point
start merging all runs that overlap that one.
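A sketch of such a merge plan, assuming all we keep is each run's
(min, max) bounds (hypothetical names): runs in different groups never
overlap, so groups are simply concatenated in order, and only merges
within a group cost comparisons.

```python
def merge_plan(bounds):
    # bounds: one (min_key, max_key) pair per run, recorded as each
    # run was written. Sweep runs in order of min; a run whose min
    # exceeds the max seen so far starts a new, independent group.
    order = sorted(range(len(bounds)), key=lambda i: bounds[i][0])
    groups, cur, cur_max = [], [], None
    for i in order:
        lo, hi = bounds[i]
        if cur and lo > cur_max:
            groups.append(cur)
            cur, cur_max = [], None
        cur.append(i)
        cur_max = hi if cur_max is None else max(cur_max, hi)
    if cur:
        groups.append(cur)
    return groups
```

The largest group size is how many runs must be live at once; only when
it exceeds the buffer limit is any intermediate merging needed at all.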

This then means we may be able to begin final merging with more runs
than the current cut-off. It's possible that we could merge an infinite
number of runs in final merge with fixed memory. If we *do* need to
merge we can work out which runs should be our best pre-merge
candidates, based upon how big they are and which other runs they
overlap. (That's much better than being forced to merge tapes 2, 7 and
17 because some bizarre math says so (see Knuth).)

Anyway, claiming to have found a better way than Knuth makes me feel a
little nervous, so some searching questions on this are very welcome.

Interestingly, if we combine this technique with dynamic heap management
we may be able to allow a very large number of efficiently written runs
to form without it causing any merging.

mac_man recently noted the possibility that some runs don't overlap at
all and so can be merged for free. That's true, though it doesn't
actually change the basic idea here, which is building a merge plan
after runs have been formed, with an eye on minimizing and potentially
eliminating the merge phase.

There's probably some typos or thinkos above, so go easy on me Greg!
They aren't there because I want to skim over anything.

I'm not likely to get a chance to do all of this in the near future, so
documenting it now should help others to carry things forward.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#2 Sam Mason
sam@samason.me.uk
In reply to: Simon Riggs (#1)
Re: Sorting Improvements for 8.4

On Tue, Nov 27, 2007 at 06:03:46PM +0000, Simon Riggs wrote:

Just wanted to review a few thoughts and ideas around improving external
sorts, as recently encouraged to do by Jim Nasby.

Is there any way of PG knowing that having an index on a subset of the
sorted columns is sometimes a win? For example, if we have:

CREATE TABLE foo (
a INTEGER NOT NULL PRIMARY KEY,
b INTEGER NOT NULL,
c INTEGER
);

and we request:

SELECT * FROM foo ORDER BY a,b LIMIT 10;

then it may be a win to do smaller sorts for each value of "a", rather
than one big sort after all the data has been pulled out. Obviously,
it would depend on the distribution of "a", large numbers of distinct
values for "a" being good, and a small number being bad.

I think this would help in a number of other situations as well, but
that's just the most obvious case.
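A sketch of the suggested execution strategy (hypothetical names),
assuming rows already arrive ordered by "a", say from the index on the
primary key: sort each small group of equal "a" by "b" and stop as soon
as the LIMIT is satisfied, never touching the later groups.

```python
from itertools import groupby, islice

def limit_sort_by_groups(rows, limit):
    # rows: (a, b) tuples pre-ordered by a. Each group of equal a is
    # sorted by b lazily; groups past the LIMIT are never sorted.
    def gen():
        for _, grp in groupby(rows, key=lambda r: r[0]):
            yield from sorted(grp, key=lambda r: r[1])
    return list(islice(gen(), limit))
```

As noted, this wins when "a" has many distinct values (small groups) and
loses when it has few (one huge sort anyway).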

Sam

#3 Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#1)
Re: Sorting Improvements for 8.4

On Tue, 2007-11-27 at 18:03 +0000, Simon Riggs wrote:

5. DYNAMIC RUN HANDLING (in Final Merge)

Another way of addressing a) is to simply make better use of memory
itself. Let's look at that in more detail:

Number of runs that can be merged at once is currently fixed, based upon
available memory. This has the underlying assumption that all runs will
be concurrently active during final merging, which may not always be
true.

If we have random data then almost all runs will overlap with all other
runs, i.e. the min and max values are sufficiently wide that the runs do
all overlap. In many cases, data arrives in somewhat sorted order, e.g.
financial data is fairly regular with some late payers but not many, and
those trail off with a fairly tight decay. In the somewhat sorted case
we find that the actual overlap is less than total, so there are many
later runs that don't overlap the earlier ones. In the best case we
would find run 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4
overlap.

I have spoken with Len Shapiro, a professor at Portland State
University, regarding sorting before.

He suggests that PostgreSQL should implement forecasting, which is
similar to what you're describing. Forecasting does not require that
entire runs are disjoint, it works by tracking the maximum values from
the last block read from every run. This allows you to know which run
you will need more blocks from the soonest.

I'm still looking into the problem to understand it better, but the
algorithm is in Knuth Vol 3.

I can look at it in more detail, but have you already looked into this
idea? Is there a reason we don't do this currently?

Regards,
Jeff Davis

#4 Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#3)
Re: Sorting Improvements for 8.4

On Fri, 2007-11-30 at 12:07 -0800, Jeff Davis wrote:

On Tue, 2007-11-27 at 18:03 +0000, Simon Riggs wrote:

5. DYNAMIC RUN HANDLING (in Final Merge)

Another way of addressing a) is to simply make better use of memory
itself. Let's look at that in more detail:

Number of runs that can be merged at once is currently fixed, based upon
available memory. This has the underlying assumption that all runs will
be concurrently active during final merging, which may not always be
true.

If we have random data then almost all runs will overlap with all other
runs, i.e. the min and max values are sufficiently wide that the runs do
all overlap. In many cases, data arrives in somewhat sorted order, e.g.
financial data is fairly regular with some late payers but not many, and
those trail off with a fairly tight decay. In the somewhat sorted case
we find that the actual overlap is less than total, so there are many
later runs that don't overlap the earlier ones. In the best case we
would find run 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4
overlap.

I have spoken with Len Shapiro, a professor at Portland State
University, regarding sorting before.

He suggests that PostgreSQL should implement forecasting, which is
similar to what you're describing. Forecasting does not require that
entire runs are disjoint, it works by tracking the maximum values from
the last block read from every run. This allows you to know which run
you will need more blocks from the soonest.

I'm still looking into the problem to understand it better, but the
algorithm is in Knuth Vol 3.

I can look at it in more detail, but have you already looked into this
idea? Is there a reason we don't do this currently?

Interesting, I hadn't read that part.

Knuth's Algorithm F covers how to do a P-way merge using 2P + 2 buffers.
My ideas cover how to do a P-way merge when you don't have enough memory
for that many buffers.

The current sort code makes two assumptions, amongst others:

1. minimizing number of runs is always worth it
2. there is a single fixed maximum size of P, depending upon memory

I'm challenging both of those. Only runs that overlap need to be merged
simultaneously, so if the runs aren't overlapping then it's OK to allow
more runs to be formed. If it's OK to allow more runs, then reducing heap
size to allow better CPU efficiency is possible.

So Algorithm F is somewhat orthogonal to what I've proposed, though
maybe still interesting. What we do now is fairly close, but you should
look at the code in tuplesort.c and logtape.c to see how well it
matches. That might lead to an increase in the limit of the number of
concurrent runs mergeable at any one time.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#5 Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#4)
Re: Sorting Improvements for 8.4

On Mon, 2007-12-03 at 11:51 +0000, Simon Riggs wrote:

So Algorithm F is somewhat orthogonal to what I've proposed, though
maybe still interesting. What we do now is fairly close, but you should
look at the code in tuplesort.c and logtape.c to see how well it
matches. That might lead to an increase in the limit of the number of
concurrent runs mergeable at any one time.

tuplesort.c:

* When merging runs, we use a heap containing just the frontmost tuple from
* each source run; we repeatedly output the smallest tuple and insert the
* next tuple from its source tape (if any). When the heap empties, the merge
* is complete. The basic merge algorithm thus needs very little memory ---
* only M tuples for an M-way merge, and M is constrained to a small number.
* However, we can still make good use of our full workMem allocation by
* pre-reading additional tuples from each source tape. Without prereading,
* our access pattern to the temporary file would be very erratic; on average
* we'd read one block from each of M source tapes during the same time that
* we're writing M blocks to the output tape, so there is no sequentiality of
* access at all, defeating the read-ahead methods used by most Unix kernels.
* Worse, the output tape gets written into a very random sequence of blocks
* of the temp file, ensuring that things will be even worse when it comes
* time to read that tape. A straightforward merge pass thus ends up doing a
* lot of waiting for disk seeks. We can improve matters by prereading from
* each source tape sequentially, loading about workMem/M bytes from each tape
* in turn. Then we run the merge algorithm, writing but not reading until
* one of the preloaded tuple series runs out. Then we switch back to preread
* mode, fill memory again, and repeat. This approach helps to localize both
* read and write accesses.

The idea of prefetching, as I understand it, is that we don't blindly
preread workMem/M bytes from each of M tapes; instead we predict
which tapes we will need tuples from next through forecasting.

If I understand correctly, we just keep track of the maximum value of
the last block read from each run, and then always read from the run in
which the last block read has the lowest maximum.

It seems as if this would allow a variable number of runs to be merged
at once, but if the data really *is* random, we'd want it to degrade
gracefully to something resembling the current implementation.

I'm being somewhat vague here because I haven't taken the time to
really understand it. If you think this idea has potential I will look
into it in more detail.

Regards,
Jeff Davis

#6 Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#5)
Re: Sorting Improvements for 8.4

On Mon, 2007-12-03 at 10:32 -0800, Jeff Davis wrote:

If I understand correctly, we just keep track of the maximum value of
the last block read from each run, and then always read from the run in
which the last block read has the lowest maximum.

Yep, sounds like Algorithm F

It seems as if this would allow a variable number of runs to be merged
at once, but if the data really *is* random, we'd want it to degrade
gracefully to something resembling the current implementation.

If we also keep track of the endpoints of runs that we haven't yet read
from, then yes that would link my ideas with Algorithm F, so we just
have a single implementation. (F++ ?)

Probably easiest to store the endpoint tuples directly, with some sane
limits for when we have very large tuples.

You'll still need to do run-level forecasting as I had proposed to tell
whether you need to do any intermediate merging prior to the final
merge. So the two sets of ideas can't be brought together completely.

I'm being somewhat vague here because I haven't taken the time to
really understand it. If you think this idea has potential I will look
into it in more detail.

Yes, F++ sounds like it will use memory more effectively than we do
currently. That's likely to improve performance when the number of runs
approaches the limit for the size of work_mem. So this will improve
external sorts with too small memory allocations, but it won't do
anything about sorts with too large a memory allocation. That's probably
the order of importance for tackling sort performance, so that's good.

Probably best to test with
- 1M - 4M work_mem, so we see the full benefit of any improvements in
memory utilisation in a typical context
- number of runs is nearly at limit for memory
- total sort is very large, so we see real I/O issues starkly

You'll need to instrument things carefully so you can tell how many runs
are being merged at any one time and how that affects elapsed time/row.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#7 Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#6)
Re: Sorting Improvements for 8.4

"Simon Riggs" <simon@2ndquadrant.com> writes:

On Mon, 2007-12-03 at 10:32 -0800, Jeff Davis wrote:

If I understand correctly, we just keep track of the maximum value of
the last block read from each run, and then always read from the run in
which the last block read has the lowest maximum.

So it sounds like the use case where this is the biggest win would be a
situation where you have presorted input which has been sliced up. So for
example sorting by "zip code" in a table which was clustered by city. The
alphabetic order of the cities isn't correlated to the results but all the zip
codes for a city are in a contiguous block somewhere in the output.

In such a case, after doing a single pass we would have a bunch of tapes,
each of which corresponds to a single city with its zip codes completely
sorted. So the desired results would be, for
example, all the tuples from tape 17 (NYC) followed by all the tuples from
tape 3 (Buffalo) followed by all the tuples from tape 1 (Albuquerque), etc.

We currently preread an equal amount from each tape and then would empty all
the preread tuples from tape 17, refill them, preread them again, repeat until
tape 17 is empty, then move on to tape 3. All the preread tuples except
those from the currently active tape sit completely idle.

I think the way to do what you're proposing is to preread one tuple from each
tape, then when one preread bunch is emptied refill it with twice as many and
repeat. In this case you would end up with nearly all of workMem full of
tuples from NYC until you're done with NYC. That would increase the
prereading block size by a factor of 20.

So the question is just how many seeks are we doing during sorting. If we're
doing 0.1% seeks and 99.9% sequential i/o then eliminating that 0.1% entirely
(which we can't do) isn't going to speed up seeking all that much. If we're
doing 20% seeks and can get that down to 10% it might be worthwhile.

I'm not sure where the idea of keeping the current bounds of the input tapes
comes into it. We only preread when we run out of tuples anyways and then we
don't really have a choice about which tape we want to preread from. And it's
a good thing too since maintaining such a list of bounds and finding the
lowest or highest would mean maintaining a second heap which would basically
double the cpu cost of sorting.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

#8 Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#7)
Re: Sorting Improvements for 8.4

On Mon, 2007-12-03 at 20:40 +0000, Gregory Stark wrote:

I think the way to do what you're proposing is...

Don't understand that. Algorithm F covers that already, doesn't it?

So the question is just how many seeks are we doing during sorting. If we're
doing 0.1% seeks and 99.9% sequential i/o then eliminating that 0.1% entirely
(which we can't do) isn't going to speed up seeking all that much. If we're
doing 20% seeks and can get that down to 10% it might be worthwhile.

The buffer size at max tapes is an optimum - a trade off between
avoiding intermediate merging and merging efficiently. Freeing more
memory is definitely going to help in the case of low work_mem and lots
of runs.

You're right that there is a limit to the benefit you can get. I wrote a
patch in 2005/6 to optimise the memory usage when there were few runs
and lots of memory. I still think there's value in that.

I'm not sure where the idea of keeping the current bounds of the input tapes
comes into it. We only preread when we run out of tuples anyways and then we
don't really have a choice about which tape we want to preread from.

You have to decide whether to perform intermediate merges or whether you
can do everything at the final merge. Otherwise you can't merge more
runs than you have buffers for, since you'd at some point freeze up and
not be able to read more input.

And it's

a good thing too since maintaining such a list of bounds and finding the
lowest or highest would mean maintaining a second heap which would basically
double the cpu cost of sorting.

I think you're not understanding me.

You only need to record the lowest or highest when a run
completes/starts. When all runs have been written we then have a table
of the highest and lowest values for each run. We then scan that to see
whether we can perform merging in one pass, or if not what kind of
intermediate merging is required. We keep the merge plan in memory and
then follow it. So probably very small % of total sort cost, though
might save you doing intermediate merges with huge costs.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#9 Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#8)
Re: Sorting Improvements for 8.4

"Simon Riggs" <simon@2ndquadrant.com> writes:

The buffer size at max tapes is an optimum - a trade off between
avoiding intermediate merging and merging efficiently. Freeing more
memory is definitely going to help in the case of low work_mem and lots
of runs.

I can't follow these abstract arguments. That's why I tried to spell out a
concrete example.

I think you're not understanding me.

You only need to record the lowest or highest when a run
completes/starts. When all runs have been written we then have a table
of the highest and lowest values for each run. We then scan that to see
whether we can perform merging in one pass, or if not what kind of
intermediate merging is required. We keep the merge plan in memory and
then follow it. So probably very small % of total sort cost, though
might save you doing intermediate merges with huge costs.

Ok, that's a very different concept than what I was thinking.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning

#10 Jeff Davis
pgsql@j-davis.com
In reply to: Bruce Momjian (#7)
Re: Sorting Improvements for 8.4

On Mon, 2007-12-03 at 20:40 +0000, Gregory Stark wrote:

So the question is just how many seeks are we doing during sorting. If we're
doing 0.1% seeks and 99.9% sequential i/o then eliminating that 0.1% entirely
(which we can't do) isn't going to speed up seeking all that much. If we're
doing 20% seeks and can get that down to 10% it might be worthwhile.

It's not just about eliminating seeks, it's about being able to merge
more runs at one time.

If you are merging 10 runs at once, and only two of those runs overlap
and the rest are much greater values, you might be spending 99% of the
time in sequential I/O.

But the point is, we're wasting the memory holding those other 8 runs
(80% of the memory we're using), so we really could be merging a lot
more than 10 runs at once. This might eliminate stages
from the merge process.

My point is just that "how many seeks are we doing" is not the only
question. We could be doing 99% sequential I/O and still make huge wins.

In reality, of course, the runs aren't going to be disjoint completely,
but they may be partially disjoint. That's where forecasting comes in:
you preread from the tapes you will actually need tuples from soonest.

Regards,
Jeff Davis

#11 Jeff Davis
pgsql@j-davis.com
In reply to: Bruce Momjian (#7)
Re: Sorting Improvements for 8.4

On Mon, 2007-12-03 at 20:40 +0000, Gregory Stark wrote:

I'm not sure where the idea of keeping the current bounds of the input tapes
comes into it. We only preread when we run out of tuples anyways and then we
don't really have a choice about which tape we want to preread from. And it's
a good thing too since maintaining such a list of bounds and finding the
lowest or highest would mean maintaining a second heap which would basically
double the cpu cost of sorting.

You're only keeping track of the maximum value for each run, which
should be cheap to track. The only time it changes is when you're
reading more data from that run, in which case it increases.

The tradeoff that's happening right now is: we want to merge many runs
at once because it reduces the number of merge phases, but the problem
is that it increases the seeking because we read one block from one run,
then one block from another run, etc., especially if the input is
random.

If we reduce the number of runs, then we can preread more efficiently.
See:

tuplesort.c:

* as sorted runs, we can eliminate any repeated I/O at all. In the current
* code we determine the number of tapes M on the basis of workMem: we want
* workMem/M to be large enough that we read a fair amount of data each time
* we preread from a tape, so as to maintain the locality of access described
* above. Nonetheless, with large workMem we can have many tapes.

So, for workMem/M to be "large enough", M has to be small enough. But a
small M means we have to do more merge phases, which is expensive.

Forecasting improves this tradeoff. Forecasting no longer _blindly_
prereads from each tape; it uses information that it already has (the
max value of the last block read from each run) to determine the runs
from which we will need tuples the soonest.

Then, it prereads the _correct_ data.

Regards,
Jeff Davis

#12 Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Simon Riggs (#1)
Re: Sorting Improvements for 8.4

Has anyone looked into sorting algorithms that could use
more than one CPU or core at a time?

Benchmarks I see [1][2] suggest that sorting is an area that
improves greatly with multiple processors and even with
multi-threading on a single core processor.

"For 1-processor and 2-threads (1p2t), the algorithm sorts
the relation about 48% faster than the single-threaded version
with a speedup of 31% during the quicksort and 58% during the
mergesort. The dual-processor (2p2t) version provides an even
faster total speedup of 86% over the single-threaded version
with a speedup of 60% during the quicksort and 100% during
the merge sort."
[from the acm paper on link 2 below]

PS: Yeah, I know multi-threading is a hot-button on these
lists; but sorting seems a relatively isolated part of the code
and I'd wonder if it'd be isolate-able enough that multiple
CPUs could be used there.

[1]: http://www.cs.cmu.edu/~damon2005/damonpdf/4%20best%20paper%20-%20multithreaded%20architectures%20and%20the%20sort%20benchmark.pdf
[2]: http://delivery.acm.org/10.1145/1120000/1114254/DaMoN_103.pdf?key1=1114254&key2=5713023711&coll=&dl=ACM&CFID=15151515&CFTOKEN=6184618

#13 Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Ron Mayer (#12)
Re: Sorting Improvements for 8.4

Hi,

On Tuesday 18 December 2007, Ron Mayer wrote:

Has anyone looked into sorting algorithms that could use
more than one CPU or core at a time?

[...]

PS: Yeah, I know multi-threading is a hot-button on these
lists; but sorting seems a relatively isolated part of the code
and I'd wonder if it'd be isolate-able enough that multiple
CPUs could be used there.

And before that objection to multi-threading implementation and portability
concerns arise, what about using a coroutine BSD-licenced portable
implementation such as Protothreads to have backend code use several CPU at a
time?
http://www.sics.se/~adam/pt/

With such a tool, would it be possible to think about producer/consumer
parallel executions for sorting, aggregates nodes or other parts of the
executor?

Hope this helps, regards,
--
dim

#14 Simon Riggs
simon@2ndQuadrant.com
In reply to: Ron Mayer (#12)
Re: Sorting Improvements for 8.4

On Mon, 2007-12-17 at 16:34 -0800, Ron Mayer wrote:

PS: Yeah, I know multi-threading is a hot-button on these
lists; but sorting seems a relatively isolated part of the code
and I'd wonder if it'd be isolate-able enough that multiple
CPUs could be used there.

I'm not sure multi-threading is the issue you think it is. Threads are
an issue, but only for architectural reasons. Using multiple processes
to complete a task seems very sensible to me.

Yeh, sorting is isolated enough to try out some of those ideas on. I was
unaware of the work on finding medians, so that's a good way of dividing
the workloads for parallelism.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#15 Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#14)
Re: Sorting Improvements for 8.4

On Tue, 2007-12-18 at 09:31 +0000, Simon Riggs wrote:

On Mon, 2007-12-17 at 16:34 -0800, Ron Mayer wrote:

PS: Yeah, I know multi-threading is a hot-button on these
lists; but sorting seems a relatively isolated part of the code
and I'd wonder if it'd be isolate-able enough that multiple
CPUs could be used there.

I'm not sure multi-threading is the issue you think it is. Threads are
an issue, but only for architectural reasons. Using multiple processes
to complete a task seems very sensible to me.

My first thought would be that we would need a new executor node (e.g.
"ParallelSort") that would only be chosen when the cost of the sort is
large enough to outweigh other factors (such as process creation time,
dividing available work_mem, and any necessary IPC).

It seems to me the simplest way to do it would be to allow each sub
process to allocate work_mem/P where P is the degree of parallelization.
However, that somewhat works against our schemes for dynamic run
handling and forecasting, and may lead to more random reading from disk.
Any other scheme I can think of would involve more IPC, which might make
the idea just too complex.

Regards,
Jeff Davis

#16 Mark Mielke
mark@mark.mielke.cc
In reply to: Jeff Davis (#15)
Re: Sorting Improvements for 8.4

Jeff Davis wrote:

My first thought would be that we would need a new executor node (e.g.
"ParallelSort") that would only be chosen when the cost of the sort is
large enough to outweigh other factors (such as process creation time,
dividing available work_mem, and any necessary IPC).

It seems to me the simplest way to do it would be to allow each sub
process to allocate work_mem/P where P is the degree of parallelization.
However, that somewhat works against our schemes for dynamic run
handling and forecasting, and may lead to more random reading from disk.
Any other scheme I can think of would involve more IPC, which might make
the idea just too complex.

I am curious - what algorithms exist to efficiently do a parallel sort?
Do you mean if sorting 1 million items, it is possible to separate this
into 2 sets of 500 thousand each, execute them in separate threads
(with task administration and synchronization overhead), combine the
results, and complete the task in significantly less time than doing it
in one thread? I am skeptical that this is possible, and suspect that
the overall efficiency of the system would go down even if the
throughput of a single execution increases.
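The split-and-merge scheme in question is essentially a parallel merge sort. A minimal single-process sketch of the data flow (the per-chunk sorts are where separate threads or processes would do their work; the names are invented for illustration):

```python
import heapq

def parallel_sort_sketch(items, workers=2):
    # Split the input into `workers` chunks; in a real implementation
    # each chunk would be sorted by a separate worker in parallel.
    n = len(items)
    chunk = max(1, (n + workers - 1) // workers)
    runs = [sorted(items[i:i + chunk]) for i in range(0, n, chunk)]
    # Combine step: a k-way merge of the sorted runs, costing
    # O(n log k) comparisons and inherently sequential.
    return list(heapq.merge(*runs))
```

The merge step is the part that cannot be divided as easily, which is one reason the overall speedup tends to be less than linear.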

Or do you mean being able to perform parts of the query plan fully in
parallel? If this, then one would need a lot more than ParallelSort...

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

#17Michał Zaborowski
michal.zaborowski@gmail.com
In reply to: Mark Mielke (#16)
Re: Sorting Improvements for 8.4

2007/12/19, Mark Mielke <mark@mark.mielke.cc>:

Jeff Davis wrote:

My first thought would be that we would need a new executor node (e.g.
"ParallelSort") that would only be chosen when the cost of the sort is
large enough to outweigh other factors (such as process creation time,
dividing available work_mem, and any necessary IPC).

It seems to me the simplest way to do it would be to allow each sub
process to allocate work_mem/P where P is the degree of parallelization.
However, that somewhat works against our schemes for dynamic run
handling and forecasting, and may lead to more random reading from disk.
Any other scheme I can think of would involve more IPC, which might make
the idea just too complex.

I am curious - what algorithms exist to efficiently do a parallel sort?
Do you mean if sorting 1 million items, it is possible to separate this
into 2 sets of 500 thousand each, execute them in separate threads
(with task administration and synchronization overhead), combine the
results, and complete the task in significantly less time than doing it
in one thread? I am skeptical that this is possible, and suspect that
the overall efficiency of the system would go down even if the
throughput of a single execution increases.

OK - we want to sort a table with quicksort, and we want to do it on N threads.
Each thread gets the begin and end indices of a range of the table; the first
step starts at 0 and ends at count - 1. A single step: pick a pivot value and
move lower values before it and bigger ones after it. In the normal case we
use recursive calls, so the same procedure is applied to the two resulting
parts. With threads, we can instead put the index ranges on a side list and
use a pool of threads to pick up work from that list. The threads can share a
common table; only access to the side list of indices has to be serialized.
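The scheme above (recursion replaced by a shared side list of index ranges) can be sketched as follows. This version is single-threaded and the names are invented for illustration; in the threaded variant, N workers would pop ranges from the work list, with access to the list serialized by a lock:

```python
from collections import deque

def _partition(a, lo, hi):
    # Move values <= pivot before it and larger values after it
    # (Lomuto partition, pivot taken from the end of the range).
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] <= pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def quicksort_worklist(a):
    # The side list of (lo, hi) index ranges; this is what worker
    # threads would share, and what access must be serialized on.
    work = deque([(0, len(a) - 1)])
    while work:
        lo, hi = work.pop()
        if lo >= hi:
            continue
        p = _partition(a, lo, hi)
        work.append((lo, p - 1))
        work.append((p + 1, hi))
    return a
```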

Or do you mean being able to perform parts of the query plan fully in
parallel? If this, then one would need a lot more than ParallelSort...

Nice to have, but mostly for data warehouses. In other cases, e.g. a backend
for the Internet, there are many concurrent requests and every processor/core
is already kept busy.

--
Regards,
Michał Zaborowski (TeXXaS)

#18Andreas Joseph Krogh
andreak@officenet.no
In reply to: Dimitri Fontaine (#13)
Re: Sorting Improvements for 8.4

On Tuesday 18 December 2007 10:03:25 Dimitri Fontaine wrote:

Hi,

On Tuesday, December 18, 2007, Ron Mayer wrote:

Has anyone looked into sorting algorithms that could use
more than one CPU or core at a time?

[...]

PS: Yeah, I know multi-threading is a hot-button on these
lists; but sorting seems a relatively isolated part of the code
and I'd wonder if it'd be isolate-able enough that multiple
CPUs could be used there.

And before objections to multi-threading arise over implementation and
portability concerns, what about using a portable, BSD-licensed coroutine
implementation such as Protothreads to let backend code use several CPUs at
a time?
http://www.sics.se/~adam/pt/

With such a tool, would it be possible to think about producer/consumer
parallel executions for sorting, aggregates nodes or other parts of the
executor?

Hope this helps, regards,

And remember: users don't care about portability issues, they care about
performance. If multi-threading is a way to speed up sorting considerably, it
should, IMHO, be considered seriously.

--
Andreas Joseph Krogh

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andreas Joseph Krogh (#18)
Re: Sorting Improvements for 8.4

Andreas Joseph Krogh <andreak@officenet.no> writes:

And remember; Users don't care about portability-issues, they care about
performance.

Nonsense. If Postgres stops working on their machine, they'll care.

regards, tom lane

#20Mark Mielke
mark@mark.mielke.cc
In reply to: Michał Zaborowski (#17)
Re: Sorting Improvements for 8.4

Michał Zaborowski wrote:

OK - we want to sort a table with quicksort, and we want to do it on N threads.
Each thread gets the begin and end indices of a range of the table; the first
step starts at 0 and ends at count - 1. A single step: pick a pivot value and
move lower values before it and bigger ones after it. In the normal case we
use recursive calls, so the same procedure is applied to the two resulting
parts. With threads, we can instead put the index ranges on a side list and
use a pool of threads to pick up work from that list. The threads can share a
common table; only access to the side list of indices has to be serialized.

Stupid question #2: Is it well recognized that the CPU is the bottleneck
in the PostgreSQL sorting mechanism? Or might it be memory bandwidth and
I/O?

It would seem to me that any sort worth parallelizing (given the
administrative and synchronization overhead) must have data larger than the
L2 cache. If larger than the L2 cache, it runs at real memory speed. And at
real memory speed, wouldn't one CPU, without hardware synchronization, be
able to fill the memory read/write pipe on its own? If we 'divide and
conquer' to parallelize, wouldn't the values written by one thread often
(1/N of the time) need to be read by another thread, requiring hardware
data synchronization?

I see the wikipedia.org page describes how easy it is to parallelize
quick sort, and scale performance linearly with the number of
processors, but I don't see references to back this claim.
At least some of these steps seem difficult or impractical to
parallelize. For example, the initial partition reorder that moves items
lower than the pivot to the left, and items higher than the pivot to the
right, would not be easy to parallelize using an in-place reorder. It
needs to complete one partitioning pass before it can 'divide and conquer'.
They say no synchronization is required, but I think they are missing
the hardware synchronization required (especially in the innermost
loops, where each thread's task becomes shorter and starts to fit in
L1/L2). They say linear, but then talk about a 'new thread being
created'. New thread creation has a cost, and if that is reduced by using
a thread pool, then synchronization *is* required.

It sounds like a 'nice in theory' idea. :-) Which doesn't mean it is
wrong...

I am curious enough to write a test...

Or do you mean being able to perform parts of the query plan fully in
parallel? If this, then one would need a lot more than ParallelSort..

Nice to have, but mostly for data warehouses. In other cases, e.g. a backend
for the Internet, there are many concurrent requests and every processor/core
is already kept busy.

I'm a fan of the 'each plan item is a task, that is assigned to the
pool, with each CPU grabbing tasks from the pool'. Another 'nice in
theory' idea (used by DB2?). As it is, though, I think PostgreSQL
planning is heavily designed to maximize performance on a single CPU,
and single queries would not easily scale to multiple CPUs. (Perhaps
hashing could be done on another CPU, or as you describe above, sorting)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

#21Jeff Davis
pgsql@j-davis.com
In reply to: Mark Mielke (#20)
#22Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Mark Mielke (#16)
#23Mark Mielke
mark@mark.mielke.cc
In reply to: Ron Mayer (#22)
#24Mark Mielke
mark@mark.mielke.cc
In reply to: Jeff Davis (#21)
#25Dann Corbit
DCorbit@connx.com
In reply to: Mark Mielke (#24)
#26Jeff Davis
pgsql@j-davis.com
In reply to: Mark Mielke (#24)
#27Jeff Davis
pgsql@j-davis.com
In reply to: Dann Corbit (#25)
#28Dann Corbit
DCorbit@connx.com
In reply to: Jeff Davis (#27)
#29Dann Corbit
DCorbit@connx.com
In reply to: Dann Corbit (#28)
#30Jeff Davis
pgsql@j-davis.com
In reply to: Dann Corbit (#28)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Mielke (#24)
#32Bruce Momjian
bruce@momjian.us
In reply to: Jeff Davis (#26)
#33Dann Corbit
DCorbit@connx.com
In reply to: Bruce Momjian (#32)
#34Greg Smith
gsmith@gregsmith.com
In reply to: Dann Corbit (#33)
#35Bruce Momjian
bruce@momjian.us
In reply to: Dann Corbit (#33)
#36Bruce Momjian
bruce@momjian.us
In reply to: Greg Smith (#34)
#37Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Tom Lane (#31)
#38Greg Smith
gsmith@gregsmith.com
In reply to: Bruce Momjian (#36)
#39Martijn van Oosterhout
kleptog@svana.org
In reply to: Ron Mayer (#37)
#40Greg Smith
gsmith@gregsmith.com
In reply to: Martijn van Oosterhout (#39)
#41Brian Hurt
bhurt@janestcapital.com
In reply to: Dann Corbit (#25)
#42Dann Corbit
DCorbit@connx.com
In reply to: Brian Hurt (#41)
#43Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Bruce Momjian (#32)
#44Mark Mielke
mark@mark.mielke.cc
In reply to: Jeff Davis (#26)
#45Jeff Davis
pgsql@j-davis.com
In reply to: Bruce Momjian (#32)
#46Mark Mielke
mark@mark.mielke.cc
In reply to: Jeff Davis (#45)
#47Brian Hurt
bhurt@janestcapital.com
In reply to: Brian Hurt (#41)
#48Bruce Momjian
bruce@momjian.us
In reply to: Brian Hurt (#47)
#49Joris Dobbelsteen
Joris@familiedobbelsteen.nl
In reply to: Simon Riggs (#1)
#50James Mansion
james@mansionfamily.plus.com
In reply to: Ron Mayer (#22)
#51Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#1)