Merge algorithms for large numbers of "tapes"
BTW, I was just looking over Knuth's discussion of sorting again, and
realized that there is still something more that could be done within
the existing sort framework. We currently use standard polyphase merge
(his Algorithm 5.4.2D), which IIRC I chose because it was simple and
for relatively small numbers of tapes T it was about as good as anything
else. Knuth spends a great deal of energy on minimizing tape rewind
time which of course is of no interest to us, and I had supposed that
all of his more-complex algorithms were really only of interest if you
needed to consider rewind time. However, now that we've changed the
code to prefer large numbers of tapes, it's not at all clear that
Algorithm D is still the right one to use. In particular I'm looking at
cascade merge, Algorithm 5.4.3C, which appears to use significantly
fewer passes when T is large. Do you want to try that?
regards, tom lane
On 3/7/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
BTW, I was just looking over Knuth's discussion of sorting again, and
realized that there is still something more that could be done within
the existing sort framework. We currently use standard polyphase merge
(his Algorithm 5.4.2D), which IIRC I chose because it was simple and
for relatively small numbers of tapes T it was about as good as anything
else. Knuth spends a great deal of energy on minimizing tape rewind
time which of course is of no interest to us, and I had supposed that
all of his more-complex algorithms were really only of interest if you
needed to consider rewind time. However, now that we've changed the
code to prefer large numbers of tapes, it's not at all clear that
Algorithm D is still the right one to use. In particular I'm looking at
cascade merge, Algorithm 5.4.3C, which appears to use significantly
fewer passes when T is large. Do you want to try that?
I haven't personally played with this algorithm but having spent the last 15
minutes reading it over, it does sound like an interesting idea for trial.
At first glance it didn't seem much better than polyphase for our case, but
after reading the entire algorithm, discussion, and thinking it over for a
couple minutes, I could see it as potentially better.
Guess we won't really know 'til it can be tested :)
--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
Use a priority queue for the sorted sub-lists. When the key-object
extracted from the head of the smallest queue exceeds the key-object
from the head of the second queue, adjust the priority of the smallest
queue within the list of queues.
It uses a total of 2 read/write passes over the data, no matter how many
subfiles you have. It is far faster than any other kind of external
merge when you have lots of subfiles.
I posted some message to the list on this subject before, and gave a
pointer to sample code that demonstrates the concept.
If you have one million sub-files, it still only takes a total of two
read-write passes.
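The priority-queue merge described above can be sketched in a few lines of Python. This is a minimal illustration, not the poster's actual code: in-memory lists stand in for the on-disk sub-files, and re-pushing a run keyed on its new head element plays the role of "adjusting the priority of the smallest queue".

```python
import heapq

def merge_runs(runs):
    # One (head_key, run_index, position) entry per non-empty run;
    # the heap keeps the run with the smallest head key on top.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        key, i, pos = heapq.heappop(heap)
        out.append(key)
        if pos + 1 < len(runs[i]):
            # "Adjust the priority" of this run: reinsert it,
            # keyed on its new head element.
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return out

print(merge_runs([[1, 4, 7], [2, 5], [0, 3, 6]]))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

However many runs there are, every element is read from its run and written to the output exactly once, which is why the number of sub-files does not add passes.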
________________________________
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Jonah H. Harris
Sent: Tuesday, March 07, 2006 5:28 PM
To: Tom Lane
Cc: Simon Riggs; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Merge algorithms for large numbers of "tapes"
Tom,
fewer passes when T is large. Do you want to try that?
Two passes is the state-of-the-practice on external disk sorts.
If we're looking to replace the tape sort approach, I would hope for a two
pass approach, with the merge pass avoided in the case of unidirectional
access.
- Luke
"Luke Lonergan" <llonergan@greenplum.com> writes:
Two passes is the state-of-the-practice on external disk sorts.
There is no such thing as a fixed number of passes regardless of
available memory and size of the data.
regards, tom lane
Yes, all of the current best-practice external sorts use two passes. A
first pass to produce the runs, which results in "S" number of "files", then a
single merge pass across the runs. At most 1 pass across the S runs is
required to implement the merge.
- Luke
On 3/7/06 8:03 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
"Luke Lonergan" <llonergan@greenplum.com> writes:
Two passes is the state-of-the-practice on external disk sorts.
There is no such thing as a fixed number of passes regardless of
available memory and size of the data.
regards, tom lane
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
On 3/7/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
However, now that we've changed the code to prefer large numbers of tapes,
it's not at all clear that Algorithm D is still the right one to use. In
particular I'm looking at cascade merge, Algorithm 5.4.3C, which appears
to use significantly fewer passes when T is large. Do you want to try
that?
Guess we won't really know 'til it can be tested :)
It would also be interesting to allow multiple temporary areas to be declared
and try to spread tape files across the temporary areas. Ideally keeping input
and output tapes on separate drives.
--
greg
Tom,
On 3/7/06 8:03 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
"Luke Lonergan" <llonergan@greenplum.com> writes:
Two passes is the state-of-the-practice on external disk sorts.
There is no such thing as a fixed number of passes regardless of
available memory and size of the data.
While technically correct, practice shows that two passes have been the norm
for the vast majority of cases since 1986:
http://research.microsoft.com/~Gray/papers/TandemTR86.3_FastSort.doc
Square root of the sort set is the memory requirement for a two pass sort.
10M for 10GB of sort set, for instance.
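The square-root rule can be sanity-checked with a one-liner. This is a sketch of the classic rule of thumb (one pass covers roughly M^2/B bytes, so two-pass memory is on the order of sqrt(data * block)); the ~10KB transfer unit is an assumed parameter, not something stated above:

```python
import math

def two_pass_memory(data_bytes, block_bytes=10 * 1024):
    # One merge pass covers about M^2 / B bytes, so a two-pass sort
    # needs memory M on the order of sqrt(data * block).
    return math.sqrt(data_bytes * block_bytes)

# ~10 MB of memory for a 10 GB sort set, matching the example above
print(two_pass_memory(10 * 1024**3) / 1024**2)  # -> 10.0
```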
Given the current resource management limitations we live with, you are
correct about multi-pass being necessary, but we should be aware that modern
commercial databases allocate enough memory to provide two-pass external
sorts.
- Luke
On Tue, 2006-03-07 at 18:14 -0500, Tom Lane wrote:
BTW, I was just looking over Knuth's discussion of sorting again, and
realized that there is still something more that could be done within
the existing sort framework. We currently use standard polyphase merge
(his Algorithm 5.4.2D), which IIRC I chose because it was simple and
for relatively small numbers of tapes T it was about as good as anything
else. Knuth spends a great deal of energy on minimizing tape rewind
time which of course is of no interest to us, and I had supposed that
all of his more-complex algorithms were really only of interest if you
needed to consider rewind time. However, now that we've changed the
code to prefer large numbers of tapes, it's not at all clear that
Algorithm D is still the right one to use. In particular I'm looking at
cascade merge, Algorithm 5.4.3C, which appears to use significantly
fewer passes when T is large.
Ah! Well spotted. Yeh, looks like it will improve performance a good
deal. So, yes, definitely a TODO item.
Do you want to try that?
The Cascade Merge re-writes the way logical tapes are selected and how
the runs are merged. It doesn't seem to do anything at all about the
run-forming, which would still use heapsort. So the only effect is when
we have more runs than "tapes"; the limits at which we would begin
noticing any benefit would be:
work_mem = 1 GB     benefit at 8 TB
work_mem = 256 MB   benefit at 0.5 TB
work_mem = 8 MB     benefit at 256 MB
work_mem = 1 MB     benefit at 12 MB (min 7 tapes).
(based upon runs on average twice size of memory, and each logical tape
requiring 256KB memory, i.e. min(work_mem/4, 6) * work_mem * 2, which
for work_mem > 2 MB gives 0.5 * work_mem^2)
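One possible reading of this arithmetic, as a rough sketch (the 256KB-per-tape figure is from the text above; the function name, units, and rounding are interpretation): with tapes = work_mem / 256KB and runs averaging twice work_mem, a single merge pass covers about tapes * 2 * work_mem of data. This reproduces the larger entries in the table; the smallest entries also fold in the 7-tape minimum and come out slightly differently.

```python
def benefit_threshold_mb(work_mem_mb):
    # Each logical tape costs ~256KB of memory, so work_mem buys this
    # many tapes (never fewer than 7).
    tapes = max(work_mem_mb * 1024 // 256, 7)
    # Runs average ~2 * work_mem; one merge pass handles 'tapes' runs.
    return tapes * 2 * work_mem_mb  # threshold in MB

print(benefit_threshold_mb(1024) / 1024**2)  # 1 GB work_mem -> 8.0 (TB)
print(benefit_threshold_mb(256) / 1024**2)   # 256 MB work_mem -> 0.5 (TB)
```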
Which means the benefit we get is when we have for some reason been
unable to give the sort enough space, or not set parameters correctly.
So, still a concern...but makes me think about 2 other issues first:
1. Earlier we had some results that showed that the heapsorts got slower
when work_mem was higher and that concerns me most of all right now.
It's possible you'll have reduced that considerably with the
pull-out-the-first-attr patch. I'll look into some test results to show
that has gone away. We also have Nyberg et al telling us that as of 1994
they established that heapsort would always be slower than qsort, as a
result of CPU cache locality improvements. An improvement here would
affect all sorts > work_mem.
2. Improvement in the way we do overall memory allocation, so we would
not have the problem of undersetting work_mem that we currently
experience. If we solved this problem we would have faster sorts in
*all* cases, not just extremely large ones. Dynamically setting work_mem
higher when possible would be very useful. I've looked at this a few
times and have some suggestions, but perhaps its worth asking for ideas
in this area?
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
1. Earlier we had some results that showed that the heapsorts got slower
when work_mem was higher and that concerns me most of all right now.
Fair enough, but that's completely independent of the merge algorithm.
(I don't think the Nyberg results necessarily apply to our situation
anyway, as we are not sorting arrays of integers, and hence the cache
effects are far weaker for us. I don't mind trying alternate sort
algorithms, but I'm not going to believe an improvement in advance of
direct evidence in our own environment.)
2. Improvement in the way we do overall memory allocation, so we would
not have the problem of undersetting work_mem that we currently
experience. If we solved this problem we would have faster sorts in
*all* cases, not just extremely large ones. Dynamically setting work_mem
higher when possible would be very useful.
I think this would be extremely dangerous, as it would encourage
processes to take more than their fair share of available resources.
Also, to the extent that you believe the problem is insufficient L2
cache, it seems increasing work_mem to many times the size of L2 will
always be counterproductive. (Certainly there is no value in increasing
work_mem until we are in a regime where it consistently improves
performance significantly, which it seems we aren't yet.)
regards, tom lane
Tom,
On 3/8/06 7:21 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
1. Earlier we had some results that showed that the heapsorts got slower
when work_mem was higher and that concerns me most of all right now.
Fair enough, but that's completely independent of the merge algorithm.
(I don't think the Nyberg results necessarily apply to our situation
anyway, as we are not sorting arrays of integers, and hence the cache
effects are far weaker for us. I don't mind trying alternate sort
Even with the indirection, we should investigate alternative approaches that
others have demonstrated to be superior WRT L2 cache use.
A major commercial database currently performs external sorts of various
fields 4 times faster, and commonly uses more than 256MB of sort memory in
one example case to do it.
I think this would be extremely dangerous, as it would encourage
processes to take more than their fair share of available resources.
I agree - in fact, we currently have no structured concept of "fair share of
available resources", nor a way to share them.
I think the answer to this should involve the use of statement queuing and
resource queues.
Also, to the extent that you believe the problem is insufficient L2
cache, it seems increasing work_mem to many times the size of L2 will
always be counterproductive. (Certainly there is no value in increasing
work_mem until we are in a regime where it consistently improves
performance significantly, which it seems we aren't yet.)
Not if you use cache blocking, the optimization that operates on memory
one L2-cache-sized block at a time.
- Luke
On Wed, Mar 08, 2006 at 07:28:16AM -0800, Luke Lonergan wrote:
I think this would be extremely dangerous, as it would encourage
processes to take more than their fair share of available resources.
I agree - in fact, we currently have no structured concept of "fair share of
available resources", nor a way to share them.
A concept it would be great to add at some point, both for memory and
IO. But that's another discussion entirely.
I think the answer to this should involve the use of statement queuing and
resource queues.
Something else to consider is reducing the amount of memory used when we
have to fail to a tape sort, because at that point we'll be
substantially slower. So, for example, allow in-memory sorts to use up
to 1GB, because it shouldn't take a long period of time to read that
data in, and the sort will then be extremely fast. That means that the
sort would be using that amount of memory for a short period of time. If
we do have to fail to disk, cut back to 128MB, because having 8x that
certainly won't make the sort run anywhere close to 8x faster. The trick
would be releasing memory from a sort that we thought could fit in memory but
couldn't. It would also be good to start estimating which sorts should
fit in memory and which won't before we start (AFAIK the current code
assumes we'll fit in memory until it runs out).
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
If we do have to fail to disk, cut back to 128MB, because having 8x that
certainly won't make the sort run anywhere close to 8x faster.
Not sure that follows. In particular, the entire point of the recent
changes has been to extend the range in which we can use a single merge
pass --- that is, write the data once as N sorted runs, then merge them
in a single read pass. As soon as you have to do an actual merge-back-
to-disk pass, your total I/O volume doubles, so there is definitely a
considerable gain if that can be avoided. And a larger work_mem
translates directly to fewer/longer sorted runs.
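The I/O arithmetic here can be sketched with a toy model (an illustration of the point, not anything from the actual code): run formation writes the data once and the final on-the-fly merge reads it once, and every intermediate merge-back-to-disk pass adds a full read plus a full write of everything.

```python
def io_volume(data_gb, merge_back_passes):
    # Write the data once as sorted runs, read it once in the final
    # on-the-fly merge; each merge-back-to-disk pass adds a full
    # read + write of the whole data set.
    return 2 * data_gb * (1 + merge_back_passes)

print(io_volume(100, 0))  # single merge pass: 200 GB of temp I/O
print(io_volume(100, 1))  # one merge-back pass: 400 GB -- volume doubles
```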
regards, tom lane
On Wed, 2006-03-08 at 10:21 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
1. Earlier we had some results that showed that the heapsorts got slower
when work_mem was higher and that concerns me most of all right now.
Fair enough, but that's completely independent of the merge algorithm.
(I don't think the Nyberg results necessarily apply to our situation
anyway, as we are not sorting arrays of integers, and hence the cache
effects are far weaker for us. I don't mind trying alternate sort
algorithms, but I'm not going to believe an improvement in advance of
direct evidence in our own environment.)
Of course, this would be prototyped first...and I agree about possible
variability of those results for us.
2. Improvement in the way we do overall memory allocation, so we would
not have the problem of undersetting work_mem that we currently
experience. If we solved this problem we would have faster sorts in
*all* cases, not just extremely large ones. Dynamically setting work_mem
higher when possible would be very useful.
I think this would be extremely dangerous, as it would encourage
processes to take more than their fair share of available resources.
Fair share is the objective. I was trying to describe the general case
so we could discuss a solution that would allow a dynamic approach
rather than the static one we have now.
Want to handle these cases: "How much to allocate, when..."
A. we have predicted number of users
B. we have a busy system - more than predicted number of users
C. we have a quiet system - less than predicted number of users
In B/C we have to be careful that we don't under/overallocate resources
only to find the situation changes immediately afterwards.
In many cases the static allocation is actually essential since you may
be more interested in guaranteeing a conservative run time rather than
seeking to produce occasional/unpredictable bursts of speed. But in many
cases people want to have certain tasks go faster when it's quiet and go
slower when it's not.
Also, to the extent that you believe the problem is insufficient L2
cache, it seems increasing work_mem to many times the size of L2 will
always be counterproductive.
Sorry to confuse: (1) and (2) were completely separate, so no intended
interaction between L2 cache and memory.
(Certainly there is no value in increasing
work_mem until we are in a regime where it consistently improves
performance significantly, which it seems we aren't yet.)
Very much agreed.
Best Regards, Simon Riggs
On Wed, Mar 08, 2006 at 11:20:50AM -0500, Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
If we do have to fail to disk, cut back to 128MB, because having 8x that
certainly won't make the sort run anywhere close to 8x faster.
Not sure that follows. In particular, the entire point of the recent
changes has been to extend the range in which we can use a single merge
pass --- that is, write the data once as N sorted runs, then merge them
in a single read pass. As soon as you have to do an actual merge-back-
to-disk pass, your total I/O volume doubles, so there is definitely a
considerable gain if that can be avoided. And a larger work_mem
translates directly to fewer/longer sorted runs.
But do fewer/longer sorted runs translate into not merging back to disk?
I thought that was controlled by if we had to be able to rewind the
result set.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Jim,
On 3/8/06 9:49 AM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:
On Wed, Mar 08, 2006 at 11:20:50AM -0500, Tom Lane wrote:
Not sure that follows. In particular, the entire point of the recent
changes has been to extend the range in which we can use a single merge
pass --- that is, write the data once as N sorted runs, then merge them
in a single read pass. As soon as you have to do an actual merge-back-
to-disk pass, your total I/O volume doubles, so there is definitely a
considerable gain if that can be avoided. And a larger work_mem
translates directly to fewer/longer sorted runs.
But do fewer/longer sorted runs translate into not merging back to disk?
I thought that was controlled by if we had to be able to rewind the
result set.
In the *tape* algorithm, there is an intermediate abstraction in the merging
called tapes (!) that are used to store intermediate merge results. Simon's
work implemented more tapes, which asymptotically approaches a single merge
pass as the number of tapes approaches the number of runs.
The Replacement Selection algorithm generally will produce about 1/2 the
number of runs that a simpler partial sort algorithm would, and the more
memory it uses, the fewer runs there are, and with fewer runs, fewer tapes
are required to avoid more passes on the merge.
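Replacement selection itself can be sketched in a few lines. A minimal Python model (the "memory" is just a bounded heap; real run formation would spill runs to temp files): a key smaller than the one just output cannot join the current run, so it is tagged for the next run, and tagging heap entries with the run number drains run r completely before run r+1 surfaces.

```python
import heapq
from itertools import islice

def replacement_selection(source, memory_slots):
    """Form sorted runs from 'source' while holding only 'memory_slots'
    keys in memory. With random input, runs average ~2x the memory size."""
    it = iter(source)
    # Heap entries are (run_number, key); run r drains fully before
    # any key deferred to run r + 1 can reach the top of the heap.
    heap = [(0, key) for key in islice(it, memory_slots)]
    heapq.heapify(heap)
    runs, current = [], []
    while heap:
        run, key = heapq.heappop(heap)
        if run > len(runs):          # first key of the next run
            runs.append(current)
            current = []
        current.append(key)
        for nxt in islice(it, 1):    # refill the freed slot, if input remains
            # A key smaller than the one just written out cannot join
            # the current run, so defer it to the next one.
            heapq.heappush(heap, (run, nxt) if nxt >= key else (run + 1, nxt))
    if current:
        runs.append(current)
    return runs

print(replacement_selection([5, 1, 4, 2, 3], 2))  # -> [[1, 4, 5], [2, 3]]
```

Note how the first run [1, 4, 5] is longer than the 2-slot memory, which is the whole point of the algorithm.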
This whole tape abstraction is something that I believe is unique to
Postgres among modern databases, and we have found that by removing it
entirely along with logtape.c, we remove 2000 lines of useless code that
only complicates our optimization problem.
- Luke
"Jim C. Nasby" <jnasby@pervasive.com> writes:
But do fewer/longer sorted runs translate into not merging back to disk?
I thought that was controlled by if we had to be able to rewind the
result set.
A plain SELECT ... ORDER BY doesn't assume that anymore. It is still
required for some cases such as the input to a merge join, but the
on-the-fly-final-merge code is going to be used a lot more in 8.2 than
it was before.
regards, tom lane
I do not clearly understand the sorting code in PostgreSQL. If I did
have a good grasp of it, I would take a go at improving it.
Here are some suggestions of things that I know work really, really
well:
#1. Two pass merge (none of that silly poly-tape merge goo)
#2. Load ONLY the keys that are to be sorted into memory. Use a
pointer exchange sort, and do not move the physical rows of data at all.
I am pretty sure from this thread that PostgreSQL is not doing #1, and I
have no idea if it is doing #2.
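Suggestion #2 amounts to sorting a small array of row pointers (or indices) by key, leaving the wide physical rows where they are. A minimal sketch (the sample rows are made up for illustration):

```python
rows = [("carol", 30), ("alice", 10), ("bob", 20)]  # wide "physical rows"

# Sort an array of indices by the key column; the rows themselves
# never move -- only the small index array is exchanged during the sort.
order = sorted(range(len(rows)), key=lambda i: rows[i][0])

print(order)                      # -> [1, 2, 0]
print([rows[i] for i in order])   # rows read back in key order
```

Exchanging 4- or 8-byte pointers instead of whole rows is what makes this cheap when rows are wide.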
A useful trick:
Since merge is mentioned, I should say something else about merge joins.
If you do not have room to load the sorted keys for bsearch, load every
kth key (where k is computed by sizeof merge_ram / sizeof key_data).
Then, when you have found the block the thing you are looking for by the
"kth key bsearch", bsearch that block.
Now, maybe PostgreSQL already uses tricks better than these. I don't
know. But if they prove helpful suggestions I will be glad of it.
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
owner@postgresql.org] On Behalf Of Tom Lane
Sent: Wednesday, March 08, 2006 12:32 PM
To: Jim C. Nasby
Cc: Luke Lonergan; Simon Riggs; pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] Merge algorithms for large numbers of "tapes"
"Jim C. Nasby" <jnasby@pervasive.com> writes:
But do fewer/longer sorted runs translate into not merging back to
disk?
I thought that was controlled by if we had to be able to rewind the
result set.
A plain SELECT ... ORDER BY doesn't assume that anymore. It is still
required for some cases such as the input to a merge join, but the
on-the-fly-final-merge code is going to be used a lot more in 8.2 than
it was before.
regards, tom lane
Dann,
On 3/8/06 12:39 PM, "Dann Corbit" <DCorbit@connx.com> wrote:
Here are some suggestions of things that I know work really, really
well:
Can you point to an example? That might help move the discussion along.
The reason to interject about the tape goo in this discussion is that we
seem to be spending a lot of time optimizing around the tape goo without
tackling the overall structure of the external sort. I think we'll just end
up having to replace all of this goo when we really get around to fixing the
problem.
Add to this that other commercial databases external sort in 1/4 the time or
better on the same hardware with the same CPU/memory resources using a
2-pass external sort.
#1. Two pass merge (none of that silly poly-tape merge goo)
Voice of reason here. It's what the other database systems do.
#2. Load ONLY the keys that are to be sorted into memory. Use a
pointer exchange sort, and do not move the physical rows of data at all.
Sounds right. Example of this in practice?
I am pretty sure from this thread that PostgreSQL is not doing #1, and I
have no idea if it is doing #2.
Yep. Even Knuth says that the tape goo is only interesting from a
historical perspective and may not be relevant in an era of disk drives.
- Luke
-----Original Message-----
From: Luke Lonergan [mailto:llonergan@greenplum.com]
Sent: Wednesday, March 08, 2006 1:52 PM
To: Dann Corbit; Tom Lane; Jim C. Nasby
Cc: Simon Riggs; pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] Merge algorithms for large numbers of "tapes"
Dann,
On 3/8/06 12:39 PM, "Dann Corbit" <DCorbit@connx.com> wrote:
Here are some suggestions of things that I know work really, really
well:
Can you point to an example? That might help move the discussion along.
I wrote all of the sorting and merging stuff for CONNX Solutions
http://www.connx.com
I have carefully benched all of this stuff and (at least for our system)
the ideas I propose work well. Of course, every system is different and
the only way to know if it is an improvement is to try it in place.
The reason to interject about the tape goo in this discussion is that we
seem to be spending a lot of time optimizing around the tape goo without
tackling the overall structure of the external sort. I think we'll just
end up having to replace all of this goo when we really get around to
fixing the problem.
I suggest trying several alternatives and benching them with real world
queries and especially with the open database benchmark suite.
Add to this that other commercial databases external sort in 1/4 the
time or better on the same hardware with the same CPU/memory resources
using a 2-pass external sort.
Our sort merge is so fast that I can join two tables on a column with no
index faster than on a database that has a unique clustered index on the
column. Benchmarked against Oracle, SQL*Server, and several others.
If you check our ORDER BY on a large table with no index, you will see
that it is competitive with the best commercial systems.
If you are interested, you could get an eval of CONNX and try it
yourself (eval is free for some number of days, I don't remember what).
#1. Two pass merge (none of that silly poly-tape merge goo)
Voice of reason here. It's what the other database systems do.
#2. Load ONLY the keys that are to be sorted into memory. Use a
pointer exchange sort, and do not move the physical rows of data at all.
Sounds right. Example of this in practice?
It is what we use here. It is the only way to fly. This is well known,
and if you read a few articles from the ACM, you will see that it has
been known for decades.
I am pretty sure from this thread that PostgreSQL is not doing #1, and I
have no idea if it is doing #2.
Yep. Even Knuth says that the tape goo is only interesting from a
historical perspective and may not be relevant in an era of disk drives.
- Luke