[PERFORM] A Better External Sort?

Started by Ron Peacetreeover 20 years ago126 messageshackers

rjpeace@earthlink.net

over 20 years ago

From: Ron Peacetree <rjpeace@earthlink.net>
Sent: Sep 24, 2005 6:30 AM
Subject: Re: [HACKERS] [PERFORM] Releasing memory during External sorting?

... the amount of IO done is the most
important of the things that you should be optimizing for in
choosing an external sorting algorithm.

<snip>

Since sorting is a fundamental operation in many parts of a DBMS,
this is a Big Deal.

This discussion has gotten my creative juices flowing. I'll post
some Straw Man algorithm sketches after I've done some more
thought.

As a thought exeriment, I've been considering the best way to sort 1TB
(2^40B) of 2-4KB (2^11-2^12B) records. That's 2^28-2^29 records.

Part I: A Model of the System
The performance of such external sorts is limited by HD IO, then
memory IO, and finally CPU throughput.

On commodity HW, single HD IO is ~1/2048 (single HD realistic worst
case) to ~1/128 (single HD best case. No more than one seek every
~14.7ms for a ~50MB/s 7200rpm SATA II HD) the throughtput of RAM.

RAID HD IO will be in the range from as low as a single HD (RAID 1) to
~1/8 (a RAID system saturating the external IO bus) the throughput of
RAM.

RAM is ~1/8-1/16 the throughput and ~128x the latency of the data
pathways internal to the CPU.

This model suggests that HD IO will greatly dominate every other
factor, particuarly if we are talking about a single HD rather than a
peripheral bus saturating RAID subsystem. If at all possible, we want
to access the HD subsystem only once for each data item, and we want
to avoid seeking more than the critical number of seeks implied above
when doing it. It also suggests that at a minimum, it's worth it to
spend ~8 memory operations or ~64 CPU operations to avoid a HD access.
Far more than that if we are talking about a single random access.

It's worth spending ~128 CPU operations to avoid a single random RAM
access, and literally 10's or even 100's of thousands of CPU operations to
avoid a random HD access. In addition, there are many indications in
current ECE and IT literature that the performance gaps between these
pieces of computer systems are increasing and expected to continue to do
so for the forseeable future. In short, _internal_ sorts have some, and are
going to increasingly have more, of the same IO problems usually
associated with external sorts.

Part II: a Suggested Algorithm
The simplest case is one where we have to order the data using a key that
only has two values.

Given 2^40B of data using 2KB or 4KB per record, the most compact
representation we can make of such a data set is to assign a 32b= 4B RID
or Rptr for location + a 1b key for each record. Just the RID's would take up
1.25GB (250M records) or 2.5GB (500M records). Enough space that even
an implied ordering of records may not fit into RAM.

Still, sorting 1.25GB or 2.5GB of RIDs is considerably less expensive in terms
of IO operations than sorting the actual 1TB of data.

That IO cost can be lowered even further if instead of actually physically
sorting the RIDs, we assign a RID to the appropriate catagory inside the CPU
as we scan the data set and append the entries in a catagory from CPU cache
to a RAM file in one IO burst whenever said catagory gets full inside the CPU.
We can do the same with either RAM file to HD whenever they get full. The
sorted order of the data is found by concatenating the appropriate files at the
end of the process.

As simple as this example is, it has many of the characteristics we are looking for:
A= We access each piece of data once on HD and in RAM.
B= We do the minimum amount of RAM and HD IO, and almost no random IO in
either case.
C= We do as much work as possible within the CPU.
D= This process is stable. Equal keys stay in the original order they are encountered.

To generalize this method, we first need our 1b Key to become a sufficiently large
enough Key or KeyPrefix to be useful, yet not so big as to be CPU cache unfriendly.

Cache lines (also sometimes called "blocks") are usually 64B= 512b in size.
Therefore our RID+Key or KeyPrefix should never be larger than this. For a 2^40B
data set, a 5B RID leaves us with potentially as much as 59B of Key or KeyPrefix.
Since the data can't take on more than 40b worth different values (actually 500M= 29b
for our example), we have more than adequate space for Key or KeyPrefix. We just
have to figure out how to use it effectively.
A typical CPU L2 cache can hold 10's or 100's of thousands of such cache lines.
That's enough that we should be able to do a significant amount of useful work within
the CPU w/o having to go off-die.

The data structure we are using to represent the sorted data also needs to be
generalized. We want a space efficient DS that allows us to find any given element in
as few accesses as possible and that allows us to insert new elements or reorganize
the DS as efficiently as possible. This being a DB discussion list, a B+ tree seems like
a fairly obvious suggestion ;-)

A B+ tree where each element is no larger than a cache line and no node is larger than
what fits into L2 cache can be created dynamically as we scan the data set via any of
the fast, low IO methods well known for doing so. Since the L2 cache can hold 10's of
thousands of cache lines, it should be easy to make sure that the B+ tree has something
like 1000 elements per node (making the base of the logarithm for access being at least
1000). The log base 1000 of 500M is ~2.9, so that means that even in the absolute
worst case where every one of the 500M records is unique we can find any given
element in less than 3 accesses of the B+ tree. Increasing the order of the B+ tree is
an option to reduce average accesses even further.

Since the DS representing the sorted order of the data is a B+ tree, it's very "IO friendly"
if we need to store part or all of it on HD.

In an multiprocessor environment, we can assign chunks of the data set to different
CPUs, let them build their independant B+ trees to represent the data in sorted order from
their POV, and then merge the B+ trees very efficiently into one overall DS to represent
the sorted order of the entire data set.

Finally, since these are B+ trees, we can keep them around and easily update them at will
for frequent used sorting conditions.

What do people think?

Ron