[PERFORM] A Better External Sort?

Started by Ron Peacetree · over 20 years ago · 126 messages · hackers
#1 Ron Peacetree
rjpeace@earthlink.net

From: Ron Peacetree <rjpeace@earthlink.net>
Sent: Sep 24, 2005 6:30 AM
Subject: Re: [HACKERS] [PERFORM] Releasing memory during External sorting?

... the amount of IO done is the most
important of the things that you should be optimizing for in
choosing an external sorting algorithm.

<snip>

Since sorting is a fundamental operation in many parts of a DBMS,
this is a Big Deal.

This discussion has gotten my creative juices flowing. I'll post
some Straw Man algorithm sketches after I've done some more
thought.

As a thought experiment, I've been considering the best way to sort 1TB
(2^40B) of 2-4KB (2^11-2^12B) records. That's 2^28-2^29 records.

Part I: A Model of the System
The performance of such external sorts is limited by HD IO, then
memory IO, and finally CPU throughput.

On commodity HW, single HD IO is ~1/2048 (single HD realistic worst
case) to ~1/128 (single HD best case: no more than one seek every
~14.7ms for a ~50MB/s 7200rpm SATA II HD) the throughput of RAM.

RAID HD IO will be in the range from as low as a single HD (RAID 1) to
~1/8 (a RAID system saturating the external IO bus) the throughput of
RAM.

RAM is ~1/8-1/16 the throughput and ~128x the latency of the data
pathways internal to the CPU.

This model suggests that HD IO will greatly dominate every other
factor, particularly if we are talking about a single HD rather than a
peripheral bus saturating RAID subsystem. If at all possible, we want
to access the HD subsystem only once for each data item, and we want
to avoid seeking more than the critical number of seeks implied above
when doing it. It also suggests that at a minimum, it's worth it to
spend ~8 memory operations or ~64 CPU operations to avoid a HD access.
Far more than that if we are talking about a single random access.

It's worth spending ~128 CPU operations to avoid a single random RAM
access, and literally 10's or even 100's of thousands of CPU operations to
avoid a random HD access. In addition, there are many indications in
current ECE and IT literature that the performance gaps between these
pieces of computer systems are increasing and expected to continue to do
so for the foreseeable future. In short, _internal_ sorts have some, and are
going to increasingly have more, of the same IO problems usually
associated with external sorts.
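The gap arithmetic above can be sanity-checked with illustrative numbers. The 50MB/s sequential rate and 14.7ms seek time come from the post; the RAM bandwidth and per-access transfer size below are assumptions of this sketch, not measured values:

```python
# Sanity check of the bandwidth-gap model with illustrative 2005-era figures.
hd_stream = 50e6          # ~50 MB/s sequential HD throughput (from the post)
seek_time = 14.7e-3       # ~14.7 ms per random seek (from the post)
block = 8192              # bytes moved per random access (assumed)

# Effective random-access throughput: one seek plus one block transfer.
hd_random = block / (seek_time + block / hd_stream)

ram = 6.4e9               # ~6.4 GB/s RAM bandwidth (assumed DDR-era figure)

print(f"HD sequential : RAM = 1/{ram / hd_stream:.0f}")   # 1/128, the best case
print(f"HD random     : RAM = 1/{ram / hd_random:.0f}")   # orders of magnitude worse
```

The random-access ratio depends strongly on the assumed transfer size, which is why the post gives a range rather than one number.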

Part II: a Suggested Algorithm
The simplest case is one where we have to order the data using a key that
only has two values.

Given 2^40B of data using 2KB or 4KB per record, the most compact
representation we can make of such a data set is to assign a 32b= 4B RID
or Rptr for location + a 1b key for each record. Just the RID's would take up
1.25GB (250M records) or 2.5GB (500M records). Enough space that even
an implied ordering of records may not fit into RAM.

Still, sorting 1.25GB or 2.5GB of RIDs is considerably less expensive in terms
of IO operations than sorting the actual 1TB of data.

That IO cost can be lowered even further if instead of actually physically
sorting the RIDs, we assign a RID to the appropriate category inside the CPU
as we scan the data set and append the entries in a category from CPU cache
to a RAM file in one IO burst whenever said category gets full inside the CPU.
We can do the same with either RAM file to HD whenever they get full. The
sorted order of the data is found by concatenating the appropriate files at the
end of the process.

As simple as this example is, it has many of the characteristics we are looking for:
A= We access each piece of data once on HD and in RAM.
B= We do the minimum amount of RAM and HD IO, and almost no random IO in
either case.
C= We do as much work as possible within the CPU.
D= This process is stable. Equal keys stay in the original order they are encountered.
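The two-value scheme described above can be sketched as a stable single-pass partition. `flush_size` here is a stand-in for the point at which a cache-resident buffer would be burst-written to the next tier of the hierarchy:

```python
from collections import defaultdict

def two_value_sort(records, key_of, flush_size=4):
    """Single pass: route each RID to its key's bucket; append-only writes.

    `flush_size` models the point at which a CPU-cache buffer would be
    flushed to RAM (and RAM to HD) in one sequential burst.
    """
    buffers = defaultdict(list)   # per-key in-"cache" buffer
    files = defaultdict(list)     # per-key "RAM/HD file", append-only
    for rid, rec in enumerate(records):
        k = key_of(rec)
        buffers[k].append(rid)
        if len(buffers[k]) >= flush_size:      # burst-write when full
            files[k].extend(buffers[k])
            buffers[k].clear()
    for k, buf in buffers.items():             # final flush
        files[k].extend(buf)
    # Sorted order = concatenation of the per-key files, low key first.
    return [rid for k in sorted(files) for rid in files[k]]

data = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
order = two_value_sort(data, key_of=lambda b: b)
# -> [1, 4, 5, 7, 8, 0, 2, 3, 6, 9]: all 0-records in original order,
#    then all 1-records, i.e. a stable sort by the 1-bit key.
```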

To generalize this method, we first need our 1b Key to become a sufficiently
large Key or KeyPrefix to be useful, yet not so big as to be CPU cache unfriendly.

Cache lines (also sometimes called "blocks") are usually 64B= 512b in size.
Therefore our RID+Key or KeyPrefix should never be larger than this. For a 2^40B
data set, a 5B RID leaves us with potentially as much as 59B of Key or KeyPrefix.
Since the data can't take on more than 40b worth of different values (actually 500M= 29b
for our example), we have more than adequate space for Key or KeyPrefix. We just
have to figure out how to use it effectively.
A typical CPU L2 cache can hold 10's or 100's of thousands of such cache lines.
That's enough that we should be able to do a significant amount of useful work within
the CPU w/o having to go off-die.

The data structure we are using to represent the sorted data also needs to be
generalized. We want a space efficient DS that allows us to find any given element in
as few accesses as possible and that allows us to insert new elements or reorganize
the DS as efficiently as possible. This being a DB discussion list, a B+ tree seems like
a fairly obvious suggestion ;-)

A B+ tree where each element is no larger than a cache line and no node is larger than
what fits into L2 cache can be created dynamically as we scan the data set via any of
the fast, low IO methods well known for doing so. Since the L2 cache can hold 10's of
thousands of cache lines, it should be easy to make sure that the B+ tree has something
like 1000 elements per node (making the base of the logarithm for access being at least
1000). The log base 1000 of 500M is ~2.9, so that means that even in the absolute
worst case where every one of the 500M records is unique we can find any given
element in less than 3 accesses of the B+ tree. Increasing the order of the B+ tree is
an option to reduce average accesses even further.
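The access-depth claim is easy to check directly (a toy calculation, not part of any proposed implementation):

```python
import math

def btree_depth(n_keys, fanout):
    """Levels needed for a B+ tree of the given fanout to index n_keys."""
    return math.ceil(math.log(n_keys, fanout))

# log base 1000 of 500M is ~2.9, so 3 levels cover the worst case
# where all 500M keys are distinct.
print(btree_depth(500_000_000, 1000))   # -> 3
```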

Since the DS representing the sorted order of the data is a B+ tree, it's very "IO friendly"
if we need to store part or all of it on HD.

In a multiprocessor environment, we can assign chunks of the data set to different
CPUs, let them build their independent B+ trees to represent the data in sorted order from
their POV, and then merge the B+ trees very efficiently into one overall DS to represent
the sorted order of the entire data set.

Finally, since these are B+ trees, we can keep them around and easily update them at will
for frequently used sorting conditions.

What do people think?

Ron

#2 Dann Corbit
DCorbit@connx.com
In reply to: Ron Peacetree (#1)
Re: [PERFORM] A Better External Sort?

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
owner@postgresql.org] On Behalf Of Ron Peacetree
Sent: Monday, September 26, 2005 10:47 AM
To: pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org
Subject: [HACKERS] [PERFORM] A Better External Sort?

> From: Ron Peacetree <rjpeace@earthlink.net>
> Sent: Sep 24, 2005 6:30 AM
> Subject: Re: [HACKERS] [PERFORM] Releasing memory during External sorting?
>
> <snip>
>
> If at all possible, we want
> to access the HD subsystem only once for each data item,

If you can achieve that, I think you should be given a Nobel Prize, and
I mean that sincerely. I also think that your analysis is interesting.

> <snip> In short, _internal_ sorts have some, and are
> going to increasingly have more, of the same IO problems usually
> associated with external sorts.

Knuth has made the observation (confirmed by others) that 40% of
mainframe CPU cycles are spent on sorting. Hence, any sort of
optimization in this area is a potential for enormous savings.

> Part II: a Suggested Algorithm
> The simplest case is one where we have to order the data using a key that
> only has two values.

I suggest testing against a very large class of distributions. All of
the common statistical models are a start (Gaussian, Poisson, etc.) and
also single value, two distinct values, to some limit.

> <snip>
>
> What do people think?

I think that your analysis is very interesting. I would like to see the
result of the experiment.

I think that the btrees are going to be O(n*log(n)) in disk accesses
during index construction unless you memory map them [which means you
would need stupendous memory volume], so I cannot say that I really
understand your idea yet. Can you draw a picture of it for me? (I am
dyslexic and understand things far better when I can visualize them.)

#3 Jonah H. Harris
jonah.harris@gmail.com
In reply to: Dann Corbit (#2)
Re: [PERFORM] A Better External Sort?

Ron,

Having rested my brain for the last few days, your theory made for
interesting reading... Rather than argue the technical specs, I'd love to
see an implementation :)

-Jonah

On 9/26/05, Dann Corbit <DCorbit@connx.com> wrote:

> <snip: Dann's message (#2), quoted in full>

--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

#4 Ron Peacetree
rjpeace@earthlink.net
In reply to: Jonah H. Harris (#3)
Re: [PERFORM] A Better External Sort?

From: Dann Corbit <DCorbit@connx.com>
Sent: Sep 26, 2005 5:13 PM
To: Ron Peacetree <rjpeace@earthlink.net>, pgsql-hackers@postgresql.org,
pgsql-performance@postgresql.org
Subject: RE: [HACKERS] [PERFORM] A Better External Sort?

> I think that the btrees are going to be O(n*log(n)) in construction of
> the indexes in disk access unless you memory map them [which means you
> would need stupendous memory volume] and so I cannot say that I really
> understand your idea yet.

Traditional algorithms for the construction of Btree variants (B, B+, B*, ...)
don't require O(nlgn) HD accesses. These shouldn't either.

Let's start by assuming that an element is <= in size to a cache line and a
node fits into L1 DCache. To make the discussion more concrete, I'll use a
64KB L1 cache + a 1MB L2 cache only as an example.

Simplest case: the Key has few enough distinct values that all Keys or
KeyPrefixes fit into L1 DCache (for a 64KB cache with 64B lines, that's
<= 1000 different values. More if we can fit more than 1 element into
each cache line.).

As we scan the data set coming in from HD, we compare the Key or KeyPrefix
to the sorted list of Key values in the node. This can be done in O(lgn) using
Binary Search or O(lglgn) using a variation of Interpolation Search.
If the Key value exists, we append this RID to the list of RIDs having the
same Key:
If the RAM buffer of this list of RIDs is full we append it and the current
RID to the HD list of these RIDs.
Else we insert this new key value into its proper place in the sorted list of Key
values in the node and start a new list for this value of RID.

We allocate room for a CPU write buffer so we can schedule RAM writes to
the RAM lists of RIDs so as to minimize the randomness of them.

When we are finished scanning the data set from HD, the sorted node with
RID lists for each Key value contains the sort order for the whole data set.

Notice that almost all of the random data access is occurring within the CPU
rather than in RAM or HD, and that we are accessing RAM or HD only when
absolutely needed.
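The single-node scan described above might look roughly like this in outline, with `bisect` standing in for the in-cache binary search and `buf_limit` a placeholder for the real cache-buffer size:

```python
import bisect

def index_pass(records, key_of, buf_limit=3):
    """One scan: a sorted key list ("the node") plus per-key RID lists.

    Each key's RIDs accumulate in a small buffer (standing in for CPU
    cache) and are appended to `spilled` in one burst when full.
    """
    keys = []       # sorted list of distinct Key values inside "the node"
    buf = {}        # per-key RID write buffer
    spilled = {}    # per-key "RAM/HD" list, append-only
    for rid, rec in enumerate(records):
        k = key_of(rec)
        i = bisect.bisect_left(keys, k)        # O(lg n) binary search
        if i == len(keys) or keys[i] != k:     # new Key value: insert it
            keys.insert(i, k)
            buf[k] = []
            spilled[k] = []
        buf[k].append(rid)
        if len(buf[k]) >= buf_limit:           # burst-write to "RAM"
            spilled[k].extend(buf[k])
            buf[k].clear()
    for k in keys:                             # final flush
        spilled[k].extend(buf[k])
    # Walking keys in order yields the stable sorted order of all RIDs.
    return [rid for k in keys for rid in spilled[k]]
```

For example, `index_pass(["b", "a", "b", "c", "a"], key_of=lambda r: r)` yields `[1, 4, 0, 2, 3]`, the same order as a stable sort by key.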

Next simplest case: Multiple nodes, but they all fit in the CPU cache(s).
In the given example CPU, we will be able to fit at least 1000 elements per
node and 2^20/2^16= up to 16 such nodes in this CPU. We use a node's
worth of space as a RAM write buffer, so we end up with room for 15 such
nodes in this CPU. This is enough for a 2 level index to at least 15,000
distinct Key value lists.

All of the traditional tricks for splitting a Btree node and redistributing
elements within them during insertion or splitting for maximum node
utilization can be used here.

The most general case: There are too many nodes to fit within the CPU
cache(s). The root node now points to a maximum of at least 1000 nodes
since each element in the root node points to another node. A full 2 level
index is now enough to point to at least 10^6 distinct Key value lists, and
3 levels will index more distinct Key values than is possible in our 1TB,
500M record example.

We can use some sort of node use prediction algorithm like LFU to decide
which node should be moved out of CPU when we have to replace one of
the nodes in the CPU. The nodes in RAM or on HD can be arranged to
maximize streaming IO behavior and minimize random access IO
behavior.
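A toy version of the LFU node-replacement policy mentioned here; the capacity and the `load` callback are illustrative, not part of the proposal:

```python
from collections import Counter

class LFUNodeCache:
    """Toy LFU policy for choosing which B+ tree node to evict from the
    CPU-resident set when a new node must be brought in."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = {}      # node_id -> node payload
        self.uses = Counter()   # access counts per node

    def access(self, node_id, load):
        if node_id not in self.resident:
            if len(self.resident) >= self.capacity:
                # Evict the least-frequently-used resident node.
                victim = min(self.resident, key=lambda n: self.uses[n])
                del self.resident[victim]
            self.resident[node_id] = load(node_id)  # fetch from RAM/HD
        self.uses[node_id] += 1
        return self.resident[node_id]
```

With capacity 2, accessing A twice, then B, then C evicts B: A has the higher use count, so the colder node is pushed out of the CPU-resident set.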

As you can see, both the RAM and HD IO are as minimized as possible,
and what such IO there is has been optimized for streaming behavior.

Can you draw a picture of it for me? (I am dyslexic and understand things
far better when I can visualize it).

Not much for pictures. Hopefully the explanation helps?

Ron

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ron Peacetree (#4)
Re: [PERFORM] A Better External Sort?

Ron Peacetree <rjpeace@earthlink.net> writes:

> Let's start by assuming that an element is <= in size to a cache line and a
> node fits into L1 DCache. [ much else snipped ]

So far, you've blithely assumed that you know the size of a cache line,
the sizes of L1 and L2 cache, and that you are working with sort keys
that you can efficiently pack into cache lines. And that you know the
relative access speeds of the caches and memory so that you can schedule
transfers, and that the hardware lets you get at that transfer timing.
And that the number of distinct key values isn't very large.

I don't see much prospect that anything we can actually use in a
portable fashion is going to emerge from this line of thought.

regards, tom lane

#6 Ron Peacetree
rjpeace@earthlink.net
In reply to: Tom Lane (#5)
Re: [PERFORM] A Better External Sort?

SECOND ATTEMPT AT POST. Web mailer appears to have
eaten first one. I apologize in advance if anyone gets two
versions of this post.
=r

From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Sep 26, 2005 9:42 PM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

> So far, you've blithely assumed that you know the size of a cache line,
> the sizes of L1 and L2 cache,

NO. I used exact values only as examples. Realistic examples drawn
from an extensive survey of past, present, and what I could find out
about future systems; but only examples nonetheless. For instance,
Hennessy and Patterson 3ed points out that 64B cache lines are
optimally performing for caches between 16KB and 256KB. The same
source as well as sources specifically on CPU memory hierarchy
design points out that we are not likely to see L1 caches larger than
256KB in the foreseeable future.

The important point was the idea of an efficient Key, rather than
Record, sort using a CPU cache friendly data structure with provably
good space and IO characteristics based on a reasonable model of
current and likely future single box computer architecture (although
it would be fairly easy to extend it to include the effects of
networking.)

No a priori exact or known values are required for the method to work.

> and that you are working with sort keys that you can efficiently pack
> into cache lines.

Not "pack". "map". n items can not take on more than n values. n
values can be represented in lgn bits. Less efficient mappings can
also work. Either way I demonstrated that we have plenty of space in
a likely and common cache line size. Creating a mapping function
to represent m values in lgm bits is a well known hack, and if we keep
track of minimum and maximum values for fields during insert and
delete operations, we can even create mapping functions fairly easily.
(IIRC, Oracle does keep track of minimum and maximum field
values.)
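As a toy illustration of mapping a field's observed range into few bits using tracked minimum and maximum values (the function names here are hypothetical, and the dense-rank variant Ron alludes to would need a dictionary):

```python
import math

def bits_needed(lo, hi):
    """Bits to represent every integer field value in [lo, hi]."""
    span = hi - lo + 1
    return max(1, math.ceil(math.log2(span)))

def encode(v, lo):
    """Compact code for v: its offset from the tracked minimum."""
    return v - lo

# A field observed to range over [1000, 1999] has 1000 possible values,
# so lg(1000) -> 10 bits suffice instead of a raw 32b or 64b field.
print(bits_needed(1000, 1999))   # -> 10
print(encode(1500, 1000))        # -> 500
```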

> And that you know the relative access speeds of the caches and
> memory so that you can schedule transfers,

Again, no. I created a reasonable model of a computer system that
holds remarkably well over a _very_ wide range of examples. I
don't need the numbers to be exactly right to justify my approach
to this problem or understand why other approaches may have
downsides. I just have to get the relative performance of the
system components and the relative performance gap between them
reasonably correct. The stated model does that very well.

Please don't take my word for it. Go grab some random box:
laptop, desktop, unix server, etc and try it for yourself. Part of the
reason I published the model was so that others could examine it.

> and that the hardware lets you get at that transfer timing.

Never said anything about this, and in fact I do not need any such.

> And that the number of distinct key values isn't very large.

Quite the opposite in fact. I went out of my way to show that the
method still works well even if every Key is distinct. It is _more
efficient_ when the number of distinct keys is small compared to
the number of data items, but it works as well as any other Btree
would when all n of the Keys are distinct. This is just a CPU cache
and more IO friendly Btree, not some magical and unheard of
technique. It's just as general purpose as Btrees usually are.

I'm simply looking at the current and likely future state of computer
systems architecture and coming up with a slight twist on how to use
already well known and characterized techniques, not trying to start
a revolution.

I'm trying very hard NOT to waste anyone's time around here.
Including my own.
Ron

#7 Jonah H. Harris
jonah.harris@gmail.com
In reply to: Ron Peacetree (#6)
Re: [PERFORM] A Better External Sort?

Ron,

Again, if you feel strongly enough about the theory to argue it, I recommend
that you spend your time constructively; create an implementation of it.
Citing academics is cool and all, but code speaks louder than theory in this
case. As Tom mentioned, this has to be portable. Making assumptions about
computing architectures (especially those in the future) is fine for
theory, but not practical for something that needs to be maintained in the
real world. Go forth and write thy code.

-Jonah

On 9/27/05, Ron Peacetree <rjpeace@earthlink.net> wrote:

SECOND ATTEMPT AT POST. Web mailer appears to have
eaten first one. I apologize in advance if anyone gets two
versions of this post.
=r

From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Sep 26, 2005 9:42 PM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

So far, you've blithely assumed that you know the size of a cache line,
the sizes of L1 and L2 cache,

NO. I used exact values only as examples. Realistic examples drawn
from an extensive survey of past, present, and what I could find out
about future systems; but only examples nonetheless. For instance,
Hennessy and Patterson 3ed points out that 64B cache lines perform
optimally for caches between 16KB and 256KB. The same
source, as well as sources specifically on CPU memory hierarchy
design, points out that we are not likely to see L1 caches larger than
256KB in the foreseeable future.

The important point was the idea of an efficient Key, rather than
Record, sort using a CPU cache friendly data structure with provably
good space and IO characteristics based on a reasonable model of
current and likely future single box computer architecture (although
it would be fairly easy to extend it to include the effects of
networking.)

No a priori exact or known values are required for the method to work.

and that you are working with sort keys that you can efficiently pack
into cache lines.

Not "pack". "map". n items can not take on more than n values. n
values can be represented in lg n bits. Less efficient mappings can
also work. Either way I demonstrated that we have plenty of space in
a likely and common cache line size. Creating a mapping function
to represent m values in lg m bits is a well known hack, and if we keep
track of minimum and maximum values for fields during insert and
delete operations, we can even create mapping functions fairly easily.
(IIRC, Oracle does keep track of minimum and maximum field
values.)

And that you know the relative access speeds of the caches and
memory so that you can schedule transfers,

Again, no. I created a reasonable model of a computer system that
holds remarkably well over a _very_ wide range of examples. I
don't need the numbers to be exactly right to justify my approach
to this problem or understand why other approaches may have
downsides. I just have to get the relative performance of the
system components and the relative performance gap between them
reasonably correct. The stated model does that very well.

Please don't take my word for it. Go grab some random box:
laptop, desktop, unix server, etc and try it for yourself. Part of the
reason I published the model was so that others could examine it.

and that the hardware lets you get at that transfer timing.

Never said anything about this, and in fact I do not need any such.

And that the number of distinct key values isn't very large.

Quite the opposite in fact. I went out of my way to show that the
method still works well even if every Key is distinct. It is _more
efficient_ when the number of distinct keys is small compared to
the number of data items, but it works as well as any other Btree
would when all n of the Keys are distinct. This is just a CPU cache
and more IO friendly Btree, not some magical and unheard of
technique. It's just as general purpose as Btrees usually are.

I'm simply looking at the current and likely future state of computer
systems architecture and coming up with a slight twist on how to use
already well known and characterized techniques, not trying to start
a revolution.

I'm trying very hard NOT to waste anyone's time around here.
Including my own.
Ron

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

#8Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Jonah H. Harris (#7)
Re: [PERFORM] A Better External Sort?

I can't help wondering how a couple thousand context switches per
second would affect the attempt to load disk info into the L1 and
L2 caches. That's pretty much the low end of what I see when the
server is under any significant load.

#9Josh Berkus
josh@agliodbs.com
In reply to: Ron Peacetree (#4)
Re: [PERFORM] A Better External Sort?

Ron,

I've somehow missed part of this thread, which is a shame since this is
an area of primary concern for me.

Your suggested algorithm seems to be designed to relieve I/O load by
making more use of the CPU. (if I followed it correctly). However,
that's not PostgreSQL's problem; currently for us external sort is a
*CPU-bound* operation, half of which is value comparisons. (oprofiles
available if anyone cares)

So we need to look, instead, at algorithms which make better use of
work_mem to lower CPU activity, possibly even at the expense of I/O.

--Josh Berkus

#10Ron Peacetree
rjpeace@earthlink.net
In reply to: Josh Berkus (#9)
Re: [PERFORM] A Better External Sort?

From: Josh Berkus <josh@agliodbs.com>
Sent: Sep 27, 2005 12:15 PM
To: Ron Peacetree <rjpeace@earthlink.net>
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

I've somehow missed part of this thread, which is a shame since this is
an area of primary concern for me.

Your suggested algorithm seems to be designed to relieve I/O load by
making more use of the CPU. (if I followed it correctly).

The goal is to minimize all IO load. Not just HD IO load, but also RAM
IO load. Particularly random access IO load of any type (for instance:
"the pointer chasing problem").

In addition, the design replaces explicit data or explicit key manipulation
with the creation of a smaller, far more CPU and IO efficient data
structure (essentially a CPU cache friendly Btree index) of the sorted
order of the data.

That Btree can be used to generate a physical reordering of the data
in one pass, but that's the weakest use for it. The more powerful
uses involve allowing the Btree to persist and using it for more
efficient re-searches or combining it with other such Btrees (either as
a step in task distribution across multiple CPUs or as a more efficient
way to do things like joins by manipulating these Btrees rather than
the actual records.)

However, that's not PostgreSQL's problem; currently for us external
sort is a *CPU-bound* operation, half of which is value comparisons.
(oprofiles available if anyone cares)

So we need to look, instead, at algorithms which make better use of
work_mem to lower CPU activity, possibly even at the expense of I/O.

I suspect that even the highly efficient sorting code we have is
suffering more pessimal CPU IO behavior than what I'm presenting.
Jim Gray's external sorting contest web site points out that memory IO
has become a serious problem for most of the contest entries.

Also, I'll bet the current code manipulates more data.

Finally, there's the possibility of reusing the product of this work to a
degree and in ways that we can't with our current sorting code.

Now all we need is resources and time to create a prototype.
Since I'm not likely to have either any time soon, I'm hoping that
I'll be able to explain this well enough that others can test it.

*sigh* I _never_ have enough time or resources any more...
Ron

#11Jeffrey W. Baker
jwbaker@acm.org
In reply to: Ron Peacetree (#10)
Re: [PERFORM] A Better External Sort?

On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:

That Btree can be used to generate a physical reordering of the data
in one pass, but that's the weakest use for it. The more powerful
uses involve allowing the Btree to persist and using it for more
efficient re-searches or combining it with other such Btrees (either as
a step in task distribution across multiple CPUs or as a more efficient
way to do things like joins by manipulating these Btrees rather than
the actual records.)

Maybe you could describe some concrete use cases. I can see what you
are getting at, and I can imagine some advantageous uses, but I'd like
to know what you are thinking.

Specifically I'd like to see some cases where this would beat sequential
scan. I'm thinking that in your example of a terabyte table with a
column having only two values, all the queries I can think of would be
better served with a sequential scan.

Perhaps I believe this because you can now buy as much sequential I/O as
you want. Random I/O is the only real savings.

-jwb

#12Ron Peacetree
rjpeace@earthlink.net
In reply to: Jeffrey W. Baker (#11)
Re: [PERFORM] A Better External Sort?

From: "Jeffrey W. Baker" <jwbaker@acm.org>
Sent: Sep 27, 2005 1:26 PM
To: Ron Peacetree <rjpeace@earthlink.net>
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:

That Btree can be used to generate a physical reordering of the data
in one pass, but that's the weakest use for it. The more powerful
uses involve allowing the Btree to persist and using it for more
efficient re-searches or combining it with other such Btrees (either as
a step in task distribution across multiple CPUs or as a more efficient
way to do things like joins by manipulating these Btrees rather than
the actual records.)

Maybe you could describe some concrete use cases. I can see what
you are getting at, and I can imagine some advantageous uses, but
I'd like to know what you are thinking.

1= In a 4P box, we split the data in RAM into 4 regions and create
a CPU cache friendly Btree using the method I described for each CPU.
The 4 Btrees can be merged in a more time and space efficient manner
than the original records to form a Btree that represents the sorted order
of the entire data set. Any of these Btrees can be allowed to persist to
lower the cost of doing similar operations in the future (Updating the
Btrees during inserts and deletes is cheaper than updating the original
data files and then redoing the same sort from scratch in the future.)
Both the original sort and future such sorts are made more efficient
than current methods.

2= We use my method to sort two different tables. We now have these
very efficient representations of a specific ordering on these tables. A
join operation can now be done using these Btrees rather than the
original data tables that involves less overhead than many current
methods.

3= We have multiple such Btrees for the same data set representing
sorts done using different fields (and therefore different Keys).
Calculating a sorted order for the data based on a composition of
those Keys is now cheaper than doing the sort based on the composite
Key from scratch. When some of the Btrees exist and some of them
do not, there is a tradeoff calculation to be made. Sometimes it will be
cheaper to do the sort from scratch using the composite Key.

Specifically I'd like to see some cases where this would beat sequential
scan. I'm thinking that in your example of a terabyte table with a
column having only two values, all the queries I can think of would be
better served with a sequential scan.

In my original example, a sequential scan of the 1TB of 2KB or 4KB
records, => 250M or 500M records of data, being sorted on a binary
value key will take ~1000x more time than reading in the ~1GB Btree
I described that used a Key+RID (plus node pointers) representation
of the data.

Just to clarify the point further,
1TB of 1B records => 2^40 records of at most 256 distinct values.
1TB of 2B records => 2^39 records of at most 2^16 distinct values.
1TB of 4B records => 2^38 records of at most 2^32 distinct values.
1TB of 5B records => 200B records of at most 200B distinct values.

From here on, the number of possible distinct values is limited by the
number of records.
100B records are used in the "Indy" version of Jim Gray's sorting
contests, so 1TB => 10B records.
2KB-4KB is the most common record size I've seen in enterprise
class DBMS (so I used this value to make my initial example more
realistic).

Therefore the vast majority of the time, representing a data set by Key
will use less space than the original record. Less space used means
less IO to scan the data set, which means faster scan times.

This is why index files work in the first place, right?

Perhaps I believe this because you can now buy as much sequential I/O
as you want. Random I/O is the only real savings.

1= No, you can not "buy as much sequential IO as you want". Even
with an infinite budget, there are physical and engineering limits. Long
before you reach those limits, you will pay exponentially increasing costs
for linearly increasing performance gains. So even if you _can_ buy a
certain level of sequential IO, it may not be the most efficient way to
spend money.

2= Most RW IT professionals have far from an infinite budget. Just traffic
on these lists shows how severe the typical cost constraints usually are.
OTOH, if you have an infinite IT budget, care to help a few less fortunate
than yourself? After all, even a large constant subtracted from infinity
is still infinity... ;-)

3= No matter how fast you can do IO, IO remains the most expensive
part of the performance equation. The fastest and cheapest IO you can
do is _no_ IO. As long as we trade cheaper RAM and even cheaper CPU
operations for IO correctly, more space efficient data representations will
always be a Win because of this.

#13Ron Peacetree
rjpeace@earthlink.net
In reply to: Ron Peacetree (#12)
Re: [PERFORM] A Better External Sort?

In the interest of efficiency and "not reinventing the wheel", does anyone know
where I can find C or C++ source code for a Btree variant with the following
properties:

A= Data elements (RIDs) are only stored in the leaves, Keys (actually
KeyPrefixes; see "D" below) and Node pointers are only stored in the internal
nodes of the Btree.

B= Element redistribution is done as an alternative to node splitting in overflow
conditions during Inserts whenever possible.

C= Variable length Keys are supported.

D= Node buffering with a reasonable replacement policy is supported.

E= Since we will know beforehand exactly how many RID's will be stored, we
will know a priori how much space will be needed for leaves, and will know the
worst case for how much space will be required for the Btree internal nodes
as well. This implies that we may be able to use an array, rather than linked
list, implementation of the Btree. Less pointer chasing at the expense of more
CPU calculations, but that's a trade-off in the correct direction.

Such source would be a big help in getting a prototype together.

Thanks in advance for any pointers or source,
Ron

#14Ron Peacetree
rjpeace@earthlink.net
In reply to: Ron Peacetree (#13)
Re: [PERFORM] A Better External Sort?

If I've done this correctly, there should not be anywhere near
the number of context switches we currently see while sorting.

Each unscheduled context switch represents something unexpected
occurring or things not being where they are needed when they are
needed. Reducing such circumstances to the absolute minimum
was one of the design goals.

Reducing the total amount of IO to the absolute minimum should
help as well.

Ron

-----Original Message-----
From: Kevin Grittner <Kevin.Grittner@wicourts.gov>
Sent: Sep 27, 2005 11:21 AM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

I can't help wondering how a couple thousand context switches per
second would affect the attempt to load disk info into the L1 and
L2 caches. That's pretty much the low end of what I see when the
server is under any significant load.

#15Jeffrey W. Baker
jwbaker@acm.org
In reply to: Ron Peacetree (#12)
Re: [PERFORM] A Better External Sort?

On Wed, 2005-09-28 at 12:03 -0400, Ron Peacetree wrote:

From: "Jeffrey W. Baker" <jwbaker@acm.org>
Sent: Sep 27, 2005 1:26 PM
To: Ron Peacetree <rjpeace@earthlink.net>
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:

That Btree can be used to generate a physical reordering of the data
in one pass, but that's the weakest use for it. The more powerful
uses involve allowing the Btree to persist and using it for more
efficient re-searches or combining it with other such Btrees (either as
a step in task distribution across multiple CPUs or as a more efficient
way to do things like joins by manipulating these Btrees rather than
the actual records.)

Maybe you could describe some concrete use cases. I can see what
you are getting at, and I can imagine some advantageous uses, but
I'd like to know what you are thinking.

Specifically I'd like to see some cases where this would beat sequential
scan. I'm thinking that in your example of a terabyte table with a
column having only two values, all the queries I can think of would be
better served with a sequential scan.

In my original example, a sequential scan of the 1TB of 2KB or 4KB
records, => 250M or 500M records of data, being sorted on a binary
value key will take ~1000x more time than reading in the ~1GB Btree
I described that used a Key+RID (plus node pointers) representation
of the data.

You are engaging in a lengthy and verbose exercise in mental
masturbation, because you have not yet given a concrete example of a
query where this stuff would come in handy. A common, general-purpose
case would be the best.

We can all see that the method you describe might be a good way to sort
a very large dataset with some known properties, which would be fine if
you are trying to break the terasort benchmark. But that's not what
we're doing here. We are designing and operating relational databases.
So please explain the application.

Your main example seems to focus on a large table where a key column has
constrained values. This case is interesting in proportion to the
number of possible values. If I have billions of rows, each having one
of only two values, I can think of a trivial and very fast method of
returning the table "sorted" by that key: make two sequential passes,
returning the first value on the first pass and the second value on the
second pass. This will be faster than the method you propose.

I think an important aspect you have failed to address is how much of
the heap you must visit after the sort is complete. If you are
returning every tuple in the heap then the optimal plan will be very
different from the case when you needn't.

-jwb

PS: Whatever mailer you use doesn't understand or respect threading nor
attribution. Out of respect for the list's readers, please try a mailer
that supports these 30-year-old fundamentals of electronic mail.

#16Jeffrey W. Baker
jwbaker@acm.org
In reply to: Ron Peacetree (#12)
Sequential I/O Cost (was Re: A Better External Sort?)

On Wed, 2005-09-28 at 12:03 -0400, Ron Peacetree wrote:

From: "Jeffrey W. Baker" <jwbaker@acm.org>
Perhaps I believe this because you can now buy as much sequential I/O
as you want. Random I/O is the only real savings.

1= No, you can not "buy as much sequential IO as you want". Even
with an infinite budget, there are physical and engineering limits. Long
before you reach those limits, you will pay exponentially increasing costs
for linearly increasing performance gains. So even if you _can_ buy a
certain level of sequential IO, it may not be the most efficient way to
spend money.

This is just false. You can buy sequential I/O for linear money up to
and beyond your platform's main memory bandwidth. Even 1GB/sec will
severely tax memory bandwidth of mainstream platforms, and you can
achieve this rate for a modest cost.

I have one array that can supply this rate and it has only 15 disks. It
would fit on my desk. I think your dire talk about the limits of
science and engineering may be a tad overblown.

#17Ron Peacetree
rjpeace@earthlink.net
In reply to: Jeffrey W. Baker (#15)
Re: [PERFORM] A Better External Sort?

From: "Jeffrey W. Baker" <jwbaker@acm.org>
Sent: Sep 29, 2005 12:27 AM
To: Ron Peacetree <rjpeace@earthlink.net>
Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

You are engaging in a lengthy and verbose exercise in mental
masturbation, because you have not yet given a concrete example of a
query where this stuff would come in handy. A common, general-purpose
case would be the best.

??? I posted =3= specific classes of common, general-purpose query
operations where OES and the OES Btrees look like they should be
superior to current methods:
1= when splitting sorting or other operations across multiple CPUs
2= when doing joins of different tables by doing the join on these Btrees
rather than the original tables.
3= when the opportunity arises to reuse OES Btree results of previous
sorts for different keys in the same table. Now we can combine the
existing Btrees to obtain the new order based on the composite key
without ever manipulating the original, much larger, table.

In what way are these examples not "concrete"?

We can all see that the method you describe might be a good way to sort
a very large dataset with some known properties, which would be fine if
you are trying to break the terasort benchmark. But that's not what
we're doing here. We are designing and operating relational databases.
So please explain the application.

This is a GENERAL method. It's based on CPU cache efficient Btrees that
use variable length prefix keys and RIDs.
It assumes NOTHING about the data or the system in order to work.
I gave some concrete examples for the sake of easing explanation, NOT
as an indication of assumptions or limitations of the method. I've even
gone out of my way to prove that no such assumptions or limitations exist.
Where in the world are you getting such impressions?

Your main example seems to focus on a large table where a key column has
constrained values. This case is interesting in proportion to the
number of possible values. If I have billions of rows, each having one
of only two values, I can think of a trivial and very fast method of
returning the table "sorted" by that key: make two sequential passes,
returning the first value on the first pass and the second value on the
second pass. This will be faster than the method you propose.

1= No that was not my main example. It was the simplest example used to
frame the later more complicated examples. Please don't get hung up on it.

2= You are incorrect. Since IO is the most expensive operation we can do,
any method that makes two passes through the data at top scanning speed
will take at least 2x as long as any method that only takes one such pass.

I think an important aspect you have failed to address is how much of
the heap you must visit after the sort is complete. If you are
returning every tuple in the heap then the optimal plan will be very
different from the case when you needn't.

Hmmm. Not sure which "heap" you are referring to, but the OES Btree
index is provably the lowest (in terms of tree height) and smallest
possible CPU cache efficient data structure that one can make and still
have all of the traditional benefits associated with a Btree representation
of a data set.

Nonetheless, returning a RID, or all RIDs with(out) the same Key, or all
RIDs (not) within a range of Keys, or simply all RIDs in sorted order is
efficient. Just as should be for a Btree (actually it's a B+ tree variant to
use Knuth's nomenclature). I'm sure someone posting from acm.org
recognizes how each of these Btree operations maps to various SQL
features...

I haven't been talking about query plans because they are orthogonal to
the issue under discussion. If we use a layered model for PostgreSQL's
architecture, this functionality is more primal than that of a query
planner. ALL query plans that currently involve sorts will benefit from a
more efficient way to do, or avoid, sorts.

PS: Whatever mailer you use doesn't understand or respect threading nor
attribution. Out of respect for the list's readers, please try a mailer
that supports these 30-year-old fundamentals of electronic mail.

That is an issue of infrastructure on the receiving side, not on the sending
(my) side since even my web mailer seems appropriately RFC conformant.
Everything seems to be going in the correct places and being properly
organized on archival.postgres.org ...

Ron

#18Pailloncy Jean-Gerard
In reply to: Ron Peacetree (#17)
Re: [PERFORM] A Better External Sort?

Your main example seems to focus on a large table where a key column has
constrained values. This case is interesting in proportion to the
number of possible values. If I have billions of rows, each having one
of only two values, I can think of a trivial and very fast method of
returning the table "sorted" by that key: make two sequential passes,
returning the first value on the first pass and the second value on the
second pass. This will be faster than the method you propose.

1= No that was not my main example. It was the simplest example used to
frame the later more complicated examples. Please don't get hung up on it.

2= You are incorrect. Since IO is the most expensive operation we can do,
any method that makes two passes through the data at top scanning speed
will take at least 2x as long as any method that only takes one such pass.

You do not get the point.
Once you have the sorted references to the tuples, you still need to
fetch the tuples themselves, check their visibility, etc., and return
them to the client.

So,
if there are only 2 values in the column of a big table that is larger
than available RAM,
two sequential scans of the table without any sorting
are the fastest solution.

Cordialement,
Jean-Gérard Pailloncy

#19Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Pailloncy Jean-Gerard (#18)
Re: [PERFORM] A Better External Sort?

In my original example, a sequential scan of the 1TB of 2KB
or 4KB records, => 250M or 500M records of data, being sorted
on a binary value key will take ~1000x more time than reading
in the ~1GB Btree I described that used a Key+RID (plus node
pointers) representation of the data.

Imho you seem to ignore the final step your algorithm needs of
collecting the data rows. After you have sorted the keys, the collect
step will effectively access the tuples in random order (given a
sufficiently large key range).

This random access is bad. It effectively allows a competing algorithm
to read the whole data set at least 40 times sequentially, or write the
set 20 times sequentially. (Those are the random/sequential ratios of
modern discs.)

Andreas

#20Pierre-Frédéric Caillaud
lists@boutiquenumerique.com
In reply to: Pailloncy Jean-Gerard (#18)
Re: [PERFORM] A Better External Sort?

Just to add a little anarchy in your nice debate...

Who really needs all the results of a sort on your terabyte table?

I guess not many people do a SELECT from such a table and want all the
results. So, this leaves :
- Really wanting all the results, to fetch using a cursor,
- CLUSTER type things, where you really want everything in order,
- Aggregates (Sort->GroupAggregate), which might really need to sort the
whole table.
- Complex queries where the whole dataset needs to be examined, in order
to return a few values
- Joins (again, the whole table is probably not going to be selected)
- And the ones I forgot.

However,

Most likely you only want to SELECT N rows, in some ordering:
- the first N (ORDER BY x LIMIT N)
- last N (ORDER BY x DESC LIMIT N)
- WHERE x>value ORDER BY x LIMIT N
- WHERE x<value ORDER BY x DESC LIMIT N
- and other variants

Or, you are doing a Merge JOIN against some other table; in that case,
yes, you might need the whole sorted terabyte table, but most likely there
are WHERE clauses in the query that restrict the set, and thus, maybe we
can get some conditions or limit values on the column to sort.

Also the new, optimized hash join, which is more memory efficient, might
cover this case.

Point is, sometimes, you only need part of the results of your sort. And
the bigger the sort, the most likely it becomes that you only want part of
the results. So, while we're in the fun hand-waving, new-algorithm-trying
mode, why not consider this right from the start? (I know I'm totally in
hand-waving mode right now, so slap me if needed).

I'd say your new, fancy sort algorithm needs a few more input values:

- Range of values that must appear in the final result of the sort:
none, minimum, maximum, both, or even a set of values from the other
side of the join, hashed, or sorted.
- LIMIT information (first N, last N, none)
- Enhanced Limit information (first/last N values of the second column to
sort, for each value of the first column) (the infamous "top10 by
category" query)
- etc.

With this, the amount of data that needs to be kept in memory is
dramatically reduced, from the whole table (even using your compressed
keys, that's big) to something more manageable which will be closer to the
size of the final result set which will be returned to the client, and
avoid a lot of effort.

So, this would not be useful in all cases, but when it applies, it would
be really useful.

Regards !

#21Josh Berkus
josh@agliodbs.com
In reply to: Jeffrey W. Baker (#15)
#22Luke Lonergan
llonergan@greenplum.com
In reply to: Josh Berkus (#21)
#23David Fetter
david@fetter.org
In reply to: Luke Lonergan (#22)
#24Jeffrey W. Baker
jwbaker@acm.org
In reply to: Luke Lonergan (#22)
#25Josh Berkus
josh@agliodbs.com
In reply to: Jeffrey W. Baker (#24)
#26Jeffrey W. Baker
jwbaker@acm.org
In reply to: Josh Berkus (#25)
#27Josh Berkus
josh@agliodbs.com
In reply to: Jeffrey W. Baker (#26)
#28Dann Corbit
DCorbit@connx.com
In reply to: Josh Berkus (#27)
#29Ron Peacetree
rjpeace@earthlink.net
In reply to: Dann Corbit (#28)
#30Ron Peacetree
rjpeace@earthlink.net
In reply to: Ron Peacetree (#29)
#31Luke Lonergan
llonergan@greenplum.com
In reply to: Jeffrey W. Baker (#24)
#32Ron Peacetree
rjpeace@earthlink.net
In reply to: Luke Lonergan (#31)
#33Josh Berkus
josh@agliodbs.com
In reply to: Ron Peacetree (#32)
#34Dann Corbit
DCorbit@connx.com
In reply to: Josh Berkus (#33)
#35Ron Peacetree
rjpeace@earthlink.net
In reply to: Dann Corbit (#34)
#36Jignesh K. Shah
J.K.Shah@Sun.COM
In reply to: Ron Peacetree (#35)
#37Luke Lonergan
llonergan@greenplum.com
In reply to: Ron Peacetree (#35)
#38Josh Berkus
josh@agliodbs.com
In reply to: Ron Peacetree (#35)
#39Dann Corbit
DCorbit@connx.com
In reply to: Josh Berkus (#38)
#40Pierre-Frédéric Caillaud
lists@boutiquenumerique.com
In reply to: Luke Lonergan (#37)
#41Simon Riggs
simon@2ndQuadrant.com
In reply to: Josh Berkus (#38)
#42Michael Stone
mstone+postgres@mathom.us
In reply to: Josh Berkus (#38)
#43Ron Peacetree
rjpeace@earthlink.net
In reply to: Michael Stone (#42)

#44 Dann Corbit <DCorbit@connx.com>
In reply to: Ron Peacetree (#43)

#45 Gregory Maxwell <gmaxwell@gmail.com>
In reply to: Ron Peacetree (#43)

#46 Gregory Maxwell <gmaxwell@gmail.com>
In reply to: Ron Peacetree (#12)

#47 Dann Corbit <DCorbit@connx.com>
In reply to: Gregory Maxwell (#46)

#48 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Jeffrey W. Baker (#24)

#49 Dann Corbit <DCorbit@connx.com>
In reply to: Tom Lane (#48)

#50 Simon Riggs <simon@2ndQuadrant.com>
In reply to: Tom Lane (#48)

#51 Hannu Krosing <hannu@tm.ee>
In reply to: Luke Lonergan (#37)

#52 Ron Peacetree <rjpeace@earthlink.net>
In reply to: Hannu Krosing (#51)

#53 Andrew Dunstan <andrew@dunslane.net>
In reply to: Ron Peacetree (#52)

#54 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Josh Berkus (#21)

#55 Bruce Momjian <bruce@momjian.us>
In reply to: Tom Lane (#48)

#56 Martijn van Oosterhout <kleptog@svana.org>
In reply to: Ron Peacetree (#52)

#57 Ron Peacetree <rjpeace@earthlink.net>
In reply to: Martijn van Oosterhout (#56)

#58 Ron Peacetree <rjpeace@earthlink.net>
In reply to: Ron Peacetree (#57)

#59 Martijn van Oosterhout <kleptog@svana.org>
In reply to: Ron Peacetree (#58)

#60 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Martijn van Oosterhout (#59)

#61 Martijn van Oosterhout <kleptog@svana.org>
In reply to: Tom Lane (#60)

#62 Martijn van Oosterhout <kleptog@svana.org>
In reply to: Martijn van Oosterhout (#61)

#63 Josh Berkus <josh@agliodbs.com>
In reply to: Michael Stone (#42)

#64 Josh Berkus <josh@agliodbs.com>
In reply to: Tom Lane (#54)

#65 Jeffrey W. Baker <jwbaker@acm.org>
In reply to: Josh Berkus (#63)

#66 Josh Berkus <josh@agliodbs.com>
In reply to: Jeffrey W. Baker (#65)

#67 Ron Peacetree <rjpeace@earthlink.net>
In reply to: Josh Berkus (#66)

#68 Luke Lonergan <llonergan@greenplum.com>
In reply to: Josh Berkus (#66)

#69 Jeffrey W. Baker <jwbaker@acm.org>
In reply to: Josh Berkus (#66)

#70 Hannu Krosing <hannu@tm.ee>
In reply to: Josh Berkus (#66)

#71 Simon Riggs <simon@2ndQuadrant.com>
In reply to: Martijn van Oosterhout (#62)

#72 Luke Lonergan <llonergan@greenplum.com>
In reply to: Hannu Krosing (#70)

#73 Michael Stone <mstone+postgres@mathom.us>
In reply to: Josh Berkus (#63)

#74 Josh Berkus <josh@agliodbs.com>
In reply to: Michael Stone (#73)

#75 Josh Berkus <josh@agliodbs.com>
In reply to: Jeffrey W. Baker (#69)

#76 Ron Peacetree <rjpeace@earthlink.net>
In reply to: Josh Berkus (#75)

#77 Gregory Maxwell <gmaxwell@gmail.com>
In reply to: Ron Peacetree (#76)

#78 Ron Peacetree <rjpeace@earthlink.net>
In reply to: Gregory Maxwell (#77)

#79 Martijn van Oosterhout <kleptog@svana.org>
In reply to: Simon Riggs (#71)

#80 Simon Riggs <simon@2ndQuadrant.com>
In reply to: Martijn van Oosterhout (#79)

#81 Martijn van Oosterhout <kleptog@svana.org>
In reply to: Simon Riggs (#80)

#82 Ron Peacetree <rjpeace@earthlink.net>
In reply to: Martijn van Oosterhout (#81)

#83 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Martijn van Oosterhout (#81)

#84 Martijn van Oosterhout <kleptog@svana.org>
In reply to: Tom Lane (#83)

#85 Simon Riggs <simon@2ndQuadrant.com>
In reply to: Martijn van Oosterhout (#84)

#86 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Martijn van Oosterhout (#84)

#87 Martijn van Oosterhout <kleptog@svana.org>
In reply to: Simon Riggs (#85)

#88 Mark Mielke <mark@mark.mielke.cc>
In reply to: Martijn van Oosterhout (#87)

#89 Michael Stone <mstone+postgres@mathom.us>
In reply to: Martijn van Oosterhout (#56)

#90 Michael Stone <mstone+postgres@mathom.us>
In reply to: Hannu Krosing (#70)

#91 Hannu Krosing <hannu@tm.ee>
In reply to: Michael Stone (#90)

#92 Martijn van Oosterhout <kleptog@svana.org>
In reply to: Michael Stone (#89)

#93 Luke Lonergan <llonergan@greenplum.com>
In reply to: Martijn van Oosterhout (#92)

#94 Michael Stone <mstone+postgres@mathom.us>
In reply to: Luke Lonergan (#93)

#95 Ron Peacetree <rjpeace@earthlink.net>
In reply to: Michael Stone (#94)

#96 Joshua D. Drake <jd@commandprompt.com>
In reply to: Ron Peacetree (#95)

#97 Ron Peacetree <rjpeace@earthlink.net>
In reply to: Joshua D. Drake (#96)

#98 Jeffrey W. Baker <jwbaker@acm.org>
In reply to: Ron Peacetree (#95)

#99 Andrej Ricnik-Bay <andrej.groups@gmail.com>
In reply to: Michael Stone (#94)

#100 Jonah H. Harris <jonah.harris@gmail.com>
In reply to: Ron Peacetree (#97)

#101 Ron Peacetree <rjpeace@earthlink.net>
In reply to: Jonah H. Harris (#100)

#102 Luke Lonergan <llonergan@greenplum.com>
In reply to: Michael Stone (#94)

#103 Steinar H. Gunderson <sgunderson@bigfoot.com>
In reply to: Luke Lonergan (#102)

#104 Michael Stone <mstone+postgres@mathom.us>
In reply to: Luke Lonergan (#102)

In reply to: Ron Peacetree (#97)

In reply to: Ron Peacetree (#101)

#107 Zeugswetter Andreas SB SD <ZeugswetterA@spardat.at>
In reply to: Hannu Krosing (#106)

#108 Josh Berkus <josh@agliodbs.com>
In reply to: Zeugswetter Andreas SB SD (#107)

In reply to: Ron Peacetree (#101)

#110 Luke Lonergan <llonergan@greenplum.com>
In reply to: Zeugswetter Andreas SB SD (#107)

#111 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Martijn van Oosterhout (#109)

In reply to: Tom Lane (#111)

#113 Alvaro Herrera <alvherre@2ndquadrant.com>
In reply to: Tom Lane (#111)

#114 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Martijn van Oosterhout (#112)

In reply to: Tom Lane (#114)

#116 Luke Lonergan <llonergan@greenplum.com>
In reply to: Steinar H. Gunderson (#103)

#117 Mark Mielke <mark@mark.mielke.cc>
In reply to: Luke Lonergan (#116)

#118 Luke Lonergan <llonergan@greenplum.com>
In reply to: Mark Mielke (#117)

#119 Mark Mielke <mark@mark.mielke.cc>
In reply to: Luke Lonergan (#118)

#120 Jim Nasby <Jim.Nasby@BlueTreble.com>
In reply to: Zeugswetter Andreas SB SD (#19)

#121 Evgeny Gridasov <eugrid@fpm.kubsu.ru>
In reply to: Tom Lane (#48)

#122 Javier Somoza <jsomoza@pandasoftware.es>
In reply to: Evgeny Gridasov (#121)

#123 Bruce Momjian <bruce@momjian.us>
In reply to: Javier Somoza (#122)

#124 PFC <lists@peufeu.com>
In reply to: Bruce Momjian (#123)

#125 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: PFC (#124)

#126 PFC <lists@peufeu.com>
In reply to: Tom Lane (#125)