Releasing memory during External sorting?

Started by Simon Riggs · over 20 years ago · 20 messages
#1Simon Riggs
simon@2ndquadrant.com

I have concerns about whether we are overallocating memory for use in
external sorts. (All code relating to this is in tuplesort.c)

When we begin a sort we allocate (work_mem | maintenance_work_mem) and
attempt to do the sort in memory. If the sort set is too big to fit in
memory we then write to disk and begin an external sort. The same memory
allocation is used for both types of sort, AFAICS.

The external sort algorithm benefits from some memory but not much.
Knuth says that the amount of memory required is very low, with a value
typically less than 1 kB. I/O overheads mean that there is benefit from
having longer sequential writes, so the optimum is much larger than
that. I've not seen any data that indicates that a setting higher than
16 MB adds any value at all to a large external sort. I have some
indications from private tests that very high memory settings may
actually hinder performance of the sorts, though I cannot explain that
and wonder whether it is the performance tests themselves that have
issues.

Does anyone have any clear data that shows the value of large settings
of work_mem when the data to be sorted is much larger than memory? (I am
well aware of the value of setting work_mem higher for smaller sorts, so
any performance data needs to reflect only very large sorts).

If not, I would propose that when we move from qsort to tapesort mode we
free the larger work_mem setting (if one exists) and allocate only a
lower, though still optimal setting for the tapesort. That way the
memory can be freed for use by other users or the OS while the tapesort
proceeds (which is usually quite a while...).

Feedback, please.

Best Regards, Simon Riggs

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#1)
Re: Releasing memory during External sorting?

Simon Riggs <simon@2ndquadrant.com> writes:

If not, I would propose that when we move from qsort to tapesort mode we
free the larger work_mem setting (if one exists) and allocate only a
lower, though still optimal setting for the tapesort. That way the
memory can be freed for use by other users or the OS while the tapesort
proceeds (which is usually quite a while...).

On most platforms it's quite unlikely that any memory would actually get
released back to the OS before transaction end, because the memory
blocks belonging to the tuplesort context will be intermixed with blocks
belonging to other contexts. So I think this is pretty pointless.
(If you can't afford to have the sort using all of sort_mem, you've set
sort_mem too large, anyway.)

regards, tom lane

#3Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: Releasing memory during External sorting?

On Fri, 2005-09-23 at 10:09 -0400, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

If not, I would propose that when we move from qsort to tapesort mode we
free the larger work_mem setting (if one exists) and allocate only a
lower, though still optimal setting for the tapesort. That way the
memory can be freed for use by other users or the OS while the tapesort
proceeds (which is usually quite a while...).

On most platforms it's quite unlikely that any memory would actually get
released back to the OS before transaction end, because the memory
blocks belonging to the tuplesort context will be intermixed with blocks
belonging to other contexts. So I think this is pretty pointless.

I take it you mean pointless because of the way the memory allocation
works, rather than because giving memory back isn't worthwhile?

Surely the sort memory would be allocated in contiguous chunks? In some
cases we might be talking about more than a GB of memory, so it'd be
good to get that back ASAP. I'm speculating....

(If you can't afford to have the sort using all of sort_mem, you've set
sort_mem too large, anyway.)

Sort takes care to allocate only what it needs as it starts up. All I'm
suggesting is to take the same care when the sort mode changes. If the
above argument held water then we would just allocate all the memory in
one lump at startup, "because we can afford to", so I don't buy that.

Since we know the predicted size of the sort set prior to starting the
sort node, could we not use that information to allocate memory
appropriately? i.e. if sort size is predicted to be more than twice the
size of work_mem, then just move straight to the external sort algorithm
and set the work_mem down at the lower limit?

That is, unless somebody has evidence that having a very large memory
has any performance benefit for external sorting?

Best Regards, Simon Riggs

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#3)
Re: Releasing memory during External sorting?

Simon Riggs <simon@2ndquadrant.com> writes:

Since we know the predicted size of the sort set prior to starting the
sort node, could we not use that information to allocate memory
appropriately? i.e. if sort size is predicted to be more than twice the
size of work_mem, then just move straight to the external sort algorithm
and set the work_mem down at the lower limit?

Have you actually read the sort code?

During the run-forming phase it's definitely useful to eat all the
memory you can: that translates directly to longer initial runs and
hence fewer merge passes. During the run-merging phase it's possible
that using less memory would not hurt performance any, but as already
stated, I don't think it will actually end up cutting the backend's
memory footprint --- the sbrk point will be established during the run
forming phase and it's unlikely to move back much until transaction end.

Also, if I recall the development of that code correctly, the reason for
using more than minimum memory during the merge phase is that writing or
reading lots of tuples at once improves sequentiality of access to the
temp files. So I'm not sure that cutting down the memory wouldn't hurt
performance.
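
For illustration, here is a minimal sketch of the replacement-selection
style of run forming described above -- not the tuplesort.c code; the
fixed heap of ints, the array sizes, and the stdin/stdout framing are
assumptions made only for this example:

/*
 * Replacement-selection run forming (in the spirit of Knuth 5.4.1R):
 * a bigger heap means longer initial runs and hence fewer merge passes.
 */
#include <stdio.h>
#include <stdlib.h>

#define HEAPSIZE 4096         /* "work_mem": more slots => longer runs */

static int heap[HEAPSIZE];    /* values still eligible for the current run */
static int nheap;

static void sift_down(int i)
{
    for (;;)
    {
        int l = 2 * i + 1, r = l + 1, min = i;

        if (l < nheap && heap[l] < heap[min]) min = l;
        if (r < nheap && heap[r] < heap[min]) min = r;
        if (min == i) break;
        int tmp = heap[i]; heap[i] = heap[min]; heap[min] = tmp;
        i = min;
    }
}

int main(void)
{
    int v, run = 0, last;
    int pending[HEAPSIZE];    /* values deferred to the next run */
    int npending = 0;

    /* Fill the heap from the input and heapify. */
    while (nheap < HEAPSIZE && scanf("%d", &v) == 1)
        heap[nheap++] = v;
    for (int i = nheap / 2 - 1; i >= 0; i--)
        sift_down(i);

    while (nheap > 0)
    {
        last = heap[0];
        printf("run %d: %d\n", run, last);  /* emit smallest eligible value */

        if (scanf("%d", &v) == 1)
        {
            if (v >= last)
                heap[0] = v;                /* still fits in the current run */
            else
            {
                pending[npending++] = v;    /* must wait for the next run */
                heap[0] = heap[--nheap];
            }
        }
        else
            heap[0] = heap[--nheap];        /* input exhausted: shrink heap */

        sift_down(0);

        if (nheap == 0 && npending > 0)     /* current run done: start next */
        {
            run++;
            while (npending > 0)
                heap[nheap++] = pending[--npending];
            for (int i = nheap / 2 - 1; i >= 0; i--)
                sift_down(i);
        }
    }
    return 0;
}

On random input this produces runs roughly twice HEAPSIZE long, which is
why giving the run-forming phase more memory pays off.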

regards, tom lane

#5Pailloncy Jean-Gerard
In reply to: Tom Lane (#2)
Re: Releasing memory during External sorting?

On most platforms it's quite unlikely that any memory would actually get
released back to the OS before transaction end, because the memory
blocks belonging to the tuplesort context will be intermixed with blocks
belonging to other contexts. So I think this is pretty pointless.
(If you can't afford to have the sort using all of sort_mem, you've set
sort_mem too large, anyway.)

On OpenBSD 3.8, malloc uses mmap(2) and no longer sbrk.
So, as soon as a block is freed, it is returned to the OS.
Accessing the freed pointer crashes immediately.

Cordialement,
Jean-Gérard Pailloncy

#6Ron Peacetree
rjpeace@earthlink.net
In reply to: Pailloncy Jean-Gerard (#5)
Re: [PERFORM] Releasing memory during External sorting?

From: Simon Riggs <simon@2ndquadrant.com>
Sent: Sep 23, 2005 5:37 AM
Subject: [PERFORM] Releasing memory during External sorting?

I have concerns about whether we are overallocating memory for use in
external sorts. (All code relating to this is in tuplesort.c)

A decent external sorting algorithm, say a Merge Sort + Radix (or
Distribution Counting) hybrid with appropriate optimizations for small
subfiles, should become more effective / efficient the more RAM you give it.

The external sort algorithm benefits from some memory but not much.

That's probably an artifact of the pgsql external sorting code and _not_
due to some fundamental external sorting issue.

Knuth says that the amount of memory required is very low, with a value
typically less than 1 kB.

"Required" means the external sort can operate on that little memory. How
Much memory is required for optimal performance is another matter.

I/O overheads mean that there is benefit from having longer sequential
writes, so the optimum is much larger than that. I've not seen any data
that indicates that a setting higher than 16 MB adds any value at all to a
large external sort.

It should. A first-pass upper bound would be the amount of RAM needed for
Replacement Selection to create a run (i.e. sort) of the whole file. That should
be ~ the amount of RAM needed to hold 1/2 the file in a Replacement Selection pass.

At the simplest, for any file over 32MB the optimum should be more than
16MB.

I have some indications from private tests that very high memory settings
may actually hinder performance of the sorts, though I cannot explain that
and wonder whether it is the performance tests themselves that have issues.

Hmmm. Are you talking about amounts so high that you are throwing the OS
into paging and swapping thrash behavior? If not, then the above is weird.

Does anyone have any clear data that shows the value of large settings
of work_mem when the data to be sorted is much larger than memory? (I am
well aware of the value of setting work_mem higher for smaller sorts, so
any performance data needs to reflect only very large sorts).

This is not PostgreSQL specific, but it does prove the point that the performance
of external sorts benefits greatly from large amounts of RAM being available:

http://research.microsoft.com/barc/SortBenchmark/

Looking at the particulars of the algorithms listed there should shed a lot of light
on what a "good" external sorting algorithm looks like:
1= HD IO matters the most.
1a= Seeking behavior is the largest factor in poor performance.
2= No optimal external sorting algorithm should use more than 2 passes.
3= Optimal external sorting algorithms should use 1 pass if at all possible.
4= Use as much RAM as possible, and use it as efficiently as possible.
5= The amount of RAM needed to hide the latency of a HD subsystem goes up as
the _square_ of the difference between the bandwidth of the HD subsystem and
memory.
6= Be cache friendly.
7= For large numbers of records whose sorting key is substantially smaller than
the record itself, use a pointer + compressed key representation and write the data
to HD in sorted order (Replace HD seeks with RAM seeks. Minimize RAM seeks).
8= Since your performance will be constrained by HD IO first and RAM IO second,
up to a point it is worth it to spend more CPU cycles to save on IO.

Given the large and growing gap between CPU IO, RAM IO, and HD IO, these issues
are becoming more important for _internal_ sorts as well.

Feedback, please.

Best Regards, Simon Riggs

Hope this is useful,
Ron

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ron Peacetree (#6)
Re: [PERFORM] Releasing memory during External sorting?

Ron Peacetree <rjpeace@earthlink.net> writes:

2= No optimal external sorting algorithm should use more than 2 passes.
3= Optimal external sorting algorithms should use 1 pass if at all possible.

A comparison-based sort must use at least N log N operations, so it
would appear to me that if you haven't got approximately log N passes
then your algorithm doesn't work.

regards, tom lane

#8Mark Lewis
mark.lewis@mir3.com
In reply to: Tom Lane (#7)
Re: [PERFORM] Releasing memory during External sorting?

operations != passes. If you were clever, you could probably write a
modified bubble-sort algorithm that only made 2 passes. A pass is a
disk scan, operations are then performed (hopefully in memory) on what
you read from the disk. So there's no theoretical log N lower-bound on
the number of disk passes.

Not that I have anything else useful to add to this discussion, just a
tidbit I remembered from my CS classes back in college :)

-- Mark


On Fri, 2005-09-23 at 13:17 -0400, Tom Lane wrote:

Ron Peacetree <rjpeace@earthlink.net> writes:

2= No optimal external sorting algorithm should use more than 2 passes.
3= Optimal external sorting algorithms should use 1 pass if at all possible.

A comparison-based sort must use at least N log N operations, so it
would appear to me that if you haven't got approximately log N passes
then your algorithm doesn't work.

regards, tom lane


#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Lewis (#8)
Re: [PERFORM] Releasing memory during External sorting?

Mark Lewis <mark.lewis@mir3.com> writes:

operations != passes. If you were clever, you could probably write a
modified bubble-sort algorithm that only made 2 passes. A pass is a
disk scan, operations are then performed (hopefully in memory) on what
you read from the disk. So there's no theoretical log N lower-bound on
the number of disk passes.

Given infinite memory that might be true, but I don't think I believe it
for limited memory. If you have room for K tuples in memory then it's
impossible to perform more than K*N useful comparisons per pass (ie, as
each tuple comes off the disk you can compare it to all the ones
currently in memory; anything more is certainly redundant work). So if
K < logN it's clearly not gonna work.

It's possible that you could design an algorithm that works in a fixed
number of passes if you are allowed to assume you can hold O(log N)
tuples in memory --- and in practice that would probably work fine,
if the constant factor implied by the O() isn't too big. But it's not
really solving the general external-sort problem.

regards, tom lane

#10Martijn van Oosterhout
kleptog@svana.org
In reply to: Pailloncy Jean-Gerard (#5)
Re: Releasing memory during External sorting?

On Fri, Sep 23, 2005 at 06:39:35PM +0200, Pailloncy Jean-Gerard wrote:

On most platforms it's quite unlikely that any memory would actually
get released back to the OS before transaction end, because the
memory blocks belonging to the tuplesort context will be intermixed
with blocks belonging to other contexts. So I think this is pretty
pointless. (If you can't afford to have the sort using all of
sort_mem, you've set sort_mem too large, anyway.)

On OpenBSD 3.8 malloc use mmap(2) and no more sbrk.
So, as soon as the bloc is free, it returns to the OS.
Access to the freed pointer crashs immediatly.

Interesting point. Glibc also uses mmap() but only for allocations
greater than a few K, otherwise it's a waste of space.
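
A quick way to see that behaviour, assuming a Linux/glibc box (the
threshold value below is just an illustrative setting, not anything
PostgreSQL does):

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Ask glibc to serve allocations above 128 kB via mmap(). */
    mallopt(M_MMAP_THRESHOLD, 128 * 1024);

    void *big   = malloc(8 * 1024 * 1024);  /* above threshold: mmap()ed   */
    void *small = malloc(1024);             /* below threshold: sbrk heap  */

    free(big);     /* munmap()ed, so the 8MB really goes back to the OS    */
    free(small);   /* stays in the process's heap until the arena shrinks  */

    malloc_stats();                         /* glibc: prints arena usage   */
    return 0;
}

Whether any of the tuplesort allocations would actually land in
mmap()ed blocks depends on how palloc chunks them, as you say.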

I guess you would have to look into the PostgreSQL allocator to see
whether it divides the mmap()ed space up between multiple contexts.
Large allocations certainly appear to be passed off to malloc(), but I
don't think execSort allocates all its space in one go; it just counts
the space allocated by palloc().

So, unless someone goes and changes the tuplesort code to allocate
big blocks and use them only for tuples, I think you're going to run
into issues with interleaved data, meaning not much to give back to the
OS...
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#11Ron Peacetree
rjpeace@earthlink.net
In reply to: Martijn van Oosterhout (#10)
Re: [PERFORM] Releasing memory during External sorting?

Yep. Also, bear in mind that the lg(n!) =~ n lg n - n lower bound on
the number of comparisons:
a= says nothing about the amount of data movement used.
b= only holds for generic comparison based sorting algorithms.

As Knuth says (vol 3, p180), Distribution Counting sorts without
ever comparing elements to each other at all, and so does Radix
Sort. Similar comments can be found in many algorithms texts.
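
For reference, a bare-bones Distribution Counting sketch -- the
byte-sized key range and the little driver are assumptions made for the
example, not code from Knuth or from any posted implementation:

/*
 * Distribution Counting: sorts n keys drawn from a known, restricted
 * range without any element-to-element comparisons, in O(n + range).
 */
#include <stdio.h>
#include <string.h>

#define RANGE 256              /* keys are known to lie in [0, RANGE) */

static void distribution_sort(const unsigned char *in, unsigned char *out,
                              size_t n)
{
    size_t count[RANGE + 1];
    memset(count, 0, sizeof(count));

    for (size_t i = 0; i < n; i++)          /* 1st pass: histogram        */
        count[in[i] + 1]++;
    for (int k = 0; k < RANGE; k++)         /* prefix sums => start slots */
        count[k + 1] += count[k];
    for (size_t i = 0; i < n; i++)          /* 2nd pass: stable scatter   */
        out[count[in[i]]++] = in[i];
}

int main(void)
{
    unsigned char data[] = { 42, 7, 200, 7, 13, 42, 0, 255 };
    unsigned char sorted[sizeof(data)];

    distribution_sort(data, sorted, sizeof(data));
    for (size_t i = 0; i < sizeof(data); i++)
        printf("%u ", sorted[i]);
    printf("\n");
    return 0;
}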

Any time we know that the range of the data to be sorted is substantially
restricted compared to the number of items to be sorted, we can sort in
less than O(lg(n!)) time. DB fields tend to take on few values and are
therefore "substantially restricted".

Given the proper resources and algorithms, O(n) sorts are very plausible
when sorting DB records.

All of the fastest external sorts of the last decade or so take advantage of
this. Check out that URL I posted.

Ron

-----Original Message-----
From: Mark Lewis <mark.lewis@mir3.com>
Sent: Sep 23, 2005 1:43 PM
To: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [PERFORM] Releasing memory during External sorting?

operations != passes. If you were clever, you could probably write a
modified bubble-sort algorithm that only made 2 passes. A pass is a
disk scan, operations are then performed (hopefully in memory) on what
you read from the disk. So there's no theoretical log N lower-bound on
the number of disk passes.

Not that I have anything else useful to add to this discussion, just a
tidbit I remembered from my CS classes back in college :)

-- Mark


On Fri, 2005-09-23 at 13:17 -0400, Tom Lane wrote:

Ron Peacetree <rjpeace@earthlink.net> writes:

2= No optimal external sorting algorithm should use more than 2 passes.
3= Optimal external sorting algorithms should use 1 pass if at all possible.

A comparison-based sort must use at least N log N operations, so it
would appear to me that if you haven't got approximately log N passes
then your algorithm doesn't work.

regards, tom lane

#12Ron Peacetree
rjpeace@earthlink.net
In reply to: Ron Peacetree (#11)
Re: [PERFORM] Releasing memory during External sorting?

From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Sep 23, 2005 2:15 PM
Subject: Re: [PERFORM] Releasing memory during External sorting?

Mark Lewis <mark.lewis@mir3.com> writes:

operations != passes. If you were clever, you could probably write a
modified bubble-sort algorithm that only made 2 passes. A pass is a
disk scan, operations are then performed (hopefully in memory) on what
you read from the disk. So there's no theoretical log N lower-bound on
the number of disk passes.

Given infinite memory that might be true, but I don't think I believe it
for limited memory. If you have room for K tuples in memory then it's
impossible to perform more than K*N useful comparisons per pass (ie, as
each tuple comes off the disk you can compare it to all the ones
currently in memory; anything more is certainly redundant work). So if
K < logN it's clearly not gonna work.

Actually, it's far better than that. I recall a paper I saw in one of the
algorithms journals 15+ years ago that proved that if you knew the range
of the data, regardless of what that range was, and had n^2 space, you
could sort n items in O(n) time.

Turns out that with very modest constraints on the range of the data and
substantially less extra space (about the same as you'd need for
Replacement Selection + External Merge Sort), you can _still_ sort in
O(n) time.

It's possible that you could design an algorithm that works in a fixed
number of passes if you are allowed to assume you can hold O(log N)
tuples in memory --- and in practice that would probably work fine,
if the constant factor implied by the O() isn't too big. But it's not
really solving the general external-sort problem.

If you know nothing about the data to be sorted and must guard against
the worst possible edge cases, AKA the classic definition of "the general
external sorting problem", then one can't do better than some variant
of Replacement Selection + Unbalanced Multiway Merge.

OTOH, ITRW things are _not_ like that. We know the range of the data
in our DB fields or we can safely assume it to be relatively constrained.
This allows us access to much better external sorting algorithms.

For example Postman Sort (the 2005 winner of the PennySort benchmark)
is basically an IO optimized version of an external Radix Sort.

Ron

#13Dann Corbit
DCorbit@connx.com
In reply to: Ron Peacetree (#12)
Re: [PERFORM] Releasing memory during External sorting?

For the subfiles, load the top element of each subfile into a priority
queue. Extract the min element and write it to disk. If the next value
is the same, then the queue does not need to be adjusted. If the next
value in the subfile changes, then adjust it.

Then, when the lowest element in the priority queue changes, adjust the
queue.

Keep doing that until the queue is empty.

You can create all the subfiles in one pass over the data.

You can read all the subfiles, merge them, and write them out in a
second pass (no matter how many of them there are).

Replacement selection is not a good idea any more, since obvious better
ideas should take over. Longer runs are of no value if you do not have
to do multiple merge passes.

I have explained this general technique in the book "C Unleashed",
chapter 13.

Sample code is available on the book's home page.
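
Not the book's code, but a minimal sketch of that single merge pass --
the "run%d.txt" naming and the one-sorted-integer-per-line run format
are invented purely for the example:

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    long  value;    /* current head element of this subfile */
    FILE *fp;
} HeapItem;

static void sift_down(HeapItem *h, int n, int i)
{
    for (;;)
    {
        int l = 2 * i + 1, r = l + 1, min = i;

        if (l < n && h[l].value < h[min].value) min = l;
        if (r < n && h[r].value < h[min].value) min = r;
        if (min == i) break;
        HeapItem tmp = h[i]; h[i] = h[min]; h[min] = tmp;
        i = min;
    }
}

int main(int argc, char **argv)
{
    int nruns = (argc > 1) ? atoi(argv[1]) : 2;   /* number of subfiles */
    HeapItem *heap = malloc(sizeof(HeapItem) * nruns);
    int n = 0;

    /* Prime the queue with the top element of every subfile. */
    for (int i = 0; i < nruns; i++)
    {
        char name[64];
        snprintf(name, sizeof(name), "run%d.txt", i);
        FILE *fp = fopen(name, "r");
        if (fp && fscanf(fp, "%ld", &heap[n].value) == 1)
            heap[n++].fp = fp;
        else if (fp)
            fclose(fp);
    }
    for (int i = n / 2 - 1; i >= 0; i--)
        sift_down(heap, n, i);

    /* Single merge pass: emit the minimum, refill from that subfile. */
    while (n > 0)
    {
        printf("%ld\n", heap[0].value);
        if (fscanf(heap[0].fp, "%ld", &heap[0].value) != 1)
        {
            fclose(heap[0].fp);
            heap[0] = heap[--n];
        }
        /* Always re-sift; the "skip if the value is the same" shortcut
         * described above is omitted for brevity. */
        sift_down(heap, n, 0);
    }
    free(heap);
    return 0;
}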

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ron Peacetree
Sent: Friday, September 23, 2005 11:41 AM
To: Mark Lewis; Tom Lane; pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] Releasing memory during External sorting?

Yep. Also, bear in mind that the lg(n!) =~ n lg n - n lower bound on
the number of comparisons:
a= says nothing about the amount of data movement used.
b= only holds for generic comparison based sorting algorithms.

As Knuth says (vol 3, p180), Distribution Counting sorts without
ever comparing elements to each other at all, and so does Radix
Sort. Similar comments can be found in many algorithms texts.

Any time we know that the range of the data to be sorted is substantially
restricted compared to the number of items to be sorted, we can sort in
less than O(lg(n!)) time. DB fields tend to take on few values and are
therefore "substantially restricted".

Given the proper resources and algorithms, O(n) sorts are very plausible
when sorting DB records.

All of the fastest external sorts of the last decade or so take advantage of
this. Check out that URL I posted.

Ron


#14Dann Corbit
DCorbit@connx.com
In reply to: Dann Corbit (#13)
Re: [PERFORM] Releasing memory during External sorting?

The cited book also explains how to use a callback function to perform
arbitrary radix sorts (you simply need a method that returns the
[bucketsize] most significant bits for a given data type, for the length
of the key).

So you can sort fairly arbitrary data in linear time (of course if the
key is long then O(n*log(n)) will be better anyway.)
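
A rough sketch of that callback idea, using an LSD pass order and fixed
8-bit digits for brevity -- the record layout, the callback signature,
and the driver are assumptions made for this example, not the book's
code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Callback: return the 8-bit "digit" of the key for the given pass. */
typedef unsigned (*digit_fn)(const void *rec, int pass);

/* LSD radix sort: one counting pass per digit, least significant first;
 * no element-to-element comparisons anywhere. */
static void radix_sort(void *base, size_t n, size_t size, int npasses,
                       digit_fn digit)
{
    unsigned char *src = base;
    unsigned char *tmp = malloc(n * size);

    for (int pass = 0; pass < npasses; pass++)
    {
        size_t count[257] = {0};

        for (size_t i = 0; i < n; i++)                 /* histogram      */
            count[digit(src + i * size, pass) + 1]++;
        for (int b = 0; b < 256; b++)                  /* prefix sums    */
            count[b + 1] += count[b];
        for (size_t i = 0; i < n; i++)                 /* stable scatter */
        {
            unsigned d = digit(src + i * size, pass);
            memcpy(tmp + count[d]++ * size, src + i * size, size);
        }
        memcpy(src, tmp, n * size);
    }
    free(tmp);
}

/* Example record: an unsigned 32-bit key plus a payload. */
typedef struct { unsigned key; const char *payload; } Rec;

static unsigned rec_digit(const void *rec, int pass)
{
    return (((const Rec *) rec)->key >> (8 * pass)) & 0xFF;
}

int main(void)
{
    Rec recs[] = { {30000, "c"}, {5, "a"}, {70000, "d"}, {512, "b"} };

    radix_sort(recs, 4, sizeof(Rec), 4, rec_digit);    /* 4 passes, 32-bit key */
    for (int i = 0; i < 4; i++)
        printf("%u %s\n", recs[i].key, recs[i].payload);
    return 0;
}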

But in any case, if we are talking about external sorting, then disk
time will be so totally dominant that the choice of algorithm is
practically irrelevant.

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Dann Corbit
Sent: Friday, September 23, 2005 2:21 PM
To: Ron Peacetree; Mark Lewis; Tom Lane; pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] Releasing memory during External sorting?

For the subfiles, load the top element of each subfile into a priority
queue. Extract the min element and write it to disk. If the next value
is the same, then the queue does not need to be adjusted. If the next
value in the subfile changes, then adjust it.

Then, when the lowest element in the priority queue changes, adjust the
queue.

Keep doing that until the queue is empty.

You can create all the subfiles in one pass over the data.

You can read all the subfiles, merge them, and write them out in a
second pass (no matter how many of them there are).

Replacement selection is not a good idea any more, since obvious better
ideas should take over. Longer runs are of no value if you do not have
to do multiple merge passes.

I have explained this general technique in the book "C Unleashed",
chapter 13.

Sample code is available on the book's home page.


#15Meir Maor
meirmaor@gmail.com
In reply to: Simon Riggs (#1)
Re: Releasing memory during External sorting?

Calculating the optimal memory for a disk-based sort is based only on
minimizing IO.

A previous post stated that we can merge as many subfiles as we want in a
single pass. This is not accurate, because we want to eliminate disk seeks
in the merge phase as well, and the merging should be done by reading
blocks of data from each subfile. If we have data of size N and memory M,
then after sorting each subfile we will have K = N/M subfiles to merge.
In the merge operation, if we want to merge all the subfiles in one pass,
we read M/K of data from each subfile into memory and begin merging,
reading another M/K block whenever a subfile's buffer is empty.

We would like disk seek time to be irrelevant compared to sequential IO
time. Notice that we are then performing IO in blocks of M/K = M/(N/M) =
M^2/N. Let us assume that sequential IO runs at 100MB/s and that a random
seek requires ~15ms. If we want seek time to be irrelevant by one order of
magnitude, then in the time of one random seek we can read 1.5MB of data,
and we would get optimal performance if we perform IO in blocks of 15MB.
Since in the merge algorithm shown above we perform IO in blocks of M/K,
we would like M > K*15MB, which results in a very large memory
requirement:
M^2 > N*15MB
M > sqrt(N*15MB)
For example, for sorting 10GB of data we would like M > 380MB
for optimal performance.

Alternatively, we can choose a different algorithm in which we merge only
a constant number of subfiles together at a time, but then we will require
multiple passes to merge the entire file: log(K) passes over the entire
data. This approach obviously improves with an increase of memory.

The first approach requires 2 passes over the entire data and K^2+K random
seeks; the second approach (when merging l subfiles at a time) requires
log_l(K) passes over the data and K*l+K random seeks.
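
A quick numeric check of that rule of thumb, using the same assumed
figures (100MB/s sequential IO, ~15ms seeks, one order of magnitude
margin -- all assumptions from the reasoning above, not measurements):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double seek_s  = 0.015;          /* one random seek, seconds           */
    double seq_Bps = 100e6;          /* sequential throughput, bytes/s     */
    double factor  = 10.0;           /* want seeks 10x cheaper than the IO */
    double block   = factor * seek_s * seq_Bps;   /* ~15 MB per read      */

    double N = 10e9;                 /* 10 GB of data to sort              */
    double M = sqrt(N * block);      /* memory so that M/K >= block size   */

    printf("IO block target: %.0f MB\n", block / 1e6);
    printf("memory for a one-pass merge of 10 GB: %.0f MB\n", M / 1e6);
    /* prints roughly 387 MB, i.e. the ~380MB figure above */
    return 0;
}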


On 9/23/05, Simon Riggs <simon@2ndquadrant.com> wrote:

I have concerns about whether we are overallocating memory for use in
external sorts. (All code relating to this is in tuplesort.c)

When we begin a sort we allocate (work_mem | maintenance_work_mem) and
attempt to do the sort in memory. If the sort set is too big to fit in
memory we then write to disk and begin an external sort. The same memory
allocation is used for both types of sort, AFAICS.

The external sort algorithm benefits from some memory but not much.
Knuth says that the amount of memory required is very low, with a value
typically less than 1 kB. I/O overheads mean that there is benefit from
having longer sequential writes, so the optimum is much larger than
that. I've not seen any data that indicates that a setting higher than
16 MB adds any value at all to a large external sort. I have some
indications from private tests that very high memory settings may
actually hinder performance of the sorts, though I cannot explain that
and wonder whether it is the performance tests themselves that have
issues.

Does anyone have any clear data that shows the value of large settings
of work_mem when the data to be sorted is much larger than memory? (I am
well aware of the value of setting work_mem higher for smaller sorts, so
any performance data needs to reflect only very large sorts).

If not, I would propose that when we move from qsort to tapesort mode we
free the larger work_mem setting (if one exists) and allocate only a
lower, though still optimal setting for the tapesort. That way the
memory can be freed for use by other users or the OS while the tapesort
proceeds (which is usually quite a while...).

Feedback, please.

Best Regards, Simon Riggs


#16Dann Corbit
DCorbit@connx.com
In reply to: Meir Maor (#15)
Re: Releasing memory during External sorting?

Generally, when you read from a set of subfiles, the OS will cache the
reads to some degree, so the disk-seek jitter is not always that bad. On
a highly fragmented disk drive, you might also jump all over the place
reading serially from a single subfile. Of course, every situation is
different. At any rate, I would recommend benchmarking many different
approaches. It is also rather important how much memory is available to
perform sorts and reads. But with memory becoming larger and cheaper,
it becomes less and less of a worry.

________________________________

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Meir Maor
Sent: Friday, September 23, 2005 10:24 PM
To: Simon Riggs
Cc: pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org
Subject: Re: [HACKERS] Releasing memory during External sorting?

Calculating the optimal memory for a disk-based sort is based only on
minimizing IO.

A previous post stated that we can merge as many subfiles as we want in a
single pass. This is not accurate, because we want to eliminate disk seeks
in the merge phase as well, and the merging should be done by reading
blocks of data from each subfile. If we have data of size N and memory M,
then after sorting each subfile we will have K = N/M subfiles to merge.
In the merge operation, if we want to merge all the subfiles in one pass,
we read M/K of data from each subfile into memory and begin merging,
reading another M/K block whenever a subfile's buffer is empty.

We would like disk seek time to be irrelevant compared to sequential IO
time. Notice that we are then performing IO in blocks of M/K = M/(N/M) =
M^2/N. Let us assume that sequential IO runs at 100MB/s and that a random
seek requires ~15ms. If we want seek time to be irrelevant by one order of
magnitude, then in the time of one random seek we can read 1.5MB of data,
and we would get optimal performance if we perform IO in blocks of 15MB.
Since in the merge algorithm shown above we perform IO in blocks of M/K,
we would like M > K*15MB, which results in a very large memory
requirement:
M^2 > N*15MB
M > sqrt(N*15MB)
For example, for sorting 10GB of data we would like M > 380MB
for optimal performance.

Alternatively, we can choose a different algorithm in which we merge only
a constant number of subfiles together at a time, but then we will require
multiple passes to merge the entire file: log(K) passes over the entire
data. This approach obviously improves with an increase of memory.

The first approach requires 2 passes over the entire data and K^2+K random
seeks; the second approach (when merging l subfiles at a time) requires
log_l(K) passes over the data and K*l+K random seeks.


#17Ron Peacetree
rjpeace@earthlink.net
In reply to: Dann Corbit (#16)
Re: [PERFORM] Releasing memory during External sorting?

From: Dann Corbit <DCorbit@connx.com>
Sent: Sep 23, 2005 5:38 PM
Subject: RE: [HACKERS] [PERFORM] Releasing memory during External sorting?

_C Unleashed_ also explains how to use a callback function to perform
arbitrary radix sorts (you simply need a method that returns the
[bucketsize] most significant bits for a given data type, for the length
of the key).

So you can sort fairly arbitrary data in linear time (of course if the
key is long then O(n*log(n)) will be better anyway.)

But in any case, if we are talking about external sorting, then disk
time will be so totally dominant that the choice of algorithm is
practically irrelevant.

Horsefeathers. Jim Gray's sorting contest site:
http://research.microsoft.com/barc/SortBenchmark/

proves that the choice of algorithm can have a profound effect on
performance. After all, the amount of IO done is the most
important of the things that you should be optimizing for in
choosing an external sorting algorithm.

Clearly, if we know or can assume the range of the data in question
the theoretical minimum amount of IO is one pass through all of the
data (otherwise, we are back in O(lg(n!)) land ). Equally clearly, for
HD's that one pass should involve as few seeks as possible.

In fact, such a principle can be applied to _all_ forms of IO: HD,
RAM, and CPU cache. The absolute best that any sort can
possibly do is to make one pass through the data and deduce the
proper ordering of the data during that one pass.

It's usually also important that our algorithm be Stable, preferably
Wholly Stable.

Let's call such a sort Optimal External Sort (OES). Just how much
faster would it be than current practice?

The short answer is the difference between how long it currently
takes to sort a file vs how long it would take to "cat" the contents
of the same file to a RAM buffer (_without_ displaying it). IOW,
there's SIGNIFICANT room for improvement over current
standard practice in terms of sorting performance, particularly
external sorting performance.

Since sorting is a fundamental operation in many parts of a DBMS,
this is a Big Deal.

This discussion has gotten my creative juices flowing. I'll post
some Straw Man algorithm sketches after I've done some more
thought.

Ron

-----Original Message-----
From: Dann Corbit <DCorbit@connx.com>
Sent: Friday, September 23, 2005 2:21 PM
Subject: Re: [HACKERS] [PERFORM] Releasing memory during ...

For the subfiles, load the top element of each subfile into a priority
queue. Extract the min element and write it to disk. If the next
value is the same, then the queue does not need to be adjusted.
If the next value in the subfile changes, then adjust it.

Then, when the lowest element in the priority queue changes, adjust
the queue.

Keep doing that until the queue is empty.

You can create all the subfiles in one pass over the data.

You can read all the subfiles, merge them, and write them out in a
second pass (no matter how many of them there are).

The Gotcha with Priority Queues is that their performance depends
entirely on implementation. In naive implementations either Enqueue()
or Dequeue() takes O(n) time, which reduces sorting time to O(n^2).

The best implementations I know of need O(lglgn) time for those
operations, allowing sorting to be done in O(nlglgn) time.
Unfortunately, there's a lot of data manipulation going on in the
process and two IO passes are required to sort any given file.
Priority Queues do not appear to be very "IO friendly".

I know of no sorting performance benchmark contest winner based on
Priority Queues.

Replacement selection is not a good idea any more, since obvious
better ideas should take over. Longer runs are of no value if you do not
have to do multiple merge passes.

Judging from the literature and the contest winners, Replacement
Selection is still a viable and important technique. Besides Priority
Queues, what "obvious better ideas" have you heard of?

I have explained this general technique in the book "C Unleashed",
chapter 13.

Sample code is available on the book's home page.

URL please?

#18Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#4)
Re: Releasing memory during External sorting?

On Fri, 2005-09-23 at 11:31 -0400, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

Since we know the predicted size of the sort set prior to starting the
sort node, could we not use that information to allocate memory
appropriately? i.e. if sort size is predicted to be more than twice the
size of work_mem, then just move straight to the external sort algorithm
and set the work_mem down at the lower limit?

Have you actually read the sort code?

Yes and Knuth too. Your research and code are incredible, almost
untouchable. Yet sort performance is important and empirical evidence
suggests that this can be improved upon significantly, so I am and will
be spending time trying to improve upon that. Another time...

This thread was aiming to plug a problem I saw with 8.1's ability to use
very large work_mem settings. I felt that either my performance numbers
were wrong or we needed to do something; I've not had anybody show me
performance numbers that prove mine doubtful, yet.

During the run-forming phase it's definitely useful to eat all the
memory you can: that translates directly to longer initial runs and
hence fewer merge passes.

Sounds good, but maybe that is not the dominant effect. I'll retest, on
the assumption that there is a benefit and that something was wrong with
my earlier tests.

During the run-merging phase it's possible
that using less memory would not hurt performance any, but as already
stated, I don't think it will actually end up cutting the backend's
memory footprint --- the sbrk point will be established during the run
forming phase and it's unlikely to move back much until transaction end.

Also, if I recall the development of that code correctly, the reason for
using more than minimum memory during the merge phase is that writing or
reading lots of tuples at once improves sequentiality of access to the
temp files. So I'm not sure that cutting down the memory wouldn't hurt
performance.

Cutting memory below about 16 MB does definitely hurt external sort
performance; I explain that as being the effect of sequential access. I
haven't looked to nail down the breakpoint exactly, since it seemed more
important simply to say that there appeared to be one. It's just that
raising it above that mark doesn't help much, according to my current
results.

I'll get some more test results and repost them, next week. I will be
very happy if the results show that more memory helps.

Best Regards, Simon Riggs

#19Simon Riggs
simon@2ndquadrant.com
In reply to: Ron Peacetree (#6)
Re: [PERFORM] Releasing memory during External sorting?

On Fri, 2005-09-23 at 12:48 -0400, Ron Peacetree wrote:

I have some indications from private tests that very high memory settings
may actually hinder performance of the sorts, though I cannot explain that
and wonder whether it is the performance tests themselves that have issues.

Hmmm. Are you talking about amounts so high that you are throwing the OS
into paging and swapping thrash behavior? If not, then the above is weird.

Thanks for your thoughts. I'll retest, on the assumption that there is a
benefit and that something was wrong with my earlier tests.

Best Regards, Simon Riggs

#20Dann Corbit
DCorbit@connx.com
In reply to: Simon Riggs (#19)
Re: [PERFORM] Releasing memory during External sorting?

-----Original Message-----
From: Ron Peacetree [mailto:rjpeace@earthlink.net]
Sent: Saturday, September 24, 2005 3:31 AM
To: Dann Corbit; pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org
Subject: RE: [HACKERS] [PERFORM] Releasing memory during External sorting?

From: Dann Corbit <DCorbit@connx.com>
Sent: Sep 23, 2005 5:38 PM
Subject: RE: [HACKERS] [PERFORM] Releasing memory during External sorting?

_C Unleashed_ also explains how to use a callback function to perform
arbitrary radix sorts (you simply need a method that returns the
[bucketsize] most significant bits for a given data type, for the length
of the key).

So you can sort fairly arbitrary data in linear time (of course if the
key is long then O(n*log(n)) will be better anyway.)

But in any case, if we are talking about external sorting, then disk
time will be so totally dominant that the choice of algorithm is
practically irrelevant.

Horsefeathers. Jim Gray's sorting contest site:
http://research.microsoft.com/barc/SortBenchmark/

proves that the choice of algorithm can have a profound effect on
performance.

Picklesmoke. I was referring to the algorithm used to perform the sort
stage, not the algorithm used to perform the IO, which has the dominant
effect on the overall sort time. I thought that should be clear from
context.

After all, the amount of IO done is the most
important of the things that you should be optimizing for in
choosing an external sorting algorithm.

Replacement selection uses a terrible O(f(n)) algorithm. The only
reason it is a customary choice for external sorting is because the runs
are twice as long.

Using a classical merge sequence, replacement selection leaves you with
half as many subfiles as other methods, which saves ONE read pass and ONE
write pass.

Suppose, for instance, that you have 64 subfiles. Using a classical
two-way merge, they will have to be read and merged in a first pass,
giving 32, then again giving 16, then again giving 8, then again giving 4,
then again giving two, and one final pass to create one file.

So, if replacement selection were applied, there would be 6 read/write
passes instead of seven in this problem set. After the creation of the
original subfiles, the algorithm I listed reads once and writes once and
is done.
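
A back-of-the-envelope check of those pass counts (the run counts and
fan-ins below are just the numbers from the example above):

#include <math.h>
#include <stdio.h>

static int total_passes(int nruns, int fanin)
{
    /* one pass to create the runs, then ceil(log_fanin(nruns)) merge passes */
    return 1 + (int) ceil(log2(nruns) / log2(fanin));
}

int main(void)
{
    printf("64 runs, 2-way merge:  %d passes\n", total_passes(64, 2));   /* 7 */
    printf("32 runs, 2-way merge:  %d passes\n", total_passes(32, 2));   /* 6 */
    printf("64 runs, 64-way merge: %d passes\n", total_passes(64, 64));  /* 2 */
    return 0;
}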

So what about the argument for skipping around? Well, first of all, the
OS is going to cache the reads to a large degree. And second, even if we
read and wrote a single record per operation with no buffering, reading
the data only once is still better than physically reading and writing
the whole file seven times over.

But don't take my word for it. Try it yourself. It is laughably
trivial to implement it.

Clearly, if we know or can assume the range of the data in question
the theoretical minimum amount of IO is one pass through all of the
data (otherwise, we are back in O(lg(n!)) land). Equally clearly, for
HD's that one pass should involve as few seeks as possible.

In fact, such a principle can be applied to _all_ forms of IO: HD,
RAM, and CPU cache. The absolute best that any sort can
possibly do is to make one pass through the data and deduce the
proper ordering of the data during that one pass.

It's usually also important that our algorithm be Stable, preferably
Wholly Stable.

Let's call such a sort Optimal External Sort (OES). Just how much
faster would it be than current practice?

The short answer is the difference between how long it currently
takes to sort a file vs how long it would take to "cat" the contents
of the same file to a RAM buffer (_without_ displaying it). IOW,
there's SIGNIFICANT room for improvement over current
standard practice in terms of sorting performance, particularly
external sorting performance.

Since sorting is a fundamental operation in many parts of a DBMS,
this is a Big Deal.

This discussion has gotten my creative juices flowing. I'll post
some Straw Man algorithm sketches after I've done some more
thought.

Ron

-----Original Message-----
From: Dann Corbit <DCorbit@connx.com>
Sent: Friday, September 23, 2005 2:21 PM
Subject: Re: [HACKERS] [PERFORM] Releasing memory during ...

For the subfiles, load the top element of each subfile into a priority
queue. Extract the min element and write it to disk. If the next
value is the same, then the queue does not need to be adjusted.
If the next value in the subfile changes, then adjust it.

Then, when the lowest element in the priority queue changes, adjust
the queue.

Keep doing that until the queue is empty.

You can create all the subfiles in one pass over the data.

You can read all the subfiles, merge them, and write them out in a
second pass (no matter how many of them there are).

The Gotcha with Priority Queues is that their performance depends
entirely on implementation. In naive implementations either Enqueue()
or Dequeue() takes O(n) time, which reduces sorting time to O(n^2).

For a (typically) tiny queue (e.g. if you are merging 100 files) the
algorithm is very unimportant. The sample implementation just uses an
insertion sort to order the heads of the queues.

The best implementations I know of need O(lglgn) time for those
operations, allowing sorting to be done in O(nlglgn) time.
Unfortunately, there's a lot of data manipulation going on in the
process and two IO passes are required to sort any given file.
Priority Queues do not appear to be very "IO friendly".

I know of no sorting performance benchmark contest winner based on
Priority Queues.

That proves it then. On the other hand, why not benchmark it yourself?
It's trivial to implement. The sample code supplied in the link below
is intended only as something to demonstrate the basic principle, but it
is easy enough to make it robust.

Replacement selection is not a good idea any more, since obvious
better ideas should take over. Longer runs are of no value if you do not
have to do multiple merge passes.

Judging from the literature and the contest winners, Replacement
Selection is still a viable and important technique. Besides Priority
Queues, what "obvious better ideas" have you heard of?

I have explained this general technique in the book "C Unleashed",
chapter 13.

Sample code is available on the book's home page.

URL please?

http://users.powernet.co.uk/eton/unleashed/errata/896213nw.zip