No merge sort?

Started by Taral · about 23 years ago · 40 messages · hackers
#1Taral
taral@taral.net

I tried the general list, but got no response. Can anyone here shed some
light on the issue? Do I need to code merge sort into postgresql?

----- Forwarded message from Taral <taral@taral.net> -----

From: Taral <taral@taral.net>
To: pgsql-general@postgresql.org
Date: Wed, 12 Mar 2003 17:54:35 -0600
Subject: [GENERAL] No merge sort?
Message-ID: <20030312235435.GA3007@taral.net>

I have a table "test" that looks like this:

CREATE TABLE test (
id BIGINT,
time INTEGER
);

There is an index:

CREATE INDEX idx ON test(id, time);

The table has been loaded with 2M rows, where time ranges sequentially
from 0 to 1999999 and id is random values from 0 to 49999.

This query:

SELECT * FROM test WHERE id IN (...) AND time > 198000 AND time < 199800
ORDER BY time DESC LIMIT 20;

has an EXPLAIN ANALYZE of:

Limit (cost=3635.28..3635.28 rows=20 width=12) (actual time=22.94..22.96 rows=14 loops=1)
  ->  Sort (cost=3635.28..3635.28 rows=23 width=12) (actual time=22.93..22.93 rows=14 loops=1)
        ->  Index Scan using idx, idx, ..., idx, idx on test (cost=0.00..3634.77 rows=23 width=12) (actual time=1.01..22.10 rows=14 loops=1)
Total runtime: 29.12 msec

This query:

SELECT * FROM test WHERE id IN (...) AND time < 199800 ORDER BY time DESC
LIMIT 20;

has an EXPLAIN ANALYZE of:

Limit (cost=14516.46..14516.46 rows=20 width=12) (actual time=1448.83..1448.86 rows=20 loops=1)
  ->  Sort (cost=14516.46..14516.46 rows=2527 width=12) (actual time=1448.82..1448.83 rows=21 loops=1)
        ->  Index Scan using idx, idx, ..., idx, idx on test (cost=0.00..14373.67 rows=2527 width=12) (actual time=0.14..1437.33 rows=2048 loops=1)
Total runtime: 1454.62 msec

Since the index will output 'time' sorted data for each 'id', why isn't
a merge sort being used here? A merge sort would reduce the execution
time back to 30 ms.
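The merge Taral describes can be sketched outside the planner. This is an
illustration in Python, not PostgreSQL code: each per-id index scan is modeled
as an iterator that yields rows in descending time order, and `heapq.merge`
plays the role of the k-way merge. The function and data names are invented.

```python
import heapq
from itertools import islice

def merge_sorted_scans(scans, limit):
    """Merge per-id index scans that each yield rows in descending
    'time' order, stopping after `limit` rows. heapq.merge keeps one
    pending row per stream in a heap, so the cost is O(limit * log k)
    for k streams -- no full sort, and no row past the LIMIT is read."""
    return list(islice(heapq.merge(*scans, reverse=True), limit))

# Two per-id scans, each already sorted by time DESC, as a btree
# index on (id, time) would deliver them:
scan_a = iter([(1999, 7), (1500, 7), (10, 7)])
scan_b = iter([(1800, 42), (1200, 42)])
top3 = merge_sorted_scans([scan_a, scan_b], 3)
```

The LIMIT benefit Taral mentions shows up directly: the merge is lazy, so only
`limit` rows (plus one buffered row per stream) are ever materialized.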

--
Taral <taral@taral.net>
This message is digitally signed. Please PGP encrypt mail to me.
"Most parents have better things to do with their time than take care of
their children." -- Me

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Taral (#1)
Re: No merge sort?

Taral <taral@taral.net> writes:

Do I need to code merge sort into postgresql?

Seems like a waste of effort to me. I find this example less than
compelling --- the case that could be sped up is quite narrow,
and the potential performance gain not all that large. (A sort
is a sort however you slice it, with O(N log N) runtime...)

Also, the direction we'd likely be going in in future is to merge
multiple indexscans using bitmap techniques, so that the output
ordering of the scans couldn't be counted on anyway.

regards, tom lane

#3Taral
taral@taral.net
In reply to: Tom Lane (#2)
Re: No merge sort?

On Thu, Mar 13, 2003 at 04:28:34PM -0500, Tom Lane wrote:

Seems like a waste of effort to me. I find this example less than
compelling --- the case that could be sped up is quite narrow,
and the potential performance gain not all that large. (A sort
is a sort however you slice it, with O(N log N) runtime...)

Actually, it's O(N) time. The index can produce "time" sorted data for
each "id" in linear time, and the merge sort can merge them in linear
time. Also, the existing system insists on loading _all_ candidate rows
whereas this method can benefit from the limit.

If you don't want to code it, I will. I need it for the livejournal
mysql->postgresql transition. (No, mysql doesn't do it right either.)
But a few pointers to the right places to look in the code would be
helpful.

Also, the direction we'd likely be going in in future is to merge
multiple indexscans using bitmap techniques, so that the output
ordering of the scans couldn't be counted on anyway.

I don't understand this. What do these bitmap techniques do?

--
Taral <taral@taral.net>
This message is digitally signed. Please PGP encrypt mail to me.
"Most parents have better things to do with their time than take care of
their children." -- Me

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Taral (#3)
Re: No merge sort?

Taral <taral@taral.net> writes:

On Thu, Mar 13, 2003 at 04:28:34PM -0500, Tom Lane wrote:

Seems like a waste of effort to me. I find this example less than
compelling --- the case that could be sped up is quite narrow,
and the potential performance gain not all that large. (A sort
is a sort however you slice it, with O(N log N) runtime...)

Actually, it's O(N) time.

Only if you assume a fixed number of input streams.

Also, the direction we'd likely be going in in future is to merge
multiple indexscans using bitmap techniques, so that the output
ordering of the scans couldn't be counted on anyway.

I don't understand this. What do these bitmap techniques do?

The idea is you look at the index to make a list of main-table tuple
positions you are interested in, which you represent compactly as a
compressed bitmap. (There is some finagling needed because PG actually
uses block/line number rather than a pure tuple number to identify
tuples, but you can fake it with a reasonably small amount of overhead.)
Then you can combine multiple index inputs by ANDing or ORing bitmaps
(the OR case applies to your example). Finally, you traverse the heap,
accessing the desired rows in heap-location order. This loses in terms
of producing presorted output --- but it often saves enough in I/O costs
to more than justify doing the sort in memory.
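Tom's bitmap scheme can be modeled in a few lines. This is a toy sketch, not
the PostgreSQL implementation: Python sets stand in for compressed bitmaps,
integer positions stand in for block/line tuple identifiers, and all names are
invented.

```python
def bitmap_heap_scan(posting_lists, heap, combine="or"):
    """Toy model of a bitmap index scan: each posting list is the set
    of heap positions matching one index condition. OR them together
    (the IN-list case) or AND them (conjunctive conditions), then
    fetch rows in heap-location order. The output follows physical
    location, not the index key, so a sort may still be needed."""
    bitmap = set(posting_lists[0])
    for postings in posting_lists[1:]:
        bitmap = bitmap | postings if combine == "or" else bitmap & postings
    return [heap[pos] for pos in sorted(bitmap)]

heap = {0: "row0", 1: "row1", 2: "row2", 3: "row3"}
rows = bitmap_heap_scan([{2, 0}, {3}], heap)  # OR of two index scans
```

The `sorted(bitmap)` step is the point of the technique: the heap is visited
in location order, which is where the I/O savings come from.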

regards, tom lane

#5Taral
taral@taral.net
In reply to: Tom Lane (#4)
Re: No merge sort?

On Thu, Mar 13, 2003 at 10:30:27PM -0500, Tom Lane wrote:

The idea is you look at the index to make a list of main-table tuple
positions you are interested in, which you represent compactly as a
compressed bitmap. (There is some finagling needed because PG actually
uses block/line number rather than a pure tuple number to identify
tuples, but you can fake it with a reasonably small amount of overhead.)
Then you can combine multiple index inputs by ANDing or ORing bitmaps
(the OR case applies to your example). Finally, you traverse the heap,
accessing the desired rows in heap-location order. This loses in terms
of producing presorted output --- but it often saves enough in I/O costs
to more than justify doing the sort in memory.

And it loses bigtime in the case of LIMIT. If the unlimited query
returns 4,000 records and I only want 20, you're retrieving 200x too
much data from disk.

--
Taral <taral@taral.net>
This message is digitally signed. Please PGP encrypt mail to me.
"Most parents have better things to do with their time than take care of
their children." -- Me

#6Taral
taral@taral.net
In reply to: Taral (#1)
No index maximum? (was Re: No merge sort?)

Same setup, different query:

test=> explain select max(time) from test where id = '1';
NOTICE: QUERY PLAN:

Aggregate (cost=5084.67..5084.67 rows=1 width=0)
-> Index Scan using idx on test (cost=0.00..5081.33 rows=1333 width=0)

Since the index is (id, time), why isn't the index being used to
retrieve the maximum value?

On Thu, Mar 13, 2003 at 03:10:49PM -0600, Taral wrote:

I have a table "test" that looks like this:

CREATE TABLE test (
id BIGINT,
time INTEGER
);

There is an index:

CREATE INDEX idx ON test(id, time);

The table has been loaded with 2M rows, where time ranges sequentially
from 0 to 1999999 and id is random values from 0 to 49999.

--
Taral <taral@taral.net>
This message is digitally signed. Please PGP encrypt mail to me.
"Most parents have better things to do with their time than take care of
their children." -- Me

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Taral (#5)
Re: No merge sort?

Taral <taral@taral.net> writes:

On Thu, Mar 13, 2003 at 10:30:27PM -0500, Tom Lane wrote:

The idea is you look at the index to make a list of main-table tuple
positions you are interested in, which you represent compactly as a
compressed bitmap. [snip]

And it loses bigtime in the case of LIMIT. If the unlimited query
returns 4,000 records and I only want 20, you're retrieving 200x too
much data from disk.

Sure. That's why we have a planner that distinguishes between startup
cost and total cost, and interpolates when a LIMIT is involved. But
if this mergesort idea only helps for small-limit cases, that's another
restriction on its scope of usefulness...

regards, tom lane

#8Taral
taral@taral.net
In reply to: Tom Lane (#7)
Re: No merge sort?

On Fri, Mar 14, 2003 at 10:43:30PM -0500, Tom Lane wrote:

Sure. That's why we have a planner that distinguishes between startup
cost and total cost, and interpolates when a LIMIT is involved. But
if this mergesort idea only helps for small-limit cases, that's another
restriction on its scope of usefulness...

I don't think so, since even in the non-limit case it avoids having to
do a full sort if the number of initial streams is finite and small (as
in the case I demonstrated), reducing time complexity to O(N).

--
Taral <taral@taral.net>
This message is digitally signed. Please PGP encrypt mail to me.
"Most parents have better things to do with their time than take care of
their children." -- Me

#9Bruno Wolff III
bruno@wolff.to
In reply to: Taral (#6)
Re: No index maximum? (was Re: No merge sort?)

On Fri, Mar 14, 2003 at 14:19:46 -0600,
Taral <taral@taral.net> wrote:

Same setup, different query:

test=> explain select max(time) from test where id = '1';
NOTICE: QUERY PLAN:

Aggregate (cost=5084.67..5084.67 rows=1 width=0)
-> Index Scan using idx on test (cost=0.00..5081.33 rows=1333 width=0)

Since the index is (id, time), why isn't the index being used to
retrieve the maximum value?

It looks like an index scan is being done.

If the index was on (time, id) instead of (id, time), then you could get
a further speed up by rewriting the query as:
select time from test where id = '1' order by time desc limit 1;

#10Taral
taral@taral.net
In reply to: Bruno Wolff III (#9)
Re: No index maximum? (was Re: No merge sort?)

On Sat, Mar 15, 2003 at 09:23:28AM -0600, Bruno Wolff III wrote:

On Fri, Mar 14, 2003 at 14:19:46 -0600,
Taral <taral@taral.net> wrote:

Same setup, different query:

test=> explain select max(time) from test where id = '1';
NOTICE: QUERY PLAN:

Aggregate (cost=5084.67..5084.67 rows=1 width=0)
-> Index Scan using idx on test (cost=0.00..5081.33 rows=1333 width=0)

Since the index is (id, time), why isn't the index being used to
retrieve the maximum value?

It looks like an index scan is being done.

If the index was on (time, id) instead of (id, time), then you could get
a further speed up by rewriting the query as:
select time from test where id = '1' order by time desc limit 1;

Yes, that's exactly it. It's an index _scan_. It should simply be able
to read the maximum straight from the btree.

--
Taral <taral@taral.net>
This message is digitally signed. Please PGP encrypt mail to me.
"Most parents have better things to do with their time than take care of
their children." -- Me

#11Bruno Wolff III
bruno@wolff.to
In reply to: Taral (#10)
Re: No index maximum? (was Re: No merge sort?)

On Mon, Mar 17, 2003 at 11:23:47 -0600,
Taral <taral@taral.net> wrote:

On Sat, Mar 15, 2003 at 09:23:28AM -0600, Bruno Wolff III wrote:

On Fri, Mar 14, 2003 at 14:19:46 -0600,
Taral <taral@taral.net> wrote:

Same setup, different query:

test=> explain select max(time) from test where id = '1';
NOTICE: QUERY PLAN:

Aggregate (cost=5084.67..5084.67 rows=1 width=0)
-> Index Scan using idx on test (cost=0.00..5081.33 rows=1333 width=0)

Since the index is (id, time), why isn't the index being used to
retrieve the maximum value?

It looks like an index scan is being done.

If the index was on (time, id) instead of (id, time), then you could get
a further speed up by rewriting the query as:
select time from test where id = '1' order by time desc limit 1;

Yes, that's exactly it. It's an index _scan_. It should simply be able
to read the maximum straight from the btree.

max and min don't use indexes. They are generic aggregate functions, and
postgres doesn't have the special knowledge that for those particular
aggregate functions an index can be used. You can get around this by
rewriting the query as I previously indicated.

For more details on why things are this way, search the archives. This
topic comes up a lot.

I was also mistaken about having to switch the index around for this case.
It should work the way you have it (if you rewrite the query).

#12Taral
taral@taral.net
In reply to: Taral (#10)
Re: No index maximum? (was Re: No merge sort?)

On Mon, Mar 17, 2003 at 11:23:47AM -0600, Taral wrote:

Yes, that's exactly it. It's an index _scan_. It should simply be able
to read the maximum straight from the btree.

Still doesn't work, even with the rewritten query. It's still a
Limit(Sort(Index Scan)), with 1333 rows being pulled from the index.

--
Taral <taral@taral.net>
This message is digitally signed. Please PGP encrypt mail to me.
"Most parents have better things to do with their time than take care of
their children." -- Me

#13Ron Peacetree
rjpeace@earthlink.net
In reply to: Taral (#1)
Re: No merge sort?

AFAIK, there are only 3 general purpose internal sorting techniques
that have O(n) behavior:
1= Insertion Sort for "almost sorted" lists. Since "almost sorted" is
a fuzzy concept, let's make the first approximation definition that no
more than n^(1/2) of the elements can be disordered. There are better
definitions in the literature, but I don't remember them off the top
of my head.

2= Sorts from the class of Address Calculation, Counting, or
Distribution Sort. These need to be able to carry out something more
complex than simply a comparison in order to achieve O(n), and
therefore have high constants in their execution. For large enough n
though, these are the performance kings.

3= Straight Radix Sort where you minimize the number of passes by
using a base much greater than two for the radix. With a base of,
say, 2^16, this approach will let you sort 32b keys in 2 passes.
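The pass-minimizing idea can be sketched as follows. This is an illustrative
Python version, not anything from PostgreSQL; it uses a base-256 radix (4
passes over 32-bit keys) for readability, and the function name is invented.

```python
def lsd_radix_sort(keys, key_bytes=4):
    """LSD radix sort of unsigned integers with a base-256 radix:
    one stable bucketing pass per byte, so 32-bit keys take 4 passes,
    each O(n). A base of 2^16 would halve the pass count at the cost
    of 65536 buckets per pass."""
    for shift in range(0, key_bytes * 8, 8):
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)  # stable within bucket
        keys = [k for bucket in buckets for k in bucket]
    return keys
```

Note the stability Ron points to: within each pass, equal bytes keep their
input order, which is what makes multi-pass LSD sorting correct.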

All of the above have potentially nasty trade-offs in comparison to
the standard heavily tuned median-of-three quicksort used by the sort
unix command. Nonetheless, I could see some value in providing all of
these with PostgreSQL (including a decent port of the unix sort routine
for the Win folks). I'll note in passing that quicksort's Achilles'
heel is that it's unstable (all of the rest are stable), which can be
a problem in a DB.

In the general case there are a few papers out there that state you can
sort in O(n) if you can throw O(n^2) space at the problem. That
implies to me that for DBs, we are not going to be able to use O(n)
algorithms for most of our needs.

As for external sorts, everything I've ever read says that some sort
of merge technique is used: balanced, multiway, or polyphase. In all
cases, I've seen comments to the effect that reading some part of the
data into internal buffers, sorting them, and then merging them with
already sorted data is the best practice.

All of this seems to imply that instead of mergesort (which is
stable), PostgreSQL might be better off with the 4 sorts I've listed
plus a viciously efficient merge utility for combining partial results
that do fit into memory into result files on disk that don't.

Or am I crazy?

Ron Peacetree

#14Bruce Momjian
bruce@momjian.us
In reply to: Ron Peacetree (#13)
Re: No merge sort?

"Ron Peacetree" <rjpeace@earthlink.net> writes:

AFAIK, there are only 3 general purpose internal sorting techniques
that have O(n) behavior:

Strictly speaking there are no sorting algorithms that have worst-case time
behaviour better than O(n log n). Period.

The typical kind-of O(n) sorts involve either special-case inputs (almost
sorted), or only work if you ignore some aspect of the input size (radix
sort).

So, for example, for radix sort the log(n) factor comes precisely from having
to make log(n) passes. If you use a base of 2^8 that might go to log(n)/8,
but it's still O(log(n)).

If you assume your input fields are limited to a fixed size then O() notation
loses meaning. You can always sort in linear time by just storing bit flags in
a big vector and then scanning your (fixed-size) vector to read out the
values.
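Greg's bit-vector example, sketched below as an illustration (the function
name is invented). It only works for distinct keys from a known, fixed range,
which is exactly his point about O() notation losing meaning:

```python
def bitvector_sort(values, max_value):
    """'Sort' distinct bounded integers in O(n + max_value): set one
    flag per value, then scan the fixed-size vector in order. Linear
    only because the key range, not just n, is capped."""
    flags = bytearray(max_value + 1)
    for v in values:
        flags[v] = 1
    return [v for v, flag in enumerate(flags) if flag]
```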

However databases cannot assume fixed size inputs. Regardless of whether it's
on a 32 or 64 bit system postgres still has to sort much larger data
structures. floats are typically 64 - 96 bytes, bigints can be arbitrarily
large.

In fact posgres can sort user-defined datatypes as long as they have < and =
operators. (Actually I'm not sure on the precise constraint.)

Oh, and just to add one last fly in the ointment, postgres has to be able to
sort large datasets that don't even fit in memory. That means storing
temporary data on disk and minimizing the times data has to move from disk to
memory and back. Some algorithms are better at that than others.

--
greg

#15Dann Corbit
DCorbit@connx.com
In reply to: Bruce Momjian (#14)
Re: No merge sort?

-----Original Message-----
From: Greg Stark [mailto:gsstark@mit.edu]
Sent: Monday, April 07, 2003 12:36 PM
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] No merge sort?

"Ron Peacetree" <rjpeace@earthlink.net> writes:

AFAIK, there are only 3 general purpose internal sorting techniques
that have O(n) behavior:

Strictly speaking there are no sorting algorithms that have
worst-case time behaviour better than O(n log n). Period.

Provably true if the sort uses comparison. Definitely false for
distribution sorts or distribution sort hybrids.

The typical kind-of O(n) sorts involve either special-case
inputs (almost sorted), or only work if you ignore some
aspect of the input size (radix sort).

There is no need to ignore any aspect of the input with radix sort.
Radix sort can also be generalized in the same way that quicksort can.
The callback function returns the "bucket" of 'radix' significant bits.
So, for unsigned char, you just return the 'kth' character when the
'kth' bucket is requested if the radix is 256 for 8 bits. For an
example of dealing with other sorts of data, see chapter 13 of "C
Unleashed":
http://users.powernet.co.uk/eton/unleashed/

So, for example, for radix sort the log(n) factor comes
precisely from having to have log(n) passes. If you use octal
that might go to log(n)/8 but it's still O(log(n)).

You do not have to make log(n) passes at all. In fact, with LSD radix
sort, all the buckets can be counted in a single pass. Demonstration in
the named book (with source code).

If you assume your input fields are limited to a fixed size
then O() notation loses meaning. You can always sort in
linear time by just storing bit flags in a big vector and
then scanning your (fixed-size) vector to read out the values.

However databases cannot assume fixed size inputs. Regardless
of whether it's on a 32 or 64 bit system postgres still has
to sort much larger data structures. floats are typically 64
- 96 bytes, bigints can be arbitrarily large.

Floats are generally 8 bytes. If you use hfloat on a VAX, it is 16
bytes. That is the largest hardware floating point in common use.
Bigints can be arbitrarily large, but PostgreSQL does not handle
anything larger than 16 byte integers so there is no need to worry about
larger sizes.

A numeric type can be arbitrarily large (though a thousand digits is
pure foolishness). However, these numbers can be sorted by standard
means. A typical scenario would include an MSD radix sort until the
subsets were 200K rows, then Singleton's sort until 50 rows, then
insertion sort or perhaps Shellsort. I write sorting routines for a
database company and millions of rows can be sorted in just a few
seconds.

In fact posgres can sort user-defined datatypes as long as
they have < and = operators. (Actually I'm not sure on the
precise constraint.)

Oh, and just to add one last fly in the ointment, postgres
has to be able to sort large datasets that don't even fit in
memory. That means storing temporary data on disk and
minimizing the times data has to move from disk to memory and
back. Some alogorithms are better at that than others.

A very important advancement in this area is a priority queue based
external sort. It allows the use of the fastest possible internal sort
and a single pass to read, a single pass to merge, and a single pass to
write all of the data. In this regard, it can be enormously faster than
even replacement selection. For replacement selection, the advantage is
that the runs are twice as long as normal, allowing log(n)/2 passes
instead of log(n) passes. However, the priority queue method only has
one read pass, one write pass, and one merge pass.

It works like this:
1. Read blocks of data into memory and sort them.
2. Write the sorted blocks to disk
3. On the merge phase, we create a priority queue of file pointers, and
read the first item from each. The priority queue sorts on the first
item in the file list. When we extract an item from the minimal queue,
we write it to disk. If the item changes when we read the next item
from that file, we adjust the priority. We continue until all the files
are empty. The same book cited previously has a model implementation to
demonstrate the technique. Since disk I/O totally dominates external
sorts, this technique is a titanic win over conventional sorting means.
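The three steps can be sketched with in-memory lists standing in for run
files. This is a toy model of the technique Dann describes, not the book's
implementation; the names are invented.

```python
import heapq

def external_sort(records, run_size):
    """Sketch of run formation plus priority-queue merge: pass 1 reads
    and sorts fixed-size runs, pass 2 would write them out (lists stand
    in for run files here), and the merge pass pulls from a heap keyed
    on each run's current head record."""
    runs = [sorted(records[i:i + run_size])      # in-memory sort per run
            for i in range(0, len(records), run_size)]
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, run_no, pos = heapq.heappop(heap)
        out.append(value)                        # 'write' the minimum
        if pos + 1 < len(runs[run_no]):          # refill from that run
            heapq.heappush(heap, (runs[run_no][pos + 1], run_no, pos + 1))
    return out
```

Each record is read once during run formation and once during the merge,
which is the single-read-pass, single-merge-pass, single-write-pass shape
described above.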

Another simple solution is to memory map the file, but do it in large
multi-page blocks instead of mapping the whole thing.

--
greg

dann



#16Jason M. Felice
jfelice@cronosys.com
In reply to: Bruce Momjian (#14)
Re: No merge sort?

On Mon, Apr 07, 2003 at 03:36:10PM -0400, Greg Stark wrote:

"Ron Peacetree" <rjpeace@earthlink.net> writes:

AFAIK, there are only 3 general purpose internal sorting techniques
that have O(n) behavior:

Strictly speaking there are no sorting algorithms that have worst-case time
behaviour better than O(n log n). Period.

Not true.

http://www.elsewhere.org/jargon/html/entry/bogo-sort.html

-Jay 'Eraserhead' Felice

P.S. <g>

#17Chris Browne
cbbrowne@acm.org
In reply to: Ron Peacetree (#13)
Re: No merge sort?

Ron Peacetree wrote:

AFAIK, there are only 3 general purpose internal sorting techniques
that have O(n) behavior:

Hum? NO "general purpose" sorting technique has O(n) behaviour.

The theoretical best scenario, _in general_, is O(n log n).

Insertion sort is expected to provide O(n^2) behaviour, and radix-like
schemes get arbitrarily memory hungry and have bad pathological results.

All of the above have potentially nasty trade-offs in comparison to
the standard heavily tuned median-of-three quicksort used by the sort
unix command. Nonetheless, I could see some value in providing all of
these with PostgreSQL (including a decent port of the unix sort routine
for the Win folks). I'll note in passing that quicksort's Achilles'
heel is that it's unstable (all of the rest are stable), which can be
a problem in a DB.

Making one sort algorithm work very efficiently is quite likely to be a
lot more effective than frittering away time trying to get some special
cases (that you can't regularly use) to work acceptably.

All of this seems to imply that instead of mergesort (which is
stable), PostgreSQL might be better off with the 4 sorts I've listed
plus a viciously efficient merge utility for combining partial results
that do fit into memory into result files on disk that don't.

Or am I crazy?

More than likely. It is highly likely that it will typically take more
computational effort to figure out that one of the 4 sorts provided
/any/ improvement than any computational effort they would save.

That's a /very/ common problem. There's also a fair chance, seen in
practice, that the action of collecting additional statistics to improve
query optimization will cost more than the savings provided by the
optimizations.
--
(concatenate 'string "cbbrowne" "@acm.org")
http://www3.sympatico.ca/cbbrowne/wp.html
When ever in doubt consult a song. --JT Fletcher

#18Dann Corbit
DCorbit@connx.com
In reply to: Chris Browne (#17)
Re: No merge sort?

-----Original Message-----
From: cbbrowne@cbbrowne.com [mailto:cbbrowne@cbbrowne.com]
Sent: Monday, April 07, 2003 1:20 PM
To: Ron Peacetree
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] No merge sort?

Ron Peacetree wrote:

AFAIK, there are only 3 general purpose internal sorting techniques
that have O(n) behavior:

Hum? NO "general purpose" sorting technique has O(n) behaviour.

The theoretical best scenario, _in general_, is O(n log n).

The ACM has published papers on several techniques that are O(n log log n).
Radix and counting sort can be totally generalized in the same way that
comparison sorts can by creating a callback function.

If I have strings of 50 characters in length, I can count all RADIX
characters in an array of unsigned long counts[RADIX][50] in a single
pass. As the key strings become successively larger, radix sorts fall
flat compared to comparison techniques, but a counting pass can also
determine whether distribution or comparison sorting will be better.

Insertion sort is expected to provide O(n^2) behaviour, and
radix-like schemes get arbitrarily memory hungry and have bad
pathological results.

However, the pathological cases can be detected. A typical pathological
case for real database work is an array where the initial data is
identical for a long distance. This is not unusual in an ORDER BY/GROUP
BY scenario.

All of the above have potentially nasty trade-offs in comparison to
the standard heavily tuned median-of-three quicksort used by the sort
unix command. Nonetheless, I could see some value in providing all of
these with PostgreSQL (including a decent port of the unix sort routine
for the Win folks). I'll note in passing that quicksort's Achilles'
heel is that it's unstable (all of the rest are stable), which can be
a problem in a DB.

A better sort can make your database dominatingly better than others for
some operations. It can be an enormous technical advantage.

Making one sort algorithm work very efficiently is quite
likely to be a lot more effective than frittering away time
trying to get some special cases (that you can't regularly
use) to work acceptably.

I don't agree. In fact, this is never true. A decent sorting routine
usually involves several techniques. Consider the library qsort
routine. Usually, it will use quicksort with several special tests and
a randomized median selection until the set size becomes small enough.
Then, we switch to insertion sort or shellsort.
Or consider the STL and Plauger's library sort routine. It is
introspective sort and is a hybrid of three algorithms (quicksort,
heapsort, and insertion sort).

All of this seems to imply that instead of mergesort (which is
stable), PostgreSQL might be better off with the 4 sorts I've listed
plus a viciously efficient merge utility for combining partial results
that do fit into memory into result files on disk that don't.

Or am I crazy?

More than likely. It is highly likely that it will typically
take more computational effort to figure out that one of the
4 sorts provided /any/ improvement than any computational
effort they would save.

It is highly unlikely that this is true. In fact, every good sorting
routine does *exactly* this.

That's a /very/ common problem. There's also a fair chance,
seen in practice, that the action of collecting additional
statistics to improve query optimization will cost more than
the savings provided by the optimizations.
--
(concatenate 'string "cbbrowne" "@acm.org")
http://www3.sympatico.ca/cbbrowne/wp.html
When ever in doubt
consult a song. --JT Fletcher


#19Dann Corbit
DCorbit@connx.com
In reply to: Dann Corbit (#18)
Re: No merge sort?

-----Original Message-----
From: Jason M. Felice [mailto:jfelice@cronosys.com]
Sent: Monday, April 07, 2003 1:10 PM
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] No merge sort?

On Mon, Apr 07, 2003 at 03:36:10PM -0400, Greg Stark wrote:

"Ron Peacetree" <rjpeace@earthlink.net> writes:

AFAIK, there are only 3 general purpose internal sorting techniques
that have O(n) behavior:

Strictly speaking there are no sorting algorithms that have worst-case
time behaviour better than O(n log n). Period.

Not true.

http://www.elsewhere.org/jargon/html/entry/bogo-sort.html

He said "better than" not "worse than".

For comparison-based sorting it is _provably_ true that you cannot sort
faster than log(n!), which is O(n log n).
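The bound follows from the decision-tree argument: a comparison sort must
distinguish all n! input orderings, so its decision tree has at least n!
leaves and depth at least log2(n!), and Stirling's approximation gives

```latex
\log_2(n!) \;=\; \sum_{k=1}^{n} \log_2 k
          \;\ge\; \sum_{k=\lceil n/2 \rceil}^{n} \log_2 k
          \;\ge\; \frac{n}{2}\log_2\frac{n}{2}
          \;=\; \Theta(n \log n).
```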

#20Bruce Momjian
bruce@momjian.us
In reply to: Dann Corbit (#15)
Re: No merge sort?

floats are typically 64 - 96 bytes, bigints can be arbitrarily large.

Floats are generally 8 bytes.

Er, I meant "bits" there. oops.

I'm still reading the other stuff.

Most of this usually comes down to differences between theory and practice.
The constants often matter a whole lot, and the cache behaviour and memory
usage can matter a whole lot. quicksort after all is actually O(n^2) worst
case...

--
greg

#21Steve Wampler
swampler@noao.edu
In reply to: Dann Corbit (#19)
#22Dann Corbit
DCorbit@connx.com
In reply to: Steve Wampler (#21)
#23Dann Corbit
DCorbit@connx.com
In reply to: Dann Corbit (#22)
#24Bruce Momjian
bruce@momjian.us
In reply to: Dann Corbit (#22)
#25Dann Corbit
DCorbit@connx.com
In reply to: Bruce Momjian (#24)
#26Ron Peacetree
rjpeace@earthlink.net
In reply to: Chris Browne (#17)
#27Ron Peacetree
rjpeace@earthlink.net
In reply to: Chris Browne (#17)
#28Ron Peacetree
rjpeace@earthlink.net
In reply to: Chris Browne (#17)
#29Dann Corbit
DCorbit@connx.com
In reply to: Ron Peacetree (#28)
#30Ron Peacetree
rjpeace@earthlink.net
In reply to: Dann Corbit (#29)
#31Dann Corbit
DCorbit@connx.com
In reply to: Ron Peacetree (#30)
#32Ron Peacetree
rjpeace@earthlink.net
In reply to: Dann Corbit (#31)
#33Dann Corbit
DCorbit@connx.com
In reply to: Ron Peacetree (#32)
#34Ron Peacetree
rjpeace@earthlink.net
In reply to: Dann Corbit (#33)
#35Ron Peacetree
rjpeace@earthlink.net
In reply to: Dann Corbit (#33)
#36Ron Peacetree
rjpeace@earthlink.net
In reply to: Dann Corbit (#33)
#37Ron Peacetree
rjpeace@earthlink.net
In reply to: Dann Corbit (#33)
#38Noname
terr@terralogic.net
In reply to: Ron Peacetree (#37)
#39Chris Browne
cbbrowne@acm.org
In reply to: Ron Peacetree (#34)
#40Dann Corbit
DCorbit@connx.com
In reply to: Chris Browne (#39)