polyphase merge?

Started by Don Marvick · about 17 years ago · 11 messages · hackers
#1 Don Marvick
donmarvick@gmail.com

Dear All,

I apologize if this has been discussed before. I have tried to search the
mailing list and could not find an exact answer.

Currently, Postgres uses Knuth's Algorithm 5.4.2D to perform external merge
sort. IMHO the algorithm is designed for tapes.

Why isn't the simple textbook merge pattern described in Knuth page 365
(Section 5.4.9) used for disk-based systems? The same merge pattern is also
described in Ramakrishnan's textbook, and it is optimal if seek time is not
counted, which of course is not the real-world case.

Best regards,

Don

#2 Don Marvick
donmarvick@gmail.com
In reply to: Don Marvick (#1)
Re: polyphase merge?

Dear All,

Since nobody replied, I will give it a try. I am going to implement the
merge pattern described in Knuth page 365 (Section 5.4.9); essentially it is
as follows:
- create initial runs using replacement selection (basically as in the
current implementation)
- add enough dummy runs of size 0 so that the number of sorted runs minus one
is divisible by k-1 (where k is the merge fan-in)
- repeatedly merge the k smallest runs until only 1 run is left
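For concreteness, the merge pattern above can be sketched as a greedy schedule (a hypothetical Python toy model of the Knuth 5.4.9 pattern, not Postgres code; run sizes stand in for I/O volume):

```python
import heapq

def merge_schedule(run_sizes, k):
    """Greedy k-way merge plan: pad with zero-size dummy runs so that
    (number of runs - 1) is divisible by (k - 1), then repeatedly merge
    the k smallest runs until one remains.  Returns the total volume
    merged, a proxy for I/O cost when seek time is ignored."""
    runs = list(run_sizes)
    # Dummy runs guarantee every merge step has exactly k inputs.
    while (len(runs) - 1) % (k - 1) != 0:
        runs.append(0)
    heapq.heapify(runs)
    total = 0
    while len(runs) > 1:
        merged = sum(heapq.heappop(runs) for _ in range(k))
        total += merged
        heapq.heappush(runs, merged)
    return total
```

For example, runs of sizes [5, 3, 8, 2, 9, 1] with k = 3 get one dummy run and a total merged volume of 42.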

I am new to postgresql, hence any advice would be appreciated.

Can anybody give me some advice on how it can be done?
1. How should a run be represented? Should I reuse the tape mechanism (e.g.
one tape per run), or should I use a temp buffer?

2. How should memory be allocated? I assume I would divide the memory equally
among the k runs, so each run gets M/k memory and each read of a run would be
M/k bytes.

3. Prefetching. We can precompute the sequence of run blocks read during the
entire merge. Based on this, we know which blocks of which run to prefetch at
any point in time.

4. Is it possible to perform read/write I/O while the merge is being
performed, so that we overlap I/O and computation?

5. Are there any other issues that need consideration?

Best regards,

Don


#3 Simon Riggs
simon@2ndQuadrant.com
In reply to: Don Marvick (#2)
Re: polyphase merge?

On Wed, 2009-02-04 at 14:42 +0800, Don Marvick wrote:

5. Are there any other issues that need consideration?

Most attempts to improve sorting further have fallen to nothing because
of the lack of repeatable test results. In particular, coming up with
test cases *after* writing the code is a good way to get lost in
discussions and have your work passed over.

The best starting point is to collect a number of test cases, both
typical and extreme cases. Do some research to show that "typical" cases
really are that and be prepared for some expressions of reasonable
doubt. Publish that data so test results can be objectively verified and
then measure those cases on existing code, with various settings of
tunable parameters.

Then it will be a simple matter to prove your changes are effective in
target cases without damaging other cases. We would also want the
changes to work automatically without additional tunable parameters.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#4 Greg Stark
stark@enterprisedb.com
In reply to: Simon Riggs (#3)
Re: polyphase merge?

Is this basically the same as our current algorithm but without
multiplexing the tapes onto single files? I have been wondering
whether we multiplex the tapes any better than filesystems can lay out
separate files actually.

Or is there something else that's different here?

--
greg

#5 Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Stark (#4)
Re: polyphase merge?

On Wed, 2009-02-04 at 10:55 +0000, Greg Stark wrote:

Is this basically the same as our current algorithm but without
multiplexing the tapes onto single files? I have been wondering
whether we multiplex the tapes any better than filesystems can lay out
separate files actually.

I don't think you'll be able to do that more efficiently than we already
do. Read the top of tuplesort.c

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#4)
Re: polyphase merge?

Greg Stark <stark@enterprisedb.com> writes:

Is this basically the same as our current algorithm but without
multiplexing the tapes onto single files? I have been wondering
whether we multiplex the tapes any better than filesystems can lay out
separate files actually.

The reason for the multiplexing is so that space can get re-used
quickly. If each tape were represented as a separate file, there would
be no way to release blocks as they're read; you could only give back
the whole file after reaching end of tape. Which would at least double
the amount of disk space needed to sort X amount of data. (It's
actually even worse, more like 4X, though the multiplier might depend on
the number of "tapes" --- I don't recall the details anymore.)
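A toy simulation (hypothetical Python, not the actual logtape.c logic) illustrates the space-reuse point:

```python
def peak_blocks(run_lengths, recycle):
    """Toy model of peak disk usage (in blocks) while merging several
    runs into one output run.  With recycle=True (multiplexed tapes) an
    input block is freed as soon as it is read; with recycle=False (one
    file per run) a run's blocks are freed only when the whole file has
    been consumed."""
    remaining = list(run_lengths)
    used = sum(run_lengths)   # input blocks currently on disk
    out = 0                   # output blocks written so far
    peak = used
    # Consume inputs round-robin; each block read emits one output block.
    while any(remaining):
        for i in range(len(remaining)):
            if remaining[i] == 0:
                continue
            remaining[i] -= 1
            out += 1
            if recycle:
                used -= 1                    # freed immediately
            elif remaining[i] == 0:
                used -= run_lengths[i]       # whole file freed at EOF
            peak = max(peak, used + out)
    return peak
```

Merging two 4-block runs peaks at 8 blocks with recycling but 14 without, and the gap approaches 2X as runs grow (write buffering and the multi-pass effects mentioned above are ignored).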

The penalty we pay is that in the later merge passes, the blocks
representing a single tape aren't very well ordered.

It might be interesting to think about some compromise that wastes a
little more space in order to get better sequentiality of disk access.
It'd be easy to do if we were willing to accept a 2X space penalty,
but I'm not sure if that would fly or not. It definitely *wasn't*
acceptable to the community a few years ago when the current code was
written. Disks have gotten bigger since then, but so have the problems
people want to solve.

regards, tom lane

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Don Marvick (#2)
Re: polyphase merge?

Don Marvick <donmarvick@gmail.com> writes:

Since nobody replied, I would give it a try. I am going to implement the
merge pattern described in Knuth Page 365 (5.4.9), essentially it is as
follow:
- create initial runs using replacement selection (basically this is as in
the current implementation)
- add enough dummy runs of size 0 so that the number of sorted run minus one
can be divided by k-1 (k is merge fan in)
- repetitively merge k smallest runs until only 1 run left

How is this better than, or even different from, what is there now?

regards, tom lane

#8 Gregory Stark
stark@enterprisedb.com
In reply to: Simon Riggs (#5)
Re: polyphase merge?

Simon Riggs <simon@2ndQuadrant.com> writes:

On Wed, 2009-02-04 at 10:55 +0000, Greg Stark wrote:

Is this basically the same as our current algorithm but without
multiplexing the tapes onto single files? I have been wondering
whether we multiplex the tapes any better than filesystems can lay out
separate files actually.

I don't think you'll be able to do that more efficiently than we already
do. Read the top of tuplesort.c

Huh?

The question I posed was whether we do it any better than filesystems do, not
whether we could do a better job. If filesystems can do as good a job then we
might as well create separate files for each tape and let the filesystem
decide where to allocate space. That would mean we could massively simplify
logtape.c and just create a separate file for every tape.

Possible reasons that filesystems could do better than us are that they have
access to more information about the actual storage layout on disk, that they
have more information about hardware characteristics, and that it would give
the filesystem cache a better idea of the access pattern, which logtape.c
might currently be obscuring.

On the other hand possible reasons why filesystems would suck at this include
creating and deleting new files being a slow or locking operation on many
filesystems, and dealing with directories of large numbers of files being
poorly optimized on others.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

#9 Don Marvick
donmarvick@gmail.com
In reply to: Simon Riggs (#3)
Re: polyphase merge?

Good point. I will note this issue. Thanks for the advice :).

Regards,

Don


#10 Don Marvick
donmarvick@gmail.com
In reply to: Tom Lane (#7)
Re: polyphase merge?

This is what I understand of the existing implementation:
- fix the merge fan-in F, and the number of tapes = F+1
- for each sorted run generated, a formula determines which tape it belongs
to
- the sorted run is appended to the end of that tape
- once all sorted runs have been assigned to tapes, there is always one empty
tape, called the output tape
- add dummy runs to each tape if necessary
- merge the input tapes, writing the result to the output tape, until one of
the input tapes is empty
- repeat this process until one sorted run remains

I think the main difference is that at each step, the current implementation
does not merge the smallest k runs; it merges from the beginning of each
tape. For one pass, this makes no difference. From two passes onwards there
are some differences. In complex queries, where memory is consumed by other
operators, more passes are needed and the differences become larger. This is
the only improvement that can be achieved. Let me know if this does not make
sense.
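The claimed difference can be illustrated with a toy comparison (hypothetical Python; volume_in_order is only a crude stand-in for merging from the front of each tape, not the exact polyphase schedule):

```python
import heapq

def volume_greedy(sizes, k):
    """Total volume merged when always merging the k smallest runs,
    after padding with zero-size dummy runs (the Knuth 5.4.9 pattern)."""
    runs = list(sizes)
    while (len(runs) - 1) % (k - 1) != 0:
        runs.append(0)
    heapq.heapify(runs)
    total = 0
    while len(runs) > 1:
        m = sum(heapq.heappop(runs) for _ in range(k))
        total += m
        heapq.heappush(runs, m)
    return total

def volume_in_order(sizes, k):
    """Total volume merged when runs are merged k at a time in the order
    they were produced, ignoring run sizes."""
    runs = list(sizes)
    total = 0
    while len(runs) > 1:
        m = sum(runs[:k])
        total += m
        runs = runs[k:] + [m]
    return total
```

With runs [9, 1, 1, 1, 9, 9] and k = 3, the greedy plan merges 44 units of data versus 60 for the in-order plan; when everything fits in a single merge pass the two coincide.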

Regarding disk layout, I am open to suggestion whether we should reuse the
tape module or not.


#11 Don Marvick
donmarvick@gmail.com
In reply to: Gregory Stark (#8)
Re: polyphase merge?

Hmm, this is probably worth trying. Has anybody tried this before? If not, I
am interested in giving it a try.

Regards,

Don
