Index Page Split logging

Started by Simon Riggs about 18 years ago, 19 messages
#1Simon Riggs
simon@2ndquadrant.com

When we split an index page we perform a multi-block operation that is
both fairly expensive and complex to reconstruct should we crash partway
through.

If we could log *only* the insert that caused the split, rather than the
split itself, we would avoid that situation entirely. This would then
mean that the recovery code would resolve the split by performing a full
logical split rather than replaying pieces of the original physical
split. Doing that would remove a ton of complexity, as well as reducing
log volumes.
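
To make the contrast concrete, here is an illustrative-only sketch (these are
not the real xl_btree WAL records, and every name here is hypothetical) of
roughly what a physical split record has to describe versus what the proposed
record would carry:

/* Illustrative only: NOT the real xl_btree_* WAL records, just a
 * self-contained contrast between the two approaches. */
#include <stdio.h>

/* Roughly the information a physical split record must describe so that
 * replay can rebuild the exact page layout. */
typedef struct PhysicalSplitSketch
{
    unsigned left_block;
    unsigned right_block;       /* newly allocated right sibling          */
    unsigned parent_block;      /* page that receives the new downlink    */
    unsigned nmoved_items;      /* items moved to the right page ...      */
    /* ... the moved items and the new downlink tuple would follow        */
} PhysicalSplitSketch;

/* What the proposal would log instead: only the insert that caused the
 * split.  Replay re-runs a normal insertion, which performs the split
 * itself (logically) if the target page turns out to be full. */
typedef struct LogicalInsertSketch
{
    unsigned tuple_len;
    /* ... the inserted index tuple would follow                          */
} LogicalInsertSketch;

int main(void)
{
    printf("physical sketch: %zu header bytes, logical sketch: %zu header bytes\n",
           sizeof(PhysicalSplitSketch), sizeof(LogicalInsertSketch));
    return 0;
}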

We would need to ensure that the right-hand page of the split reached
disk before the left-hand page. If a crash occurs when only the right
hand page has reached disk then there would be no link (on disk) to it
and so it would be ignored. We would need an orphaned page detection
mechanism to allow the page to be VACUUMed sometime in the future. There
would also be some sort of ordering required in the buffer manager, so
that pages which must be written last are kept pinned until the first
page is written. That sounds like it is fairly straightforward and it
would allow a generic mechanism that worked for all index splits, rather
than requiring individual code for each rmgr.
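
A minimal, self-contained sketch of that ordering constraint, with toy pages
and hypothetical names (this is not PostgreSQL code): if the right-hand page
is flushed first, a crash between the two writes leaves only an orphan page
on disk, never a link pointing at a page that was never written.

#include <stdio.h>
#include <stdbool.h>

typedef struct Page
{
    int  pageno;
    int  right_link;    /* 0 means "no right sibling"        */
    bool on_disk;
} Page;

static void write_page(Page *p)
{
    p->on_disk = true;
    printf("wrote page %d\n", p->pageno);
}

int main(void)
{
    Page left  = { .pageno = 1, .right_link = 0, .on_disk = true };
    Page right = { .pageno = 2, .right_link = 0, .on_disk = false };

    /* Split in memory: the upper half of 'left' moves to 'right',
     * and 'left' now points at it. */
    left.right_link = right.pageno;

    /* Proposed ordering: flush the new right page first ...            */
    write_page(&right);
    /* ... a crash here leaves page 2 unreferenced on disk (an orphan a
     * later VACUUM-like pass would reclaim), but no reader can follow a
     * link to a page that never made it to disk.                       */
    write_page(&left);
    return 0;
}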

ISTM that would require Direct I/O to perform physical writes in a
specific order, rather than just issue the writes and fsync. Which
probably kills it for now, even assuming you followed me on every point
up till now...

So I'm mentioning this really to get the idea out there and see if
anybody has any bright ideas, rather than as a well-formed proposal for
immediate implementation.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#1)
Re: Index Page Split logging

Simon Riggs <simon@2ndquadrant.com> writes:

If we could log *only* the insert that caused the split, rather than the
split itself, we would avoid that situation entirely.

How are you going to avoid the need to run user-defined functions
(specifically, the btree comparison functions) during replay?

regards, tom lane
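
A toy, self-contained illustration of the point being raised (hypothetical
names; in PostgreSQL the comparator would be an operator-class support
function reached through fmgr): locating the page an inserted key belongs on
requires calling the datatype's comparison function at every step of the
descent, which is exactly what a recovery-time logical insert would have to do.

#include <stdio.h>

typedef int (*cmp_fn)(int a, int b);   /* stand-in for the opclass comparator */

typedef struct Node
{
    int sep;            /* separator key                      */
    int left_child;     /* page holding keys below 'sep'      */
    int right_child;    /* page holding keys at or above 'sep' */
} Node;

static int int_cmp(int a, int b)
{
    return (a > b) - (a < b);
}

/* One descent step through a tiny, hard-coded "tree": even this single
 * step cannot be taken without calling cmp(). */
static int find_leaf(const Node *root, int key, cmp_fn cmp)
{
    return (cmp(key, root->sep) < 0) ? root->left_child : root->right_child;
}

int main(void)
{
    Node root = { .sep = 100, .left_child = 2, .right_child = 3 };

    printf("key 42 belongs on page %d\n", find_leaf(&root, 42, int_cmp));
    return 0;
}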

#3Manolo _
mac_man2005@hotmail.it
In reply to: Tom Lane (#2)
Implementing Sorting Refinements

Hi to all.

This mail is aimed at asking for some suggestions on how to approach PostgreSQL (PG) development in order to implement a refinement to the PG source code. I'll briefly summarize the idea of the "2-Way Replacement Selection" (2WRS) refinement, just in case. If you already remember what 2WRS is, you can jump directly to the IMPLEMENTATION ISSUES part of this mail.

2WRS.
Refinement of the Replacement Selection (RS) technique currently used by PG in src/backend/utils/sort/tuplesort.c.
The 2WRS uses two heaps instead of just one in order to create the current (logical) run. Here are some fundamental points of the 2WRS technique:
- 'qsort' the initial unsorted 'memtuples' array
- divide the 'memtuples' array into two parts, each of which will be managed as a heap
- one of the heaps will arrange its elements in ascending order, while the other heap in descending order
- each heap will spill its current root into its corresponding run (i.e. one run per heap), so we are actually creating 2 physical current runs
- those 2 current physical runs could "theoretically" be merged into the same logical run; in fact we can make 'mergesort' believe they belong to the same physical run. That reduces the number of comparisons 'mergesort' has to do at each merge step (which means less seek delay on mass storage). We can also expect the average length of the logical runs produced by 2WRS to be greater than the average length produced by RS (again less seek delay: the longer each run, the fewer final runs produced, and hence fewer comparisons at each 'mergesort' step). See the sketch below.
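
A minimal, self-contained sketch of the two-heaps-in-one-array layout
described above (plain ints instead of SortTuples, and hypothetical names
such as TwoHeap; this is not tuplesort.c code). The two heaps share one
array with their roots facing the centre: a min-heap emits an ascending run
while a max-heap emits a descending run.

#include <stdio.h>

typedef struct TwoHeap
{
    int *a;      /* the shared "memtuples" array                     */
    int  base;   /* physical index of this heap's root               */
    int  dir;    /* +1: grows toward higher indexes, -1: toward lower */
    int  n;      /* number of elements currently in the heap         */
    int  sign;   /* +1 = min-heap (ascending), -1 = max-heap         */
} TwoHeap;

static int phys(const TwoHeap *h, int i) { return h->base + h->dir * i; }

static void heap_insert(TwoHeap *h, int v)
{
    int i = h->n++;
    h->a[phys(h, i)] = v;
    while (i > 0)
    {
        int p = (i - 1) / 2;
        if (h->sign * (h->a[phys(h, i)] - h->a[phys(h, p)]) >= 0)
            break;
        int tmp = h->a[phys(h, i)];
        h->a[phys(h, i)] = h->a[phys(h, p)];
        h->a[phys(h, p)] = tmp;
        i = p;
    }
}

static int heap_pop(TwoHeap *h)     /* remove and return the root */
{
    int top = h->a[phys(h, 0)];
    h->a[phys(h, 0)] = h->a[phys(h, --h->n)];
    int i = 0;
    for (;;)
    {
        int l = 2 * i + 1, r = l + 1, best = i;
        if (l < h->n && h->sign * (h->a[phys(h, l)] - h->a[phys(h, best)]) < 0) best = l;
        if (r < h->n && h->sign * (h->a[phys(h, r)] - h->a[phys(h, best)]) < 0) best = r;
        if (best == i)
            break;
        int tmp = h->a[phys(h, i)];
        h->a[phys(h, i)] = h->a[phys(h, best)];
        h->a[phys(h, best)] = tmp;
        i = best;
    }
    return top;
}

int main(void)
{
    int memtuples[8];
    /* "Root to root": the ascending heap's root sits at index 4 growing
     * right, the descending heap's root at index 3 growing left.       */
    TwoHeap up   = { memtuples, 4, +1, 0, +1 };
    TwoHeap down = { memtuples, 3, -1, 0, -1 };

    /* In real 2WRS the partition would come from the qsorted array; here
     * the input is just split in half for illustration.                 */
    int input[8] = { 5, 9, 1, 7, 3, 8, 2, 6 };
    for (int i = 0; i < 4; i++) heap_insert(&up, input[i]);
    for (int i = 4; i < 8; i++) heap_insert(&down, input[i]);

    printf("ascending run: ");
    while (up.n > 0)   printf("%d ", heap_pop(&up));
    printf("\ndescending run: ");
    while (down.n > 0) printf("%d ", heap_pop(&down));
    printf("\n");
    return 0;
}

Parameterising the heap direction and ordering like this is only one way to
sketch it; the "Heap functions" point below instead proposes duplicating the
functions for the two heap kinds.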

IMPLEMENTATION ISSUES.
Where to place those heaps?
1) I think that both heaps could be arranged in the same 'memtuples' array. That easily allows resizing those heaps later according to their effective use or to some heuristic, without reallocating memory.

How to arrange those heaps?
2a) It's convenient to arrange those heaps "root to root", that is, with their roots toward the center of 'memtuples' (so we can say they lie "face to face", or "root to root" as said before) while their leaves lie towards the extreme indexes of the 'memtuples' array (that is, the last leaf of one heap will lie at index 0 and the last leaf of the other heap at index memtupsize-1). This arrangement prevents the elements of the two physical runs associated with the same current logical run from overlapping.
PRO: once we qsort 'memtuples' and divide it into 2 parts we already have those two heaps; there is no need to build them.
CONTRA: ???

2b) As in 2a), but arranging the heaps "leaf to leaf", that is, their roots will lie at the extreme indexes of 'memtuples' while their leaves point towards the center of the array.
We could even start building the heaps as soon as we get the initial elements, instead of qsorting the whole 'memtuples' array.
Any PRO/CONTRA compared to 2a)?

Current run numbers
I think I should duplicate the 'int currentRun' variable in the Tuplesortstate struct, replacing it with 'int currentRunUP' and 'int currentRunDOWN' variables in order to distinguish the two physical runs associated with those 2 heaps. In this case I will give a run number of (max{currentRunUP, currentRunDOWN} + 1) to elements not belonging to the current logical run. I suppose there is no need to touch the 'long availMem' or 'long allowedMem' variables, nor any others.

Heap functions
I will duplicate all the heap management functions in order to adapt them to the kind of heap they should be applied to (for example, the tuplesort_heap_siftup function should be replaced with tuplesort_heap_siftupUP and tuplesort_heap_siftupDOWN functions).

Merge Plan
This technique would use a sort of "merge plan" to instruct mergesort on how to use those physical runs. Mergesort should consider "odd runs" before "even runs": that is, for example, mergesort should start merging the runs with run numbers 1, 3, 5, 7, ... and when run number X terminates start considering run number X+1. Obviously that doesn't strictly need a merge plan, but I remember someone else (such as Simon Riggs) was interested in sorting improvements, so it would be good to know whether I should follow any conventions or parameters in order to possibly create that merge plan.
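
A small, self-contained sketch of only the reading order described in that
paragraph (hypothetical names and data; how the two physical runs are laid
out on disk is left aside): a logical run is physical run X followed by
physical run X+1, and the reader switches from one to the other when the
first is exhausted.

#include <stdio.h>
#include <stdbool.h>

typedef struct LogicalRun
{
    const int *odd;    /* physical run X (odd-numbered)   */
    int        nodd;
    const int *even;   /* physical run X+1                */
    int        neven;
    int        pos;    /* elements consumed so far        */
} LogicalRun;

static bool logical_run_next(LogicalRun *lr, int *out)
{
    if (lr->pos < lr->nodd)
        *out = lr->odd[lr->pos];
    else if (lr->pos < lr->nodd + lr->neven)
        *out = lr->even[lr->pos - lr->nodd];
    else
        return false;
    lr->pos++;
    return true;
}

int main(void)
{
    /* physical runs 1 and 2 presented to the merge as one logical run */
    const int run1[] = { 1, 4, 7 };
    const int run2[] = { 9, 12 };
    LogicalRun lr = { run1, 3, run2, 2, 0 };

    int v;
    printf("logical run:");
    while (logical_run_next(&lr, &v))
        printf(" %d", v);
    printf("\n");
    return 0;
}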

DEVELOPMENT CONTEXT
I preferred to use the "last stable release" at the moment, that is 8.2.

Any comments/suggestions/advice?

Thanks for your attention and for your time.
Regards, Manolo.

#4Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: Index Page Split logging

On Tue, 2008-01-01 at 14:02 -0500, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

If we could log *only* the insert that caused the split, rather than the
split itself, we would avoid that situation entirely.

How are you going to avoid the need to run user-defined functions
(specifically, the btree comparison functions) during replay?

Seems like a good objection. Just exercising my lateral thought muscles.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#5Martijn van Oosterhout
kleptog@svana.org
In reply to: Simon Riggs (#4)
Re: Index Page Split logging

On Tue, Jan 01, 2008 at 08:55:58PM +0000, Simon Riggs wrote:

On Tue, 2008-01-01 at 14:02 -0500, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

If we could log *only* the insert that caused the split, rather than the
split itself, we would avoid that situation entirely.

How are you going to avoid the need to run user-defined functions
(specifically, the btree comparison functions) during replay?

Seems like a good objection. Just exercising my lateral thought muscles.

It seems to me you should be able to manage an intermediate version
where (in, for example, GiST) you store the output of picksplit, i.e. you
describe your split as: items A, B and C went to the left and the rest went
to the right page. This would allow you to reconstruct the split
completely without actually having to write all the data.

The only thing I'm not sure about is whether it's easy to get the
relevant information to make this efficient.
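
A self-contained sketch of that intermediate idea (toy pages and hypothetical
names; this is not GiST code): the record stores only which item offsets
stayed on the left page, and replay redistributes the items without
re-running picksplit or any other user-defined function.

#include <stdio.h>
#include <stdbool.h>

#define MAX_ITEMS 8

typedef struct Page { int items[MAX_ITEMS]; int nitems; } Page;

/* Hypothetical WAL payload: just the offsets that stay on the left page. */
typedef struct SplitRecord { int left_offsets[MAX_ITEMS]; int nleft; } SplitRecord;

static bool stays_left(const SplitRecord *rec, int off)
{
    for (int i = 0; i < rec->nleft; i++)
        if (rec->left_offsets[i] == off)
            return true;
    return false;
}

/* Replay: rebuild both pages from the pre-split page plus the record. */
static void replay_split(const Page *orig, const SplitRecord *rec,
                         Page *left, Page *right)
{
    left->nitems = right->nitems = 0;
    for (int off = 0; off < orig->nitems; off++)
    {
        Page *dest = stays_left(rec, off) ? left : right;
        dest->items[dest->nitems++] = orig->items[off];
    }
}

int main(void)
{
    Page orig = { { 10, 20, 30, 40, 50 }, 5 };
    SplitRecord rec = { { 0, 1, 2 }, 3 };      /* "items A, B, C went left" */
    Page left, right;

    replay_split(&orig, &rec, &left, &right);
    printf("left has %d items, right has %d items\n", left.nitems, right.nitems);
    return 0;
}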

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy

#6Simon Riggs
simon@2ndquadrant.com
In reply to: Martijn van Oosterhout (#5)
Re: Index Page Split logging

On Tue, 2008-01-01 at 22:20 +0100, Martijn van Oosterhout wrote:

On Tue, Jan 01, 2008 at 08:55:58PM +0000, Simon Riggs wrote:

On Tue, 2008-01-01 at 14:02 -0500, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

If we could log *only* the insert that caused the split, rather than the
split itself, we would avoid that situation entirely.

How are you going to avoid the need to run user-defined functions
(specifically, the btree comparison functions) during replay?

Seems like a good objection. Just exercising my lateral thought muscles.

It seems to me you should be able to manage an intermediate version
where (in for example GiST) you store the output of picksplit. i.e. you
describe your split as: items A,B,C went to the left and the rest went
to the right page. This would allow you to reconstruct the split
completely without actually having to write all the data.

Yes, but you may have to handle recursive splits in the parent, so the
problem is not fully solved by that manoeuvre.

Perhaps we could do this only in cases where the parent is not split?

Hmm, starting to sound too far-fetched for me now.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#7Gokulakannan Somasundaram
gokul007@gmail.com
In reply to: Simon Riggs (#4)
Re: Index Page Split logging

On Jan 2, 2008 2:25 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, 2008-01-01 at 14:02 -0500, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

If we could log *only* the insert that caused the split, rather than the
split itself, we would avoid that situation entirely.

How are you going to avoid the need to run user-defined functions
(specifically, the btree comparison functions) during replay?

Seems like a good objection. Just exercising my lateral thought muscles.

Can this be explained in more detail???

Thanks,
Gokul.

#8Martijn van Oosterhout
kleptog@svana.org
In reply to: Gokulakannan Somasundaram (#7)
Re: Index Page Split logging

On Wed, Jan 02, 2008 at 02:49:35PM +0530, Gokulakannan Somasundaram wrote:

On Jan 2, 2008 2:25 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, 2008-01-01 at 14:02 -0500, Tom Lane wrote:

How are you going to avoid the need to run user-defined functions
(specifically, the btree comparison functions) during replay?

Seems like a good objection. Just exercising my lateral thought muscles.

Can this be explained in more detail???

If the goal is to only store the insert, then we need to determine
during recovery which page the record needs to be added to. To do this
you need to go through the index, which can only be done by calling
user-defined functions.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy

#9Gokulakannan Somasundaram
gokul007@gmail.com
In reply to: Martijn van Oosterhout (#8)
Re: Index Page Split logging

On Jan 2, 2008 3:35 PM, Martijn van Oosterhout <kleptog@svana.org> wrote:

If the goal is to only store the insert, then we need to determine
during recovery which page the record needs to be added to. To do this
you need to go through the index, which can only be done by calling
user-defined functions.

Correct me if I am wrong.
User-defined functions act as comparison functions in GiST. But when do they
act as comparison functions in a b-tree? I am not able to understand exactly
that part.

Thanks for the explanation.

--
Thanks,
Gokul.

#10Martijn van Oosterhout
kleptog@svana.org
In reply to: Gokulakannan Somasundaram (#9)
Re: Index Page Split logging

On Wed, Jan 02, 2008 at 04:04:48PM +0530, Gokulakannan Somasundaram wrote:

On Jan 2, 2008 3:35 PM, Martijn van Oosterhout <kleptog@svana.org> wrote:

If the goal is to only store the insert, then we need to determine
during recovery which page the record needs to be added to. To do this
you need to go through the index, which can only be done by calling
user-defined functions.

Correct me if I am wrong.
User-defined functions act as comparison functions in GiST. But when do they
act as comparison functions in a b-tree? I am not able to understand exactly
that part.

All indexes are done by user-defined functions, even b-trees. People can
make their own b-tree indexes by defining an operator class. Note that
"user-defined" in this case means anything called via the fmgr
interface.

For reference, hash indexes are defined by user-defined functions as
well... It's all about the interaction between Index AMs and operator
classes.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy

#11Gokulakannan Somasundaram
gokul007@gmail.com
In reply to: Martijn van Oosterhout (#10)
Re: Index Page Split logging

All indexes are done by user-defined functions, even b-trees. People can
make their own b-tree indexes by defining an operator class. Note that
"user-defined" in this case means anything called via the fmgr
interface.

Again, I think I have one more wrong understanding. My understanding is
that we are discussing user-defined functions because they might actually
be mutable functions, but the user might have classified them as
immutable. This might cause some problems while replaying the log. In the
case of hash indexes, if the hash function is mutable, then the user has a
corrupted index.
Is there some other problem, caused by user-defined functions, that would
stall recovery in the proposed idea?

In our standard b-tree (with no user-defined operator classes), there
shouldn't be any problem with replaying, right?

Again, shouldn't we call a b-tree with user-defined op-classes a
gist-btree? If not, what is the difference between the two?

Again thanks for the explanation.

--
Thanks,
Gokul.

#12Martijn van Oosterhout
kleptog@svana.org
In reply to: Gokulakannan Somasundaram (#11)
Re: Index Page Split logging

On Wed, Jan 02, 2008 at 04:46:11PM +0530, Gokulakannan Somasundaram wrote:

All indexes are done by user-defined functions, even b-trees. People can
make their own b-tree indexes by defining an operator class. Note that
"user-defined" in this case means anything called via the fmgr
interface.

Again, I think I have one more wrong understanding. My understanding is
that we are discussing user-defined functions because they might actually
be mutable functions, but the user might have classified them as
immutable.

This is where it gets a bit beyond my depth, so someone may have to
correct me. The point is that during recovery the system is not yet
fully running and not everything works yet. For example, what happens
if the system crashed halfway through updating a page in pg_proc and
that page needs to be recovered from WAL? Yet to insert into the index
you need to be able to read pg_am, pg_amproc, pg_proc at least,
probably more.

The point being, you can't rely on anything except WAL during recovery.

In our standard b-tree (with no user-defined operator classes), there
shouldn't be any problem with replaying, right?

Even inserting into our "standard b-tree" involves looking up the
operator class and associated functions and using the fmgr interface.
There is no difference in the system between the builtin operator
classes and "user defined" ones.

Again, shouldn't we call a b-tree with user-defined op-classes a
gist-btree? If not, what is the difference between the two?

gist-btree is a nice example but IIRC it takes more disk space and can't
handle uniqueness.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy

#13Simon Riggs
simon@2ndquadrant.com
In reply to: Martijn van Oosterhout (#12)
Re: Index Page Split logging

On Wed, 2008-01-02 at 13:54 +0100, Martijn van Oosterhout wrote:

On Wed, Jan 02, 2008 at 04:46:11PM +0530, Gokulakannan Somasundaram wrote:

All indexes are done by user-defined functions, even b-trees. People can
make their own b-tree indexes by defining an operator class. Note that
"user-defined" in this case means anything called via the fmgr
interface.

Again, I think I have one more wrong understanding. My understanding is
that we are discussing user-defined functions because they might actually
be mutable functions, but the user might have classified them as
immutable.

This is where it gets a bit beyond my depth, so someone may have to
correct me. The point is that during recovery the system is not yet
fully running and not everything works yet. For example, what happens
if the system crashed halfway through updating a page in pg_proc and
that page needs to be recovered from WAL? Yet to insert into the index
you need to be able to read pg_am, pg_amproc, pg_proc at least,
probably more.

That's right; shame I forgot this before I started the thread...

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#14Gokulakannan Somasundaram
gokul007@gmail.com
In reply to: Martijn van Oosterhout (#12)
Re: Index Page Split logging

On Jan 2, 2008 6:24 PM, Martijn van Oosterhout <kleptog@svana.org> wrote:

On Wed, Jan 02, 2008 at 04:46:11PM +0530, Gokulakannan Somasundaram wrote:

All indexes are done by user-defined functions, even b-trees. People can
make their own b-tree indexes by defining an operator class. Note that
"user-defined" in this case means anything called via the fmgr
interface.

Again, I think I have one more wrong understanding. My understanding is
that we are discussing user-defined functions because they might actually
be mutable functions, but the user might have classified them as
immutable.

This is where it gets a bit beyond my depth, so someone may have to
correct me. The point is that during recovery the system is not yet
fully running and not everything works yet. For example, what happens
if the system crashed halfway through updating a page in pg_proc and
that page needs to be recovered from WAL? Yet to insert into the index
you need to be able to read pg_am, pg_amproc, pg_proc at least,
probably more.

The point being, you can't rely on anything except WAL during recovery.

Thanks a lot for the nice explanation. That was something new for me... :((

--
Thanks,
Gokul.

#15Martijn van Oosterhout
kleptog@svana.org
In reply to: Simon Riggs (#13)
Re: Index Page Split logging

On Wed, Jan 02, 2008 at 01:17:03PM +0000, Simon Riggs wrote:

That's right; shame I forgot this before I started the thread...

Actually, I think your idea has merit, it's just not as easy as
originally thought. All splits, including multiple-level splits, can be
described as a sequence of "split page X with items {S} going to the
left" followed by an "insert item into page X". The trick is that for
multi-level splits you have to split from the top down. Then each split
becomes a simple loggable operation.

But, I think in the live system we split from the bottom up (the split
and insert is a single operation), so I don't think you can actually
combine the current split algorithm with the logging operation I
suggest above.
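
A self-contained sketch of that decomposition (toy pages, hypothetical
names): even a two-level split becomes a flat sequence of "split page X,
keeping items {S} on the left" and "insert item into page X" operations,
applied top-down.

#include <stdio.h>
#include <stdbool.h>

#define MAX_PAGES 8
#define MAX_ITEMS 8

typedef struct Page { int items[MAX_ITEMS]; int nitems; } Page;

typedef enum { OP_SPLIT, OP_INSERT } OpType;

typedef struct Op
{
    OpType type;
    int    page;                  /* page the operation applies to        */
    int    new_page;              /* OP_SPLIT: id of the new right page   */
    bool   keep_left[MAX_ITEMS];  /* OP_SPLIT: items staying on the left  */
    int    item;                  /* OP_INSERT: the item to add           */
} Op;

static Page pages[MAX_PAGES];

static void apply(const Op *op)
{
    Page *p = &pages[op->page];

    if (op->type == OP_INSERT)
    {
        p->items[p->nitems++] = op->item;
        return;
    }

    /* OP_SPLIT: move everything not marked keep_left to the new page. */
    Page *r = &pages[op->new_page];
    int kept = 0;
    for (int i = 0; i < p->nitems; i++)
    {
        if (op->keep_left[i])
            p->items[kept++] = p->items[i];
        else
            r->items[r->nitems++] = p->items[i];
    }
    p->nitems = kept;
}

int main(void)
{
    pages[0] = (Page){ { 100, 200, 300, 400 }, 4 };   /* a full "parent" */
    pages[1] = (Page){ { 110, 120, 130, 140 }, 4 };   /* a full "child"  */

    /* Top-down: split the parent, insert the downlink for the child's new
     * sibling, then split the child and insert the triggering tuple.    */
    Op oplog[] = {
        { OP_SPLIT,  0, 2, { true, true, false, false }, 0   },
        { OP_INSERT, 0, 0, { false },                     130 },  /* downlink */
        { OP_SPLIT,  1, 3, { true, true, false, false }, 0   },
        { OP_INSERT, 1, 0, { false },                     115 },  /* the tuple */
    };
    int nops = (int)(sizeof(oplog) / sizeof(oplog[0]));

    for (int i = 0; i < nops; i++)
        apply(&oplog[i]);

    for (int p = 0; p < 4; p++)
        printf("page %d now holds %d items\n", p, pages[p].nitems);
    return 0;
}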

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy

#16Decibel!
decibel@decibel.org
In reply to: Manolo _ (#3)
Re: Implementing Sorting Refinements

You'll get a better response if you don't hijack threads. :) Also,
there's nothing in here that describes what the benefits of this
change would be.

Show quoted text

--
Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

#17Noname
mac_man2005@hotmail.it
In reply to: Simon Riggs (#1)
Re: Implementing Sorting Refinements

Well, sorry for hijacking... ummm how did I do that?

Anyway I'll thank you for giving a "sign of life" when I was almost losing
hope of getting any kind of answer from "-hackers".

I suppose the lack of answers was due to the way I wrote my mail. At that
moment I supposed that at least someone remembered the 2WRS technique and
the possible benefits described in previous posts.
I think I was wrong, so I'll write it once again, hoping meanwhile to get
some suggestions on how to write a mail to which "-hackers" will give an
answer :) hehehehe

Thanks for your attention.
Manolo.

--------------------------------------------------
From: "Decibel!" <decibel@decibel.org>
Sent: Tuesday, January 08, 2008 12:34 AM
To: "Manolo _" <mac_man2005@hotmail.it>
Cc: <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Implementing Sorting Refinements

Show quoted text


#18Guillaume Smet
guillaume.smet@gmail.com
In reply to: Noname (#17)
Re: Implementing Sorting Refinements

On Jan 8, 2008 1:04 AM, <mac_man2005@hotmail.it> wrote:

Well, sorry for hijacking... ummm how did I do that?

Anyway I'll thank you for giving a "sign of life" when I was almost losing
hope of getting any kind of answer from "-hackers".

Don't forget that we're just a few days/weeks away from the 8.3 release, so
attention is a bit focused on it at the moment (and I'm not speaking
of today's security releases).

Don't feel disappointed about the lack of attention at the moment;
just post your proposal again after the 8.3 release, explain what you plan
to do and why, and perhaps you'll have the time to write a mock-up and get
some numbers to prove your point before then. That could help too.

Keep up the good work.

Regards,

--
Guillaume

#19Tomasz Ostrowski
tometzky@batory.org.pl
In reply to: Noname (#17)
Re: Implementing Sorting Refinements

On Tue, 08 Jan 2008, mac_man2005@hotmail.it wrote:

Well, sorry for hijacking... ummm how did I do that?

You replied to an existing post instead of creating a new, unrelated
e-mail. That is different.

Just try using the threaded mode of your e-mail client and you'll get
the idea.

Regards
Tometzky
--
...although Eating Honey was a very good thing to do, there was a
moment just before you began to eat it which was better than when you
were...
Winnie the Pooh