Research/Implementation of Nested Loop Join optimization

Started by Manoel Henriqueover 17 years ago16 messages
#1Manoel Henrique
mhenriquesgbd@gmail.com

Hi! I`m a researcher from PUC-Rio (Brazil) and we`re studying about an
Joins, and we`d like to implement an optimization on the Nested Loop Join,
this optimization consists on while scanning the inner table, the iteration
would go from up-down then backwards(down-up) to take advantage of the
buffer pages in memory.
We`d work with MaterialScan and only NestedLoop (we`re dropping all indexes
and keys to make it this way). The research objective is to show some
students how a DBMS works.

Does PostgreSQL already works this way?
Is it possible to implement such thing? Is it easy? how hard?

Thank you in advance,
Manoel Henrique Souza.

#2Dann Corbit
DCorbit@connx.com
In reply to: Manoel Henrique (#1)
Re: Research/Implementation of Nested Loop Join optimization

When you install the source tree (e.g. in folder \postgresql-8.3.x) you
will want to examine nodeMergejoin.c typically found in a path similar
to this:

\postgresql-8.3.x\src\backend\executor\nodeMergejoin.c

Here are the comments from the version on my machine:

/*

* INTERFACE ROUTINES

* ExecMergeJoin
mergejoin outer and inner relations.

* ExecInitMergeJoin
creates and initializes run time states

* ExecEndMergeJoin
cleans up the node.

*

* NOTES

*

* Merge-join is done by joining the inner
and outer tuples satisfying

* join clauses of the form ((= outerKey
innerKey) ...).

* The join clause list is provided by the
query planner and may contain

* more than one (= outerKey innerKey) clause
(for composite sort key).

*

* However, the query executor needs to know
whether an outer

* tuple is "greater/smaller" than an inner
tuple so that it can

* "synchronize" the two relations. For
example, consider the following

* relations:

*

* outer: (0
^1 1 2 5 5 5 6 6 7) current tuple: 1

* inner: (1
^3 5 5 5 5 6) current tuple: 3

*

* To continue the merge-join, the executor
needs to scan both inner

* and outer relations till the matching
tuples 5. It needs to know

* that currently inner tuple 3 is "greater"
than outer tuple 1 and

* therefore it should scan the outer
relation first to find a

* matching tuple and so on.

*

* Therefore, rather than directly executing
the merge join clauses,

* we evaluate the left and right key
expressions separately and then

* compare the columns one at a time (see
MJCompare). The planner

* passes us enough information about the
sort ordering of the inputs

* to allow us to determine how to make the
comparison. We may use the

* appropriate btree comparison function,
since Postgres' only notion

* of ordering is specified by btree
opfamilies.

*

*

* Consider the above relations and suppose
that the executor has

* just joined the first outer "5" with the
last inner "5". The

* next step is of course to join the second
outer "5" with all

* the inner "5's". This requires
repositioning the inner "cursor"

* to point at the first inner "5". This is
done by "marking" the

* first inner 5 so we can restore the
"cursor" to it before joining

* with the second outer 5. The access method
interface provides

* routines to mark and restore to a tuple.

*

*

* Essential operation of the merge join
algorithm is as follows:

*

* Join {

* get initial outer and
inner tuples
INITIALIZE

* do forever {

* while
(outer != inner) {
SKIP_TEST

*
if (outer < inner)

*
advance outer
SKIPOUTER_ADVANCE

*
else

*
advance inner
SKIPINNER_ADVANCE

* }

* mark inner
position
SKIP_TEST

* do forever
{

*
while (outer == inner) {

*
join tuples
JOINTUPLES

*
advance inner position
NEXTINNER

*
}

*
advance outer position
NEXTOUTER

*
if (outer == mark)
TESTOUTER

*
restore inner position to mark TESTOUTER

*
else

*
break // return to top of outer loop

* }

* }

* }

*

* The merge join operation is coded in the
fashion

* of a state machine. At each state, we do
something and then

* proceed to another state. This state is
stored in the node's

* execution state information and is
preserved across calls to

* ExecMergeJoin. -cim 10/31/89

*/

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Manoel Henrique
Sent: Wednesday, July 23, 2008 1:17 PM
To: pgsql-hackers@postgresql.org
Subject: [HACKERS] Research/Implementation of Nested Loop Join
optimization

Hi! I`m a researcher from PUC-Rio (Brazil) and we`re studying about an
Joins, and we`d like to implement an optimization on the Nested Loop
Join, this optimization consists on while scanning the inner table, the
iteration would go from up-down then backwards(down-up) to take
advantage of the buffer pages in memory.

We`d work with MaterialScan and only NestedLoop (we`re dropping all
indexes and keys to make it this way). The research objective is to show
some students how a DBMS works.

Does PostgreSQL already works this way?

Is it possible to implement such thing? Is it easy? how hard?

Thank you in advance,

Manoel Henrique Souza.

#3Manoel Henrique
mhenriquesgbd@gmail.com
In reply to: Dann Corbit (#2)
Re: Research/Implementation of Nested Loop Join optimization

The nodeMergejoin.c is the code for the Merge Join isn`t it? I am trying to
find a way to change the Nested Loop Join, It would be more like on
nodeNestloop.c when rescanning the inner plan, (second time scanning the
inner plan and so on) he`d change the scan direction, If the scan direction
was from first tuple to last tuple it would go backwards, if it was from
last to first it would go forward... The code I`m looking atm is from 8.3.1
, seems to have some kind of direction manager but doesn`t seems to be in
use.

--Manoel

On Wed, Jul 23, 2008 at 5:23 PM, Dann Corbit <DCorbit@connx.com> wrote:

Show quoted text

When you install the source tree (e.g. in folder \postgresql-8.3.x) you
will want to examine nodeMergejoin.c typically found in a path similar to
this:

\postgresql-8.3.x\src\backend\executor\nodeMergejoin.c

Here are the comments from the version on my machine:

/*

* INTERFACE ROUTINES

* ExecMergeJoin
mergejoin outer and inner relations.

* ExecInitMergeJoin
creates and initializes run time states

* ExecEndMergeJoin
cleans up the node.

*

* NOTES

*

* Merge-join is done by joining the inner and
outer tuples satisfying

* join clauses of the form ((= outerKey
innerKey) ...).

* The join clause list is provided by the query
planner and may contain

* more than one (= outerKey innerKey) clause
(for composite sort key).

*

* However, the query executor needs to know
whether an outer

* tuple is "greater/smaller" than an inner
tuple so that it can

* "synchronize" the two relations. For example,
consider the following

* relations:

*

* outer: (0 ^1
1 2 5 5 5 6 6 7) current tuple: 1

* inner: (1 ^3
5 5 5 5 6) current tuple: 3

*

* To continue the merge-join, the executor
needs to scan both inner

* and outer relations till the matching tuples
5. It needs to know

* that currently inner tuple 3 is "greater"
than outer tuple 1 and

* therefore it should scan the outer relation
first to find a

* matching tuple and so on.

*

* Therefore, rather than directly executing the
merge join clauses,

* we evaluate the left and right key
expressions separately and then

* compare the columns one at a time (see
MJCompare). The planner

* passes us enough information about the sort
ordering of the inputs

* to allow us to determine how to make the
comparison. We may use the

* appropriate btree comparison function, since
Postgres' only notion

* of ordering is specified by btree opfamilies.

*

*

* Consider the above relations and suppose that
the executor has

* just joined the first outer "5" with the last
inner "5". The

* next step is of course to join the second
outer "5" with all

* the inner "5's". This requires repositioning
the inner "cursor"

* to point at the first inner "5". This is done
by "marking" the

* first inner 5 so we can restore the "cursor"
to it before joining

* with the second outer 5. The access method
interface provides

* routines to mark and restore to a tuple.

*

*

* Essential operation of the merge join
algorithm is as follows:

*

* Join {

* get initial outer and inner
tuples
INITIALIZE

* do forever {

* while (outer
!= inner) {
SKIP_TEST

*
if (outer < inner)

*
advance
outer
SKIPOUTER_ADVANCE

*
else

*
advance
inner
SKIPINNER_ADVANCE

* }

* mark inner
position
SKIP_TEST

* do forever {

*
while (outer == inner) {

*
join
tuples
JOINTUPLES

*
advance inner position
NEXTINNER

*
}

*
advance outer
position
NEXTOUTER

*
if (outer ==
mark)
TESTOUTER

*
restore inner position to mark TESTOUTER

*
else

*
break // return to top of outer loop

* }

* }

* }

*

* The merge join operation is coded in the
fashion

* of a state machine. At each state, we do
something and then

* proceed to another state. This state is
stored in the node's

* execution state information and is preserved
across calls to

* ExecMergeJoin. -cim 10/31/89

*/

*From:* pgsql-hackers-owner@postgresql.org [mailto:
pgsql-hackers-owner@postgresql.org] *On Behalf Of *Manoel Henrique
*Sent:* Wednesday, July 23, 2008 1:17 PM
*To:* pgsql-hackers@postgresql.org
*Subject:* [HACKERS] Research/Implementation of Nested Loop Join
optimization

Hi! I`m a researcher from PUC-Rio (Brazil) and we`re studying about an
Joins, and we`d like to implement an optimization on the Nested Loop Join,
this optimization consists on while scanning the inner table, the iteration
would go from up-down then backwards(down-up) to take advantage of the
buffer pages in memory.

We`d work with MaterialScan and only NestedLoop (we`re dropping all indexes
and keys to make it this way). The research objective is to show some
students how a DBMS works.

Does PostgreSQL already works this way?

Is it possible to implement such thing? Is it easy? how hard?

Thank you in advance,

Manoel Henrique Souza.

#4Dann Corbit
DCorbit@connx.com
In reply to: Manoel Henrique (#3)
Re: Research/Implementation of Nested Loop Join optimization

From: Manoel Henrique [mailto:mhenriquesgbd@gmail.com]
Sent: Wednesday, July 23, 2008 1:47 PM
To: Dann Corbit
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Research/Implementation of Nested Loop Join
optimization

The nodeMergejoin.c is the code for the Merge Join isn`t it? I am trying
to find a way to change the Nested Loop Join, It would be more like on
nodeNestloop.c when rescanning the inner plan, (second time scanning the
inner plan and so on) he`d change the scan direction, If the scan
direction was from first tuple to last tuple it would go backwards, if
it was from last to first it would go forward... The code I`m looking
atm is from 8.3.1 , seems to have some kind of direction manager but
doesn`t seems to be in use.

You are right. You want file: nodeNestloop.c

<<

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manoel Henrique (#3)
Re: Research/Implementation of Nested Loop Join optimization

"Manoel Henrique" <mhenriquesgbd@gmail.com> writes:

The nodeMergejoin.c is the code for the Merge Join isn`t it? I am trying to
find a way to change the Nested Loop Join, It would be more like on
nodeNestloop.c when rescanning the inner plan, (second time scanning the
inner plan and so on) he`d change the scan direction, If the scan direction
was from first tuple to last tuple it would go backwards, if it was from
last to first it would go forward... The code I`m looking atm is from 8.3.1
, seems to have some kind of direction manager but doesn`t seems to be in
use.

I find this a bit dubious. If the inner rel is small enough to fit in
memory then it buys nothing. If not, then you win only to the extent
that a pretty large fraction of the inner rel fits in memory. In any
case you are relying on the assumption that backwards scan is just as
efficient as forward scan, which seems to me to be a pretty large
assumption --- we expect forward seqscans to get a performance boost
from kernel readahead, but I'd be surprised if the kernel recognized
what was happening in a backwards scan.

Note also that backwards scan doesn't work at all in some plan
node types (cf ExecSupportsBackwardScan). You'd need to check
what the inner input node was before trying this.

regards, tom lane

#6Manoel Henrique
mhenriquesgbd@gmail.com
In reply to: Tom Lane (#5)
Re: Research/Implementation of Nested Loop Join optimization

Yes, I'm relying on the assumption that backwards scan has the same cost as
forward scan, why shouldn't it?

Yet, all plan node types we are testing works with backwards scan (looking
on ExecSupportsBackwardScan). But, is there a easy way to make a query
execute only in backwards scan? How we can do that? Our first objective is
to make a backwards scan only and then test a forward-and-backward scan.

-- Manoel

On Thu, Jul 24, 2008 at 2:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

"Manoel Henrique" <mhenriquesgbd@gmail.com> writes:

The nodeMergejoin.c is the code for the Merge Join isn`t it? I am trying

to

find a way to change the Nested Loop Join, It would be more like on
nodeNestloop.c when rescanning the inner plan, (second time scanning the
inner plan and so on) he`d change the scan direction, If the scan

direction

was from first tuple to last tuple it would go backwards, if it was from
last to first it would go forward... The code I`m looking atm is from

8.3.1

, seems to have some kind of direction manager but doesn`t seems to be in
use.

I find this a bit dubious. If the inner rel is small enough to fit in
memory then it buys nothing. If not, then you win only to the extent
that a pretty large fraction of the inner rel fits in memory. In any
case you are relying on the assumption that backwards scan is just as
efficient as forward scan, which seems to me to be a pretty large
assumption --- we expect forward seqscans to get a performance boost
from kernel readahead, but I'd be surprised if the kernel recognized
what was happening in a backwards scan.

Note also that backwards scan doesn't work at all in some plan
node types (cf ExecSupportsBackwardScan). You'd need to check
what the inner input node was before trying this.

regards, tom lane

#7Gregory Stark
stark@enterprisedb.com
In reply to: Manoel Henrique (#6)
Re: Research/Implementation of Nested Loop Join optimization

"Manoel Henrique" <mhenriquesgbd@gmail.com> writes:

Yes, I'm relying on the assumption that backwards scan has the same cost as
forward scan, why shouldn't it?

Because hard drives only spin one direction

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

#8Jonah H. Harris
jonah.harris@gmail.com
In reply to: Gregory Stark (#7)
Re: Research/Implementation of Nested Loop Join optimization

On Fri, Jul 25, 2008 at 6:10 PM, Gregory Stark <stark@enterprisedb.com> wrote:

"Manoel Henrique" <mhenriquesgbd@gmail.com> writes:

Yes, I'm relying on the assumption that backwards scan has the same cost as
forward scan, why shouldn't it?

Because hard drives only spin one direction

:)

--
Jonah H. Harris, Senior DBA
myYearbook.com

#9Joshua D. Drake
jd@commandprompt.com
In reply to: Jonah H. Harris (#8)
Re: Research/Implementation of Nested Loop Join optimization

On Fri, 2008-07-25 at 19:31 -0400, Jonah H. Harris wrote:

On Fri, Jul 25, 2008 at 6:10 PM, Gregory Stark <stark@enterprisedb.com> wrote:

"Manoel Henrique" <mhenriquesgbd@gmail.com> writes:

Yes, I'm relying on the assumption that backwards scan has the same cost as
forward scan, why shouldn't it?

Because hard drives only spin one direction

:)

What if you are below the equator?

--
Jonah H. Harris, Senior DBA
myYearbook.com

--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate

#10Alvaro Herrera
alvherre@commandprompt.com
In reply to: Joshua D. Drake (#9)
Re: Research/Implementation of Nested Loop Join optimization

Joshua D. Drake escribi�:

On Fri, 2008-07-25 at 19:31 -0400, Jonah H. Harris wrote:

On Fri, Jul 25, 2008 at 6:10 PM, Gregory Stark <stark@enterprisedb.com> wrote:

"Manoel Henrique" <mhenriquesgbd@gmail.com> writes:

Yes, I'm relying on the assumption that backwards scan has the same cost as
forward scan, why shouldn't it?

Because hard drives only spin one direction

:)

What if you are below the equator?

They spin the same direction here too, thanks :-) (Coriolis does not
affect much in this case)

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#7)
Re: Research/Implementation of Nested Loop Join optimization

Gregory Stark <stark@enterprisedb.com> writes:

"Manoel Henrique" <mhenriquesgbd@gmail.com> writes:

Yes, I'm relying on the assumption that backwards scan has the same cost as
forward scan, why shouldn't it?

Because hard drives only spin one direction

Good joke, but to be serious: we expect that forward scans will result
in the kernel doing read-ahead, which will allow overlapping of
CPU work to process one page with the I/O to bring in the next page.
A backwards scan will get no such overlapping and thus be up to 2X
slower, unless the kernel is smart enough to do read-ahead for
descending-order read requests. Which seems not too probable. A fairly
typical kernel behavior is that read-ahead is triggered by successive
read() requests without any intervening seek(), and this is impossible
for a backward scan.

(Yes, we do optimize out the seek calls in a forward scan. IIRC it's
done in fd.c.)

regards, tom lane

#12Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#11)
Re: Research/Implementation of Nested Loop Join optimization

Tom Lane escribi�:

Gregory Stark <stark@enterprisedb.com> writes:

"Manoel Henrique" <mhenriquesgbd@gmail.com> writes:

Yes, I'm relying on the assumption that backwards scan has the same cost as
forward scan, why shouldn't it?

Because hard drives only spin one direction

Good joke, but to be serious: we expect that forward scans will result
in the kernel doing read-ahead, which will allow overlapping of
CPU work to process one page with the I/O to bring in the next page.

I wonder if this is spoiled (or rather, the backwards case fixed) by the
attempts to call posix_fadvise() on certain types of scan.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#12)
Re: Research/Implementation of Nested Loop Join optimization

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane escribi�:

Good joke, but to be serious: we expect that forward scans will result
in the kernel doing read-ahead, which will allow overlapping of
CPU work to process one page with the I/O to bring in the next page.

I wonder if this is spoiled (or rather, the backwards case fixed) by the
attempts to call posix_fadvise() on certain types of scan.

Yeah, I started wondering about that too after sending off the above.
The fadvise patch might eliminate the distinction ... on platforms where
fadvise exists and actually works well.

regards, tom lane

#14Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Tom Lane (#11)
Re: Research/Implementation of Nested Loop Join optimization

Tom Lane wrote:

Gregory Stark <stark@enterprisedb.com> writes:

"Manoel Henrique" <mhenriquesgbd@gmail.com> writes:

Yes, I'm relying on the assumption that backwards scan has the same cost as
forward scan, why shouldn't it?

G...we expect that forward scans will result
in the kernel doing read-ahead, ...
A backwards scan will get no such overlapping and thus be up to 2X
slower, unless the kernel is smart enough to do read-ahead for
descending-order read requests. Which seems not too probable.

Linux's old adaptive readahead patches claimed to[1]http://lwn.net/Articles/185469/:
It also have methods to detect some less common cases:
- reading backward"
Interestingly the author of that patch used postgres as the example
application that benefits from the patch (30%).

I'm not sure if the backward reading feature got kept
in the simplified on-demand readahead that seems to have
superseded the adaptive readahead stuff in 2.6.23[2]http://kernelnewbies.org/Linux_2_6_23#head-102af265937262a7a21766ae58fddc1a29a5d8d7.

[1]: http://lwn.net/Articles/185469/
[2]: http://kernelnewbies.org/Linux_2_6_23#head-102af265937262a7a21766ae58fddc1a29a5d8d7

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ron Mayer (#14)
Re: Research/Implementation of Nested Loop Join optimization

Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:

Tom Lane wrote:

A backwards scan will get no such overlapping and thus be up to 2X
slower, unless the kernel is smart enough to do read-ahead for
descending-order read requests. Which seems not too probable.

Linux's old adaptive readahead patches claimed to[1]:

I didn't say that there were *no* platforms that could do it.

regards, tom lane

#16Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#11)
Re: Research/Implementation of Nested Loop Join optimization

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

Gregory Stark <stark@enterprisedb.com> writes:

"Manoel Henrique" <mhenriquesgbd@gmail.com> writes:

Yes, I'm relying on the assumption that backwards scan has the same cost as
forward scan, why shouldn't it?

Because hard drives only spin one direction

Good joke, but to be serious: we expect that forward scans will result
in the kernel doing read-ahead, which will allow overlapping of
CPU work to process one page with the I/O to bring in the next page.

Well it wasn't a joke but you're right that it's not the whole picture. But
then neither is considering interleaving of I/O with CPU work.

Hard drives spin in a particular direction which means if you stream I/O
requests to them in that direction you can stream data off the hard drive as
fast as it passes under the read head. That's going to be 50-60MB/s for a
single modern 7200rpm drive.

On the other hand if you send an I/O request for the previous block then you
have to wait a whole rotation before it passes under the head. On a 7200rpm
drive that's over 8ms which is a *lot* of CPU work to interleave. The most
bandwidth you'll be able to get is under 1MB/s.

However there's another reason fadvise helps -- the kernel or the drive gets a
chance to reorder the I/O. If we read-ahead a whole track's worth of I/O
backwards then during the first 8ms latency the kernel has a chance to notice
that it should reorder the queued up requests and do them the right way
around.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!