Fixing r-tree semantics

Started by Tom Laneover 20 years ago15 messages
#1Tom Lane
tgl@sss.pgh.pa.us

I looked into the r-tree breakage discussed in this thread:
http://archives.postgresql.org/pgsql-general/2004-03/msg01135.php

The executive summary: r-tree index opclasses contain four two-dimensional
operators, which behave correctly, and four one-dimensional operators
which do not. There is a basic logic error in the handling of the 1-D
operators. One could also legitimately ask why the opclasses don't cover
both directions (X and Y) for 1-D inquiries. The same problems exist in
the contrib/rtree_gist implementation, because it just copied the r-tree
logic without inquiring too closely into it :-(

We currently have built-in opclasses for types "box" and "polygon", both
of which are fundamentally 2-D objects. The 2-D operators that the r-tree
opclasses handle are:
~= same (ordinary equality)
&& overlaps
~ contains
@ is contained by
with pretty much the intuitive definitions of these things. The 1-D
operators in the opclasses are
<< left l.max_x < r.min_x

right l.min_x > r.max_x

&< overleft l.max_x <= r.max_x
&> overright l.min_x >= r.min_x
(I'm not here to argue about whether these definitions are right in
detail, particularly about the equality cases; that's the way it's been
since Berkeley and I'm not proposing to change them.)

Now the problem is that given a query box, say "index_col << some_box",
the rtree code has to decide whether to descend to a child page of the
index tree based on what is in the parent index page's entry for the
child --- and what is there is the union, or minimum combined bounding
box, of the boxes or polygons in the child. So the test for descending
is not the same as the test for whether a leaf index entry actually
matches the query. Fine ... but somebody got this wrong long ago.
If you think about it, the criterion for descending during a << (left)
search is properly
union_box.min_x < query.min_x
If this is true, there might be boxes within the union that satisfy
the << requirement (box.max_x < query.min_x); if it is not true then
there clearly can be no such boxes. So, given the available operators,
the correct test for descending is "!overright". But the actual test
being done, according to rtstrat.c, is "overleft". This causes the search
to fail to search child pages that should be searched (and probably also
to waste time searching pages that shouldn't be searched). The observed
result is, not surprisingly, that the indexscan fails to find some rows
it should find.

In the same way, the correct descent tests for the other operators are
overleft: !right
right: !overleft
overright: !left
overlaps: overlaps
same: contains
contains: contains
contained: overlaps
rtstrat.c gets the first three of these wrong, but the last four cases
covering the 2-D operators are correct.

This analysis explains why we've not heard more complaints about such a
fundamental breakage: the cases most people care about, when using an
r-tree, are 2-D inquiries. And what's more, the default selectivity
estimates for 1-D inquiries are so low that the index never got used.
In 8.1 this will change, because a bitmap index scan looks attractive
to the planner even at rather low selectivity --- which means that we
are probably going to hear more complaints, if we don't fix it.

Fixing the existing operators seems relatively straightforward; there will
need to be some extension to rtstrat.c to represent "NOT this operator"
as well as "this operator", but that's not hard. I plan to do this, and
make the corresponding fixes in contrib/rtree_gist as well.

What needs more discussion is that it seems to me to make sense to extend
the standard opclasses to handle the four Y-direction operators comparable
to the X-direction operators that are already there, that is "above",
"below", "overabove", and "overbelow". The polygon type has none of these
operators at the moment. Box has
<^ below l.max_y <= r.min_y

^ above l.min_y >= r.max_y

but not the overlap variants.

If you compare these to the X-direction versions:
<< left l.max_x < r.min_x

right l.min_x > r.max_x

there are two obvious discrepancies: the names aren't very similar and
the equality cases are handled differently.

We could incorporate the existing box_above and box_below operators into
an r-tree opclass if we defined overabove and overbelow to not match on
the equality case:
overbelow l.max_y < r.max_y
overabove l.min_y > r.min_y
However, it seems just plain weird to me to have different edge-case
behaviors in the X and Y directions. So I don't much care for that.
I would prefer that the operators added to the opclasses act the same
in both directions.

We could avoid any backwards-compatibility complaints if we left
those two operators alone (maybe redocumenting them as "below or touching"
and "above or touching", though these descriptions are a bit misleading)
and invented four new operators to be the Y-direction opclass members,
say
<<^ below l.max_y < r.min_y

^ above l.min_y > r.max_y

&<^ overbelow l.max_y <= r.max_y
&>^ overabove l.min_y >= r.min_y
This has a lot to recommend it: backwards compatibility and operator names
that line up with the X-direction case. On the other hand, it'll confuse
people to have operators that are so close in function, and having one be
indexable and the other not seems like a gotcha.

Plan C would be to just change the above and below operators, on the
grounds that it is an obvious typo that they don't match left and right
to begin with. We have made greater changes in the behavior of geometric
operators in the past, so this isn't an obviously bogus choice.

Not quite sure what to do, but I'd like to do something with it.
Thoughts?

regards, tom lane

#2Andrew - Supernews
andrew+nonews@supernews.com
In reply to: Tom Lane (#1)
Re: Fixing r-tree semantics

On 2005-06-23, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I looked into the r-tree breakage discussed in this thread:
http://archives.postgresql.org/pgsql-general/2004-03/msg01135.php

See also http://archives.postgresql.org/pgsql-bugs/2005-01/msg00328.php
in which I made most of the same points.

Notice also that contrib/seg and contrib/cube have their own, and
incompatible, idea of what the semantics of &< and &> should be.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

#3Michael Fuhr
mike@fuhr.org
In reply to: Tom Lane (#1)
Re: Fixing r-tree semantics

On Thu, Jun 23, 2005 at 05:59:25PM -0400, Tom Lane wrote:

Fixing the existing operators seems relatively straightforward; there will
need to be some extension to rtstrat.c to represent "NOT this operator"
as well as "this operator", but that's not hard. I plan to do this, and
make the corresponding fixes in contrib/rtree_gist as well.

Excellent. If the fix is straightforward, is it going to be
backpatched at least to 8.0? Or is backpatching not worthwhile,
considering that hardly anybody stumbles across the problem or
complains about it?

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Fuhr (#3)
Re: Fixing r-tree semantics

Michael Fuhr <mike@fuhr.org> writes:

On Thu, Jun 23, 2005 at 05:59:25PM -0400, Tom Lane wrote:

Fixing the existing operators seems relatively straightforward; there will
need to be some extension to rtstrat.c to represent "NOT this operator"
as well as "this operator", but that's not hard. I plan to do this, and
make the corresponding fixes in contrib/rtree_gist as well.

Excellent. If the fix is straightforward, is it going to be
backpatched at least to 8.0? Or is backpatching not worthwhile,
considering that hardly anybody stumbles across the problem or
complains about it?

In principle it could be backpatched, because this is just a change in
the search behavior and not a change in either catalog entries or rtree
index contents; hence no initdb needed. However, given that the
behavior has been broken since the rtree code was written and nobody
noticed except bwhite, I think it's pretty low-priority to back-patch.
I find it more significant for 8.1 because (a) it's now more likely that
indexscans will get used for these queries, and (b) I'm thinking we
really ought to fold rtree_gist into the core so that we have at least
some built-in gist opclasses (for testing purposes if nothing else).
I started out looking for the bug in rtree_gist, and eventually realized
that it had slavishly copied rtree's bug...

regards, tom lane

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew - Supernews (#2)
Re: Fixing r-tree semantics

Andrew - Supernews <andrew+nonews@supernews.com> writes:

On 2005-06-23, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I looked into the r-tree breakage discussed in this thread:
http://archives.postgresql.org/pgsql-general/2004-03/msg01135.php

See also http://archives.postgresql.org/pgsql-bugs/2005-01/msg00328.php
in which I made most of the same points.

So you did --- I had forgotten. Good to see that we arrived at the same
conclusions.

Notice also that contrib/seg and contrib/cube have their own, and
incompatible, idea of what the semantics of &< and &> should be.

Um. Not sure what to do about these ... any opinions?

regards, tom lane

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Fuhr (#3)
Re: Fixing r-tree semantics

Hmmm ... just when you thought it was safe to go back in the water ...

I was only looking closely at the "box" case earlier today, assuming
that the polygon code was set up identically. Well, it isn't. In
particular it appears that the poly_overleft and poly_overright
definitions are different from box's, which means that rtrees are
still broken for polygon searches.

I'm of the opinion that this is a flat-out bug and we should just
fix it, ie, change the operator definitions. The polygon definitions
aren't even self-consistent (overleft accepts equality and overright
doesn't).

poly_left
result = polya->boundbox.high.x < polyb->boundbox.low.x;
poly_overleft:
result = polya->boundbox.low.x <= polyb->boundbox.high.x;
poly_right:
result = polya->boundbox.low.x > polyb->boundbox.high.x;
poly_overright:
result = polya->boundbox.high.x > polyb->boundbox.low.x;

By analogy to the box case these should be

poly_overleft:
result = polya->boundbox.high.x <= polyb->boundbox.high.x;
poly_overright:
result = polya->boundbox.low.x >= polyb->boundbox.low.x;

regards, tom lane

#7William White
bwhite@frognet.net
In reply to: Tom Lane (#4)
Re: Fixing r-tree semantics

Tom Lane wrote:

However, given that the
behavior has been broken since the rtree code was written and nobody
noticed except bwhite, I think it's pretty low-priority to back-patch.

Well, leave it to me to find the obscure bugs in other people's code,
and miss the blatant ones in my own.

It's been awhile since I've looked at this and I've pretty much swapped
my PG interals knowledge out of my brain. As I recall you can (or
could) demonstrate the error with the default test suite, but you have
to forcibly override the search strategy cost constants so that PG will
actually do r-tree index searches (or maybe it was comparisons, it's
been awhile) instead of sequential scan. Check the thread, I think I
did send in a test case. In reality, with the default constants, you'd
need a rather large set before you saw any problems.

If anyone is curious, my intended application was time intervals. That
is to say, *real* mathematical intervals with two rational numbers as
endpoints, not just durations (displacements) which as I recall is what
SQL time "intervals" actually are. Frankly, I've always considered it a
serious oversight that PG doesn't provide this as a native type with its
own indexing and operators, especially given that the default r-tree
operators don't really work with right-open intervals (though that's
another rant). In any case 1D was pretty much universal, barring
radical changes to the spacetime continuum. I abandoned the project,
but not before writing a general set of 1D interval operators to handle
all cases of open and closed endpoints.

I was under the impression that a fix had already been applied, but it's
nice to see it happen. I say this because we discussed possible fixes
-- either changes to operator semantics or new operators -- and someone
with a wizard hat nodded in agreement.

-- Bill

#8Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#6)
Re: Fixing r-tree semantics

FYI, TODO has:

* Fix incorrect rtree results due to wrong assumptions about "over"
operator semantics [rtree]

---------------------------------------------------------------------------

Tom Lane wrote:

Hmmm ... just when you thought it was safe to go back in the water ...

I was only looking closely at the "box" case earlier today, assuming
that the polygon code was set up identically. Well, it isn't. In
particular it appears that the poly_overleft and poly_overright
definitions are different from box's, which means that rtrees are
still broken for polygon searches.

I'm of the opinion that this is a flat-out bug and we should just
fix it, ie, change the operator definitions. The polygon definitions
aren't even self-consistent (overleft accepts equality and overright
doesn't).

poly_left
result = polya->boundbox.high.x < polyb->boundbox.low.x;
poly_overleft:
result = polya->boundbox.low.x <= polyb->boundbox.high.x;
poly_right:
result = polya->boundbox.low.x > polyb->boundbox.high.x;
poly_overright:
result = polya->boundbox.high.x > polyb->boundbox.low.x;

By analogy to the box case these should be

poly_overleft:
result = polya->boundbox.high.x <= polyb->boundbox.high.x;
poly_overright:
result = polya->boundbox.low.x >= polyb->boundbox.low.x;

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#9Oleg Bartunov
oleg@sai.msu.su
In reply to: Tom Lane (#1)
Re: Fixing r-tree semantics

On Thu, 23 Jun 2005, Tom Lane wrote:

both directions (X and Y) for 1-D inquiries. The same problems exist in
the contrib/rtree_gist implementation, because it just copied the r-tree
logic without inquiring too closely into it :-(

you're right, it was the beginning of our GiST experiments. Later we were
interested in split algorithm and we never actually used rtree because we
have used pg_sphere for working with spherical data. Glad to see
you fixed these longstanding problem ! Does it means we could expect
rtree_gist opclasses moved to core ?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

#10Mark Cave-Ayland
m.cave-ayland@webbased.co.uk
In reply to: Oleg Bartunov (#9)
Re: Fixing r-tree semantics

Hi Tom,

What needs more discussion is that it seems to me to make sense to extend

the standard

opclasses to handle the four Y-direction operators comparable to the

X-direction

operators that are already there, that is "above", "below", "overabove",

and

"overbelow".

As part of PostGIS (http://postgis.refractions.net), I submitted a patch a
while back to add additional Y-direction operators to the code which is a
slightly modified version of rtree_gist (and yes, the PostGIS implementation
will suffer from the same issues you've found with the existing R-tree
implementations).

The operators I went for were as follows:

A &<| B - true if A's bounding box overlaps or is below B's bounding
box
A |&> B - true if B's bounding box overlaps or is above B's bounding
box
A <<| B - true if A's bounding box is strictly below B's bounding
box
A |>> B - true if A's bounding box is strictly above B's bounding
box

Since the rtree_gist implementation and operators were 2D, my thoughts were
to use another op-class only if the indexing were upgraded to 3D. So with
this in mind, I created the following new GiST strategies:

#define RTOverBelowStrategyNumber 9
#define RTBelowStrategyNumber 10
#define RTAboveStrategyNumber 11
#define RTOverAboveStrategyNumber 12

This is basically what you are suggesting but with a | instead of a ^ in the
operator name (my original choice was to try and use } to indicate the
positive sense of the Y-axis but this was not accepted as a valid operator
character which is why I changed to |).

It would be harder for us to change these operators since they already
exist, but then again it would be useful from a maintenance point of view to
keep the strategy numbers and operators the same across both
implementations. Of course strategy numbers are just used internally so
these aren't such a big issue - it's more the choice of operators that would
be useful to agree on.

Kind regards,

Mark.

------------------------
WebBased Ltd
17 Research Way
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Cave-Ayland (#10)
Re: Fixing r-tree semantics

"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:

The operators I went for were as follows:

A &<| B - true if A's bounding box overlaps or is below B's bounding
box
A |&> B - true if B's bounding box overlaps or is above B's bounding
box
A <<| B - true if A's bounding box is strictly below B's bounding
box
A |>> B - true if A's bounding box is strictly above B's bounding
box

Well, I was proposing more or less that but with ^ because of the
precedent of the two existing box_above/box_below operator names.
However, I'm quite happy to adopt your names, since that's probably
a more widely used precedent. Sold, unless there are objections.

(BTW, it does look a bit odd that the "|" moves around in your names.
But I don't dislike it enough to not follow the precedent.)

It would be harder for us to change these operators since they already
exist, but then again it would be useful from a maintenance point of view to
keep the strategy numbers and operators the same across both
implementations.

Agreed, I'll use your strategy number assignments too.

regards, tom lane

#12Mark Cave-Ayland
m.cave-ayland@webbased.co.uk
In reply to: Tom Lane (#11)
Re: Fixing r-tree semantics

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: 24 June 2005 14:27
To: Mark Cave-Ayland (External)
Cc: bwhite@frognet.net; teodor@sigaev.ru; oleg@sai.msu.su;
pgsql-hackers@postgresql.org; 'PostGIS Development Discussion'
Subject: Re: Fixing r-tree semantics

(cut)

Well, I was proposing more or less that but with ^ because of
the precedent of the two existing box_above/box_below
operator names. However, I'm quite happy to adopt your names,
since that's probably a more widely used precedent. Sold,
unless there are objections.

(BTW, it does look a bit odd that the "|" moves around in
your names. But I don't dislike it enough to not follow the
precedent.)

The thinking behind it was that the | represents the x-axis if you tilt you
head right 90 degrees. In effect, it would allow you to 'read' the operator
without having to go and look up what it does.

It would be harder for us to change these operators since they already
exist, but then again it would be useful from a maintenance point of
view to keep the strategy numbers and operators the same across both
implementations.

Agreed, I'll use your strategy number assignments too.

Alright no problems.

Many thanks,

Mark.

------------------------
WebBased Ltd
17 Research Way
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 797131
F: +44 (0)1752 791023
W: http://www.webbased.co.uk

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#5)
Re: Fixing r-tree semantics

I wrote:

Andrew - Supernews <andrew+nonews@supernews.com> writes:

Notice also that contrib/seg and contrib/cube have their own, and
incompatible, idea of what the semantics of &< and &> should be.

Um. Not sure what to do about these ... any opinions?

Having looked at this, I propose the following:

contrib/seg: fix the semantics of &< and &> to agree with box's
semantics. There's no obvious usefulness to the way these operators
are defined now, and since the code is using the former rtree indexing
logic, they are clearly broken as-is.

contrib/cube: I quote from cube.c:

/* The following four methods compare the projections of the boxes
onto the 0-th coordinate axis. These methods are useless for dimensions
larger than 2, but it seems that R-tree requires all its strategies
map to real functions that return something */

Now that the module uses GIST instead of r-tree, there's no very strong
reason why it should provide these operators at all. I propose removing
all of << >> &< &> from contrib/cube, leaving only the four
n-dimensional indexing operators (&& ~= ~ @).

Any objections?

regards, tom lane

#14Bruno Wolff III
bruno@wolff.to
In reply to: Tom Lane (#13)
Re: Fixing r-tree semantics

On Sun, Jun 26, 2005 at 09:52:03 -0400,
Tom Lane <tgl@sss.pgh.pa.us> wrote:

Now that the module uses GIST instead of r-tree, there's no very strong
reason why it should provide these operators at all. I propose removing
all of << >> &< &> from contrib/cube, leaving only the four
n-dimensional indexing operators (&& ~= ~ @).

Any objections?

I seem to remember there being a problem if <, <=, > and >= operators
didn't exist and doing some operations (distinct or group by?) that
required sorting the data type. I am not sure that you are suggesting
that these operators be removed, as you didn't list them in either the
remove or keep list above.

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruno Wolff III (#14)
Re: Fixing r-tree semantics

Bruno Wolff III <bruno@wolff.to> writes:

I seem to remember there being a problem if <, <=, > and >= operators
didn't exist and doing some operations (distinct or group by?) that
required sorting the data type. I am not sure that you are suggesting
that these operators be removed,

No, I wasn't.

regards, tom lane