[PATCH] we have added support for box type in SP-GiST index

Started by Alexander Lebedevover 10 years ago24 messageshackers
Jump to latest
#1Alexander Lebedev
a.lebedev@postgrespro.ru

Hello, Hacker.

* [PATCH] add a box index to sp-gist

We have extended sp-gist with an index that keeps track of boxes

We have used ideas underlying sp-gist range implementation to
represent 2D boxes as points in 4D space. We use quad tree
analogue, but in 4-dimensional space. We call this tree q4d. Each
node of this tree is a box (a point in 4D space) which splits space
in 16 hyperrectangles.

Rationale: r-tree assumes that boxes we're trying to index don't
overlap much. When this assumption fails, r-tree performs badly,
while our proposal to represent a rectangle as a point in 4D space
solves this problem.

NB: the index uses traversalValue introduced in a separate patch.

* [PATCH] add traversalValue in sp-gist

During implementation of box index for sp-gist we saw that we only
keep rectangles, but to determine traversal direction we may need
to know boundaries of a hyperrectangle. So we calculate them
on the fly and store them in traversalValue, available
when traversing child nodes, because otherwise we would't be able to
calculate them from inside the inner_consistent function call.

This patch was written by Teodor Sigaev.

* [PATCH] change reconstructValue -> traversalValue in range_spgist.c

During traversal, local context is
insufficient to pick traversal direction. As of now, this is worked
around with the help of reconstructValue. reconstructValue only
stores data of the same type as a tree node, that is, a range.

We have written a patch that calculates auxillary values and makes
them accessible during traversal.

We then use traversalValue in a new box index and change
range_spgist.c to use it in place of reconstructValue.

NB: apply this patch after traversalValue patch.

Attachments:

traversalValue.patchtext/plain; charset=UTF-8; name=traversalValue.patchDownload+34-0
q4d.patchtext/plain; charset=UTF-8; name=q4d.patchDownload+883-3
range.patchtext/plain; charset=UTF-8; name=range.patchDownload+21-9
#2Oleg Bartunov
oleg@sai.msu.su
In reply to: Alexander Lebedev (#1)
Re: [PATCH] we have added support for box type in SP-GiST index

On Sat, Oct 31, 2015 at 9:49 PM, Alexander Lebedev <a.lebedev@postgrespro.ru

wrote:

Hello, Hacker.

* [PATCH] add a box index to sp-gist

We have extended sp-gist with an index that keeps track of boxes

We have used ideas underlying sp-gist range implementation to
represent 2D boxes as points in 4D space. We use quad tree
analogue, but in 4-dimensional space. We call this tree q4d. Each
node of this tree is a box (a point in 4D space) which splits space
in 16 hyperrectangles.

Rationale: r-tree assumes that boxes we're trying to index don't
overlap much. When this assumption fails, r-tree performs badly,
while our proposal to represent a rectangle as a point in 4D space
solves this problem.

NB: the index uses traversalValue introduced in a separate patch.

* [PATCH] add traversalValue in sp-gist

During implementation of box index for sp-gist we saw that we only
keep rectangles, but to determine traversal direction we may need
to know boundaries of a hyperrectangle. So we calculate them
on the fly and store them in traversalValue, available
when traversing child nodes, because otherwise we would't be able to
calculate them from inside the inner_consistent function call.

This patch was written by Teodor Sigaev.

* [PATCH] change reconstructValue -> traversalValue in range_spgist.c

During traversal, local context is
insufficient to pick traversal direction. As of now, this is worked
around with the help of reconstructValue. reconstructValue only
stores data of the same type as a tree node, that is, a range.

We have written a patch that calculates auxillary values and makes
them accessible during traversal.

We then use traversalValue in a new box index and change
range_spgist.c to use it in place of reconstructValue.

NB: apply this patch after traversalValue patch.

Did you forget to show us performance numbers ?

Show quoted text

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alexander Lebedev (#1)
Re: [PATCH] we have added support for box type in SP-GiST index

Alexander Lebedev wrote:

Hello, Hacker.

* [PATCH] add a box index to sp-gist

I closed this patch as "returned with feedback" because the author
didn't reply in three months.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Teodor Sigaev
teodor@sigaev.ru
In reply to: Alexander Lebedev (#1)
Re: [PATCH] we have added support for box type in SP-GiST index

It's very pity but author is not able to continue work on this patch, and I
would like to raise this flag.

I'd like to add some comments about patches:

traversalValue patch adds arbitrary value assoО©╫iated with branch in SP-GiST tree
walk. Unlike to recostructedValue it could be just pointer, it havn't to be a
regular pgsql type. Also, it could be used independly from reconstructedValue.
This patch is used both following attached patches.

range patch just switchs using reconstructedValue to traversalValue in range
opclasses. reconstructedValue was used just because it was an acceptable
workaround in case of range type. Range opclase stores a full value in leafs and
doesn't need to use reconstructedValue to return tuple in index only scan.
See /messages/by-id/5399.1343512250@sss.pgh.pa.us

q4d patch implements index over boxes using SP-GiST. Basic idea was an
observation, range opclass thinks about one-dimensional ranges as 2D points.
Following this idea, we can think that 2D box (what is 2 points or 4 numbers)
could represent a 4D point. We hoped that this approach will be much more
effective than traditional R-Tree in case of many overlaps in box's collection.
Performance results are coming shortly.

In near future (say, tommorrow) I hope to suggest a opclass over polygons based
on this approach.

Alexander Lebedev wrote:

Hello, Hacker.

* [PATCH] add a box index to sp-gist

We have extended sp-gist with an index that keeps track of boxes

We have used ideas underlying sp-gist range implementation to
represent 2D boxes as points in 4D space. We use quad tree
analogue, but in 4-dimensional space. We call this tree q4d. Each
node of this tree is a box (a point in 4D space) which splits space
in 16 hyperrectangles.

Rationale: r-tree assumes that boxes we're trying to index don't
overlap much. When this assumption fails, r-tree performs badly,
while our proposal to represent a rectangle as a point in 4D space
solves this problem.

NB: the index uses traversalValue introduced in a separate patch.

* [PATCH] add traversalValue in sp-gist

During implementation of box index for sp-gist we saw that we only
keep rectangles, but to determine traversal direction we may need
to know boundaries of a hyperrectangle. So we calculate them
on the fly and store them in traversalValue, available
when traversing child nodes, because otherwise we would't be able to
calculate them from inside the inner_consistent function call.

This patch was written by Teodor Sigaev.

* [PATCH] change reconstructValue -> traversalValue in range_spgist.c

During traversal, local context is
insufficient to pick traversal direction. As of now, this is worked
around with the help of reconstructValue. reconstructValue only
stores data of the same type as a tree node, that is, a range.

We have written a patch that calculates auxillary values and makes
them accessible during traversal.

We then use traversalValue in a new box index and change
range_spgist.c to use it in place of reconstructValue.

NB: apply this patch after traversalValue patch.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

Attachments:

traversalValue-1.patch.gzapplication/x-gzip; name=traversalValue-1.patch.gzDownload
range-1.patch.gzapplication/x-gzip; name=range-1.patch.gzDownload
q4d-1.patch.gzapplication/x-gzip; name=q4d-1.patch.gzDownload
#5Oleg Bartunov
oleg@sai.msu.su
In reply to: Teodor Sigaev (#4)
Re: [PATCH] we have added support for box type in SP-GiST index

On Mon, Feb 15, 2016 at 6:29 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:

It's very pity but author is not able to continue work on this patch, and
I would like to raise this flag.

I'd like to add some comments about patches:

traversalValue patch adds arbitrary value assoсiated with branch in
SP-GiST tree walk. Unlike to recostructedValue it could be just pointer, it
havn't to be a regular pgsql type. Also, it could be used independly from
reconstructedValue. This patch is used both following attached patches.

range patch just switchs using reconstructedValue to traversalValue in
range opclasses. reconstructedValue was used just because it was an
acceptable workaround in case of range type. Range opclase stores a full
value in leafs and doesn't need to use reconstructedValue to return tuple
in index only scan.
See /messages/by-id/5399.1343512250@sss.pgh.pa.us

q4d patch implements index over boxes using SP-GiST. Basic idea was an
observation, range opclass thinks about one-dimensional ranges as 2D points.
Following this idea, we can think that 2D box (what is 2 points or 4
numbers) could represent a 4D point. We hoped that this approach will be
much more effective than traditional R-Tree in case of many overlaps in
box's collection.
Performance results are coming shortly.

I made some benchmarks using two real-life datasets. Streets data set is
consists of 3691620 MBRs of russian streets with not much overlaps. As
expected, spgist is slower than rtree (gist), because of much bigger
number of blocks readed. Build time, however, is faster for spgist than
rtree (~ 1.3 times).

Bounds data set is consists of 7803499 MBRs of some objects and are highly
overlapped, see attached picture (crop with center in Moscow) of boxes.
Create index time (ms):
spgist: 47036.051
rtree: 68491.242

Size:
spgist: 505 MB
rtree: 663 MB
heap: 871 MB

Let's consider a small box in downtown of Moscow: box
'(37.607164,55.756489), (37.607797,55.756725)'.
&& оператор (returns 23958 boxes) :
spgist: 14.649
rtree: 23.715
seqscan: 924.344

@> operator (returns 23868 boxes):

spgist: 14.113
rtree: 26.853
seqscan: 909.147

<@ operator (returns 0 boxes):
spgist: 0..268 ms
rtree: 12.563

My conclusion is that for "spagetti" data new opclass for spgist is good to
have - it's smaller and faster (both, created index and search) that rtree
(gist based).

In near future (say, tommorrow) I hope to suggest a opclass over polygons
based on this approach.

Yes, it'd be interested.

Show quoted text

Alexander Lebedev wrote:

Hello, Hacker.

* [PATCH] add a box index to sp-gist

We have extended sp-gist with an index that keeps track of boxes

We have used ideas underlying sp-gist range implementation to
represent 2D boxes as points in 4D space. We use quad tree
analogue, but in 4-dimensional space. We call this tree q4d. Each
node of this tree is a box (a point in 4D space) which splits space
in 16 hyperrectangles.

Rationale: r-tree assumes that boxes we're trying to index don't
overlap much. When this assumption fails, r-tree performs badly,
while our proposal to represent a rectangle as a point in 4D space
solves this problem.

NB: the index uses traversalValue introduced in a separate patch.

* [PATCH] add traversalValue in sp-gist

During implementation of box index for sp-gist we saw that we only
keep rectangles, but to determine traversal direction we may need
to know boundaries of a hyperrectangle. So we calculate them
on the fly and store them in traversalValue, available
when traversing child nodes, because otherwise we would't be able to
calculate them from inside the inner_consistent function call.

This patch was written by Teodor Sigaev.

* [PATCH] change reconstructValue -> traversalValue in range_spgist.c

During traversal, local context is
insufficient to pick traversal direction. As of now, this is worked
around with the help of reconstructValue. reconstructValue only
stores data of the same type as a tree node, that is, a range.

We have written a patch that calculates auxillary values and makes
them accessible during traversal.

We then use traversalValue in a new box index and change
range_spgist.c to use it in place of reconstructValue.

NB: apply this patch after traversalValue patch.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW:
http://www.sigaev.ru/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachments:

11148387_10207592076787467_3791931596034486026_o.jpgimage/jpeg; name=11148387_10207592076787467_3791931596034486026_o.jpgDownload+7-9
#6David Steele
david@pgmasters.net
In reply to: Teodor Sigaev (#4)
Re: [PATCH] we have added support for box type in SP-GiST index

On 2/15/16 10:29 AM, Teodor Sigaev wrote:

It's very pity but author is not able to continue work on this patch,
and I would like to raise this flag.

I'd like to add some comments about patches:

traversalValue patch adds arbitrary value assoсiated with branch in
SP-GiST tree walk. Unlike to recostructedValue it could be just pointer,
it havn't to be a regular pgsql type. Also, it could be used independly
from reconstructedValue. This patch is used both following attached
patches.

range patch just switchs using reconstructedValue to traversalValue in
range opclasses. reconstructedValue was used just because it was an
acceptable workaround in case of range type. Range opclase stores a full
value in leafs and doesn't need to use reconstructedValue to return
tuple in index only scan.
See /messages/by-id/5399.1343512250@sss.pgh.pa.us

q4d patch implements index over boxes using SP-GiST. Basic idea was an
observation, range opclass thinks about one-dimensional ranges as 2D
points.
Following this idea, we can think that 2D box (what is 2 points or 4
numbers) could represent a 4D point. We hoped that this approach will be
much more effective than traditional R-Tree in case of many overlaps in
box's collection.
Performance results are coming shortly.

It appears that the issues raised in this thread have been addressed but
the patch still has not gone though a real review.

Anybody out there willing to take a crack at a review? All three
patches apply (with whitespace issues).

Thanks,
--
-David
david@pgmasters.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Oleg Bartunov
oleg@sai.msu.su
In reply to: David Steele (#6)
Re: [PATCH] we have added support for box type in SP-GiST index

On Mon, Mar 14, 2016 at 9:01 PM, David Steele <david@pgmasters.net> wrote:

On 2/15/16 10:29 AM, Teodor Sigaev wrote:

It's very pity but author is not able to continue work on this patch,

and I would like to raise this flag.

I'd like to add some comments about patches:

traversalValue patch adds arbitrary value assoсiated with branch in
SP-GiST tree walk. Unlike to recostructedValue it could be just pointer,
it havn't to be a regular pgsql type. Also, it could be used independly
from reconstructedValue. This patch is used both following attached
patches.

range patch just switchs using reconstructedValue to traversalValue in
range opclasses. reconstructedValue was used just because it was an
acceptable workaround in case of range type. Range opclase stores a full
value in leafs and doesn't need to use reconstructedValue to return
tuple in index only scan.
See /messages/by-id/5399.1343512250@sss.pgh.pa.us

q4d patch implements index over boxes using SP-GiST. Basic idea was an
observation, range opclass thinks about one-dimensional ranges as 2D
points.
Following this idea, we can think that 2D box (what is 2 points or 4
numbers) could represent a 4D point. We hoped that this approach will be
much more effective than traditional R-Tree in case of many overlaps in
box's collection.
Performance results are coming shortly.

It appears that the issues raised in this thread have been addressed but
the patch still has not gone though a real review.

Anybody out there willing to take a crack at a review? All three patches
apply (with whitespace issues).

Emre Hasegeli will review the patch.

Show quoted text

Thanks,
--
-David
david@pgmasters.net

#8Emre Hasegeli
emre@hasegeli.com
In reply to: Oleg Bartunov (#7)
Re: [PATCH] we have added support for box type in SP-GiST index

Here are my first comments. I haven't read the actual index
implementation, yet.

I think traversal value is a useful addition. It is nice that the
implementation for the range types is also changed. My questions
about them are:

Do reconstructedValues is required now? Wouldn't it be cleaner to use
the new field on the prefix tree implementation, too?

We haven't had specific memory context for reconstructedValues. Why
is it required for the new field?

+       Memory for <structfield>traversalValues</> should be allocated in
+       the default context, but make sure to switch to
+       <structfield>traversalMemoryContext</> before allocating memory
+       for pointers themselves.

This sentence is not understandable. Shouldn't it say "the memory
should *not* be allocated in the default context"? What are the
"pointers themselves"?

The box opclass is broken because of the floating point arithmetics of
the box type. You can reproduce it with the attached script. We
really need to fix the geometric types, before building more on them.
They fail to serve the only purpose they are useful for, demonstrating
features.

It think the opclass should support the cross data type operators like
box @> point. Ideally we should do it by using multiple opclasses in
an opfamily. The floating problem will bite us again in here, because
some of the operators are not using FP macros.

The tests check not supported operator "@". It should be "<@". It is
nice that the opclass doesn't support long deprecated operators.

We needs tests for the remaining operators and for adding new values
to the index. I am not sure create_index.sql is the best place for
them. Maybe we should use the test scripts of "box".

+ #define LT      -1
+ #define GT       1
+ #define EQ       0

Using these numbers is a very well established pattern. I don't think
macros make the code any more readable.

+ /* SP-GiST API functions */
+ Datum       spg_box_quad_config(PG_FUNCTION_ARGS);
+ Datum       spg_box_quad_choose(PG_FUNCTION_ARGS);
+ Datum       spg_box_quad_picksplit(PG_FUNCTION_ARGS);
+ Datum       spg_box_quad_inner_consistent(PG_FUNCTION_ARGS);
+ Datum       spg_box_quad_leaf_consistent(PG_FUNCTION_ARGS);

I guess they should go to the header file.

Attachments:

box-index-test.sqlapplication/octet-stream; name=box-index-test.sqlDownload
#9Teodor Sigaev
teodor@sigaev.ru
In reply to: Emre Hasegeli (#8)
Re: [PATCH] we have added support for box type in SP-GiST index

Do reconstructedValues is required now? Wouldn't it be cleaner to use
the new field on the prefix tree implementation, too?

reconstructedValues is needed to reconctruct full indexed value and should match
to type info indexed column

We haven't had specific memory context for reconstructedValues. Why
is it required for the new field?

because pg knows type of reconstructedValues (see above why) but traversalValue
just a void*, SP-GiST core doesn't knnow how to free memory of allocated for it.

+       Memory for <structfield>traversalValues</> should be allocated in
+       the default context, but make sure to switch to
+       <structfield>traversalMemoryContext</> before allocating memory
+       for pointers themselves.

This sentence is not understandable. Shouldn't it say "the memory
should *not* be allocated in the default context"? What are the
"pointers themselves"?

Clarified, I hope

The box opclass is broken because of the floating point arithmetics of
the box type. You can reproduce it with the attached script. We

hope, fixed. At least, seems, syncronized with seqscan

really need to fix the geometric types, before building more on them.
They fail to serve the only purpose they are useful for, demonstrating
features.

agree, but it's not a deal for this patch

It think the opclass should support the cross data type operators like
box @> point. Ideally we should do it by using multiple opclasses in
an opfamily. The floating problem will bite us again in here, because
some of the operators are not using FP macros.

Again, agree. But I suggest to do it by separate patch.

The tests check not supported operator "@". It should be "<@". It is
nice that the opclass doesn't support long deprecated operators.

fixed

+ #define LT      -1
+ #define GT       1
+ #define EQ       0

Using these numbers is a very well established pattern. I don't think
macros make the code any more readable.

fixed

+ /* SP-GiST API functions */
+ Datum       spg_box_quad_config(PG_FUNCTION_ARGS);
+ Datum       spg_box_quad_choose(PG_FUNCTION_ARGS);
+ Datum       spg_box_quad_picksplit(PG_FUNCTION_ARGS);
+ Datum       spg_box_quad_inner_consistent(PG_FUNCTION_ARGS);
+ Datum       spg_box_quad_leaf_consistent(PG_FUNCTION_ARGS);

I guess they should go to the header file.

Remove them because they are presented in auto-generated file
./src/include/utils/builtins.h

range patch is unchanged, but still attached to keep them altogether.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

Attachments:

q4d-2.patch.gzapplication/x-gzip; name=q4d-2.patch.gzDownload
traversalValue-2.patch.gzapplication/x-gzip; name=traversalValue-2.patch.gzDownload
range-1.patch.gzapplication/x-gzip; name=range-1.patch.gzDownload
#10Emre Hasegeli
emre@hasegeli.com
In reply to: Teodor Sigaev (#9)
Re: [PATCH] we have added support for box type in SP-GiST index

Here are my comments about the operator class implementation:

+ * implementation of quad-4d tree over boxes for SP-GiST.

Isn't the whole thing actually 3D?

+ * For example consider the case of intersection.

No need for a new line after this.

+ * A quadrant has bounds, but sp-gist keeps only 4-d point (box) in inner nodes.

I couldn't get the term 4D point. Maybe it means that we are using
box datatype as the prefix, but we are not treating them as boxes.

+ * We use traversalValue to calculate quadrant bounds from parent's quadrant
+ * bounds.

I couldn't understand this sentence. We are using the parent's prefix
and the quadrant to generate the traversalValue. I think this is the
crucial part of the opclass. It would be nice to explain it more
clearly.

+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group

2016.

+ * src/backend/utils/adt/boxtype_spgist.c

Maybe we should name this file as geo_spgist.c to support other types
in the future.

+ #include "utils/builtins.h";

Extra ;.

+ #include "utils/datum.h"

I think this is unnecessary.

+ /* InfR type implements doubles and +- infinity */
+ typedef struct
+ {
+    int         infFlag;
+    double      val;
+ }   InfR;

Why do we need this? Can't we store infinity in "double"?

+ /* wrap double to InfR */
+ static InfR
+ toInfR(double v, InfR * r)
+ {
+    r->infFlag = NotInf;
+    r->val = v;
+ }

This isn't returning the value.

+ typedef struct
+ {
+    Range       range_x;
+    Range       range_y;
+ }   Rectangle;

Wouldn't it be simpler to just using BOX instead of this?

+ static void
+ evalIRangeBox(const IRangeBox *range_box, const Range *range, const int half1,
+              const int half2, IRangeBox *new_range_box)
+ static void
+ evalIRectBox(const IRectBox *rect_box, const Rectangle *centroid,
+             const uint8 quadrant, IRectBox * new_rect_box)
+ inline static void
+ initializeUnboundedBox(IRectBox * rect_box)

Wouldn't it be simpler to palloc and return the value on those functions?

+ static int
+ intersect2D(const Range * range, const IRangeBox * range_box)

Wouldn't it be better to return "bool" on those functions.

+ return ((p1 >= 0) && (p2 <= 0));

Extra parentheses.

+ i    const spgChooseIn *in = (spgChooseIn *) PG_GETARG_POINTER(0);
+ i    spgChooseOut *out = (spgChooseOut *) PG_GETARG_POINTER(1);

Many variables are defined with "const". I am not sure they are all
that doesn't change. If it applies to the pointer, "out" should also
not change in here. Am I wrong?

+    /*
+     * Begin. This block evaluates the median of coordinates of boxes
+     */

I would rather explain what the function does on the function header.

+ memcpy(new_rect_box, rect_box, sizeof(IRectBox));

Do we really need to copy the traversalValues on allTheSame case.
Wouldn't it work if just the same value is passed for all of them.
The search shouldn't continue after allTheSame case.

+ for (i = 0; flag && i < in->nkeys && flag; i++)

It checks flag two times.

+            boxPointerToRectangle(
+                DatumGetBoxP(in->scankeys[i].sk_argument),
+                p_query_rect
+            );

I don't think this matches the project coding style.

+ int flag = 1,

Wouldn't it be better as "bool"?

The regression tests for the remaining operators are still missing.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Teodor Sigaev
teodor@sigaev.ru
In reply to: Emre Hasegeli (#10)
Re: [PATCH] we have added support for box type in SP-GiST index

+ * implementation of quad-4d tree over boxes for SP-GiST.

Isn't the whole thing actually 3D?

No. The idea if this work is a representation of 2d box as 4d point. Quad means
quadrant of 2d plane. Originally such kind of tree was developed for 2d point,
and each node of tree splits plane on 4 quadrant. For 3d tree each node splits
space for 8 "quadrants", 4d - 16 ones.

+ * For example consider the case of intersection.

No need for a new line after this.

fixed

+ * A quadrant has bounds, but sp-gist keeps only 4-d point (box) in inner nodes.

I couldn't get the term 4D point. Maybe it means that we are using
box datatype as the prefix, but we are not treating them as boxes.

exactly, we treat boxes as 4-dimentional points.

+ * We use traversalValue to calculate quadrant bounds from parent's quadrant
+ * bounds.

I couldn't understand this sentence. We are using the parent's prefix
and the quadrant to generate the traversalValue. I think this is the
crucial part of the opclass. It would be nice to explain it more
clearly.

I'll try to explain with two-dimensional example over points. ASCII-art:
|
|
1 | 2
|
-----------+-------------
|P
3 | 4
|
|

'+' with 'A' represents a centroid or, other words, point which splits plane for
4 quadrants and it strorend in parent node. 1,2,3,4 are labels of quadrants,
each labeling will be the same for all pictures and all centriods, and i will
not show them in pictures later to prevent too complicated images. Let we add C
- child node (and again, it will split plane for 4 quads):

| |
----+---------+---
X |B |C
| |
-----------+---------+---
|P |A
| |
|
|

A and B are points of intersection of lines. So, box PBCAis a bounding box for
points contained in 3-rd (see labeling above). For example X labeled point is
not a descendace of child node with centroid C because it must be in branch of
1-st quad of parent node. So, each node (except root) will have a limitation in
its quadrant. To transfer that limitation the traversalValue is used.

To keep formatting I attached a txt file with this fragment.

+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group

2016.

fixed

+ * src/backend/utils/adt/boxtype_spgist.c

Maybe we should name this file as geo_spgist.c to support other types
in the future.

done

+ #include "utils/builtins.h";

Extra ;.

fixed

+ #include "utils/datum.h"

I think this is unnecessary.

removed

+ /* InfR type implements doubles and +- infinity */
+ typedef struct
+ {
+    int         infFlag;
+    double      val;
+ }   InfR;

Why do we need this? Can't we store infinity in "double"?

Hmm, you are right. fixed.

This isn't returning the value.

removed

+ typedef struct
+ {
+    Range       range_x;
+    Range       range_y;
+ }   Rectangle;

Wouldn't it be simpler to just using BOX instead of this?

Too much the same names in RectBox

+ static void
+ evalIRangeBox(const IRangeBox *range_box, const Range *range, const int half1,
+              const int half2, IRangeBox *new_range_box)
+ static void
+ evalIRectBox(const IRectBox *rect_box, const Rectangle *centroid,
+             const uint8 quadrant, IRectBox * new_rect_box)
+ inline static void
+ initializeUnboundedBox(IRectBox * rect_box)

Wouldn't it be simpler to palloc and return the value on those functions?

evalRangeBox() initializes part of RectBox, so, it could not palloc its result.
Other methods use the same signature just for consistency.

+ static int
+ intersect2D(const Range * range, const IRangeBox * range_box)

Wouldn't it be better to return "bool" on those functions.

done

+ return ((p1 >= 0) && (p2 <= 0));

Extra parentheses.

removed

+ i    const spgChooseIn *in = (spgChooseIn *) PG_GETARG_POINTER(0);
+ i    spgChooseOut *out = (spgChooseOut *) PG_GETARG_POINTER(1);

Many variables are defined with "const". I am not sure they are all
that doesn't change. If it applies to the pointer, "out" should also
not change in here. Am I wrong?

No, changes

+    /*
+     * Begin. This block evaluates the median of coordinates of boxes
+     */

I would rather explain what the function does on the function header.

fixed

+ memcpy(new_rect_box, rect_box, sizeof(IRectBox));

Do we really need to copy the traversalValues on allTheSame case.
Wouldn't it work if just the same value is passed for all of them.
The search shouldn't continue after allTheSame case.

Seems, yes. spgist tree could contain a locng branches with allTheSame.

+ for (i = 0; flag && i < in->nkeys && flag; i++)

It checks flag two times.

fixed

+            boxPointerToRectangle(
+                DatumGetBoxP(in->scankeys[i].sk_argument),
+                p_query_rect
+            );

I don't think this matches the project coding style.

fixed

+ int flag = 1,

Wouldn't it be better as "bool"?

fixed.

The regression tests for the remaining operators are still missing.

Ooops, will be soon.

Attached all patches to simplify review. Thank you very much!

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

Attachments:

art.txttext/plain; charset=UTF-8; name=art.txtDownload
traversalValue-2.patch.gzapplication/x-gzip; name=traversalValue-2.patch.gzDownload
q4d-3.patch.gzapplication/x-gzip; name=q4d-3.patch.gzDownload
range-1.patch.gzapplication/x-gzip; name=range-1.patch.gzDownload
#12Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Teodor Sigaev (#11)
Re: [PATCH] we have added support for box type in SP-GiST index

On Mon, Mar 21, 2016 at 11:57 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:

I couldn't get the term 4D point. Maybe it means that we are
using box datatype as the prefix, but we are not treating them
as boxes.

exactly, we treat boxes as 4-dimentional points.

I'm not entirely sure I understand the terminology either, since I
tend to think of a *point* as having *zero* dimensions. Would it
perhaps be more accurate to say we are treating a 2-dimensional box
as a point in 4-dimensional space?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Kevin Grittner (#12)
Re: [PATCH] we have added support for box type in SP-GiST index

On 21 Mar 2016, at 22:38, Kevin Grittner <kgrittn@gmail.com> wrote:

On Mon, Mar 21, 2016 at 11:57 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:

I couldn't get the term 4D point. Maybe it means that we are
using box datatype as the prefix, but we are not treating them
as boxes.

exactly, we treat boxes as 4-dimentional points.

I'm not entirely sure I understand the terminology either, since I
tend to think of a *point* as having *zero* dimensions. Would it
perhaps be more accurate to say we are treating a 2-dimensional box
as a point in 4-dimensional space?

Or just say 4-d vector instead of 4-d point. Look like it will be enough rigorous.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Teodor Sigaev (#11)
Re: [PATCH] we have added support for box type in SP-GiST index

On 3/21/16 11:57 AM, Teodor Sigaev wrote:

A and B are points of intersection of lines. So, box PBCAis a bounding
box for points contained in 3-rd (see labeling above). For example X
labeled point is not a descendace of child node with centroid C because
it must be in branch of 1-st quad of parent node. So, each node (except
root) will have a limitation in its quadrant. To transfer that
limitation the traversalValue is used.

Isn't this basically the same thing that the cube contrib module does?
(Which has the added benefit of kNN-capable operators).

If that's true then ISTM it'd be better to work on pulling cube's
features into box?

If it's not true, I'm still wondering if there's enough commonality here
that we should pull cube into core...
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Stas Kelvich
s.kelvich@postgrespro.ru
In reply to: Jim Nasby (#14)
Re: [PATCH] we have added support for box type in SP-GiST index

On 22 Mar 2016, at 01:47, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote:

On 3/21/16 11:57 AM, Teodor Sigaev wrote:

A and B are points of intersection of lines. So, box PBCAis a bounding
box for points contained in 3-rd (see labeling above). For example X
labeled point is not a descendace of child node with centroid C because
it must be in branch of 1-st quad of parent node. So, each node (except
root) will have a limitation in its quadrant. To transfer that
limitation the traversalValue is used.

Isn't this basically the same thing that the cube contrib module does? (Which has the added benefit of kNN-capable operators).

Cube and postgres own geometric types are indexed by building R-tree. While this is one of the most popular solutions it
degrades almost to linear search when bounding boxes for each index node overlaps a lot. This problem will arise when one will
try to index streets in the city with grid system — a lot of street's bounding boxes will just coincide with bounding box for whole city.

But that patch use totally different strategy for building index and do not suffer from above problem.

If that's true then ISTM it'd be better to work on pulling cube's features into box?

If it's not true, I'm still wondering if there's enough commonality here that we should pull cube into core…

There is also intarray module with very common functionality, but it also addressing different data pattern.

Optimal indexing and kNN strategy are very sensible to the data itself and if we want some general multidimensional search routines,
then I think none of the mentioned extensions is general enough. Cube barely applicable when dimensions number is higher than 10-20,
intarray performs bad on data with big sets of possible coordinates, this patch is also intended to help with specific, niche problem.

While people tends to use machine learning and regressions models more and more it is interesting to have some general n-dim indexing with kNN,
but I think it is different problem and should be solved in a different way.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Stas Kelvich (#15)
Re: [PATCH] we have added support for box type in SP-GiST index

On 3/21/16 7:41 PM, Stas Kelvich wrote:

While people tends to use machine learning and regressions models more and more it is interesting to have some general n-dim indexing with kNN,
but I think it is different problem and should be solved in a different way.

I think one of the issues here is it's not very clear to someone without
a good amount of ML knowledge how these things relate. I hear "box' and
'cube' and think it's just a 2D vs 3D issue, and intarray isn't even on
the radar.

Maybe what's needed are actual vector and matrix types?

In any case, if you've got a good reason why box and cube should stay
separate then further discussion should happen in another thread.

BTW, if you haven't seen it, take a look at http://madlib.apache.org/
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Teodor Sigaev
teodor@sigaev.ru
In reply to: Kevin Grittner (#12)
Re: [PATCH] we have added support for box type in SP-GiST index

tend to think of a *point* as having *zero* dimensions. Would it
perhaps be more accurate to say we are treating a 2-dimensional box
as a point in 4-dimensional space?

Exactly, sorry for ambiguity.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Teodor Sigaev
teodor@sigaev.ru
In reply to: Jim Nasby (#14)
Re: [PATCH] we have added support for box type in SP-GiST index

Isn't this basically the same thing that the cube contrib module does? (Which
has the added benefit of kNN-capable operators).

No, cube module introduces new type - N-dimensional box. And adds an index
support for it.

Current patch suggests non-traditional indexing technique for 2D boxes by
treating them as point in 4D space. With such representation it's became
possible to use quad-tree technique which:
1 supports only points to index
2 already supported by SP-GiST

Such technique provides some benefits in comparance with traditional R-Tree
which implemented in pg with a help GiST. See Oleg's message.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Emre Hasegeli
emre@hasegeli.com
In reply to: Teodor Sigaev (#11)
Re: [PATCH] we have added support for box type in SP-GiST index

+ * boxtype_spgist.c

The names on the file header need to be changed, too.

I'll try to explain with two-dimensional example over points. ASCII-art:
|
|
1 | 2
|
-----------+-------------
|P
3 | 4
|
|

'+' with 'A' represents a centroid or, other words, point which splits plane
for 4 quadrants and it strorend in parent node. 1,2,3,4 are labels of
quadrants, each labeling will be the same for all pictures and all
centriods, and i will not show them in pictures later to prevent too
complicated images. Let we add C - child node (and again, it will split
plane for 4 quads):

| |
----+---------+---
X |B |C
| |
-----------+---------+---
|P |A
| |
|
|

A and B are points of intersection of lines. So, box PBCAis a bounding box
for points contained in 3-rd (see labeling above). For example X labeled
point is not a descendace of child node with centroid C because it must be
in branch of 1-st quad of parent node. So, each node (except root) will have
a limitation in its quadrant. To transfer that limitation the traversalValue
is used.

To keep formatting I attached a txt file with this fragment.

Thank you for the explanation. Should we incorporate this with the patch.

+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development
Group

2016.

fixed

Not on the patch.

+ cmp_double(const double a, const double b)

Does this function necessary? We can just compare the doubles.

+ boxPointerToRangeBox(BOX *box, RangeBox * rectangle)

The "rectangle" variable in here should be renamed.

+ Assert(is_infinite(b) == 0);

This is failing on my test. You can reproduce with the script I have sent.

Wouldn't it be simpler to palloc and return the value on those functions?

evalRangeBox() initializes part of RectBox, so, it could not palloc its
result.
Other methods use the same signature just for consistency.

I couldn't understand it. evalRangeBox() can palloc and return the
result. evalRectBox() can return the result palloc'ed by
evalRangeBox(). The caller wouldn't need to palloc anything.

Many variables are defined with "const". I am not sure they are all
that doesn't change. If it applies to the pointer, "out" should also
not change in here. Am I wrong?

No, changes

I now read about "const". I am sorry for not being experienced in C.
The new version of the patch marks them as "const" by mistake.

I went through all other variables:

+ int r = is_infinite(a);

+ double x = *(double *) a;
+ double y = *(double *) b;

+ spgInnerConsistentIn *in = (spgInnerConsistentIn *) PG_GETARG_POINTER(0);

+ spgLeafConsistentIn *in = (spgLeafConsistentIn *) PG_GETARG_POINTER(0);

+ BOX *leafBox = DatumGetBoxP(in->leafDatum);

Shouldn't they be "const", too?

+    /*
+     * Begin. This block evaluates the median of coordinates of boxes
+     */

I would rather explain what the function does on the function header.

fixed

The "end" part of it is still there.

Do we really need to copy the traversalValues on allTheSame case.
Wouldn't it work if just the same value is passed for all of them.
The search shouldn't continue after allTheSame case.

Seems, yes. spgist tree could contain a locng branches with allTheSame.

Does SP-GiST allows any node under allTheSame to not being allTheSame?
Not setting traversalValues for allTheSame worked fine with my test.

+ if (in->allTheSame)

Most of the things happening before this check is not used in there.
Shouldn't we move this to the top of the function?

+ out->nodeNumbers = (int *) palloc(sizeof(int) * in->nNodes);

This could go before allTheSame block.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Teodor Sigaev
teodor@sigaev.ru
In reply to: Emre Hasegeli (#19)
Re: [PATCH] we have added support for box type in SP-GiST index

+ * boxtype_spgist.c

The names on the file header need to be changed, too.

Oops. fixed

I'll try to explain with two-dimensional example over points. ASCII-art:

Thank you for the explanation. Should we incorporate this with the patch.

added

+ cmp_double(const double a, const double b)

Does this function necessary? We can just compare the doubles.

Yes, it compares with limited precision as it does by geometry operations.
Renamed to point actual arguments.

+ boxPointerToRangeBox(BOX *box, RangeBox * rectangle)

The "rectangle" variable in here should be renamed.

fixed

+ Assert(is_infinite(b) == 0);

This is failing on my test. You can reproduce with the script I have sent.

I didn't know:
# select '(1,inf)'::point;
point
---------
(1,inf)

fixed

Wouldn't it be simpler to palloc and return the value on those functions?

evalRangeBox() initializes part of RectBox, so, it could not palloc its
result.
Other methods use the same signature just for consistency.

I couldn't understand it. evalRangeBox() can palloc and return the
result. evalRectBox() can return the result palloc'ed by
evalRangeBox(). The caller wouldn't need to palloc anything.

evalRangeBox() is used to initialize fields of RangeBox in evalRectBox(). If
evalRangeBox() will palloc its result then we need to copy its result into
RangeBox struct and free. Let both fucntion use the same interface.

I went through all other variables:

+     int r = is_infinite(a);
+     double      x = *(double *) a;
+     double      y = *(double *) b;
+     spgInnerConsistentIn *in = (spgInnerConsistentIn *) PG_GETARG_POINTER(0);
+     spgLeafConsistentIn *in = (spgLeafConsistentIn *) PG_GETARG_POINTER(0);
+     BOX        *leafBox = DatumGetBoxP(in->leafDatum);

Shouldn't they be "const", too?

They could. But it doesn't required. To be in consistent state I've removed
const modifier where possible

+    /*
+     * Begin. This block evaluates the median of coordinates of boxes
+     */

I would rather explain what the function does on the function header.

fixed

The "end" part of it is still there.

Oops again, fixed

Do we really need to copy the traversalValues on allTheSame case.
Wouldn't it work if just the same value is passed for all of them.
The search shouldn't continue after allTheSame case.

Seems, yes. spgist tree could contain a locng branches with allTheSame.

Does SP-GiST allows any node under allTheSame to not being allTheSame?

No, SP-GiST doesn't allow that

Not setting traversalValues for allTheSame worked fine with my test.

it works until allthesame branch contains only one inner node. Otherwise
traversalValue will be freeed several times, see spgWalk().
# select i as id, '1,2,3,4'::box as b into x from generate_series(1,1000000) i;
# create index ix on i using spgist (b);
# select count(*) from x where b && '1,2,3,4'::box; -- coredump
gdb:
#0 0x000000080143564a in thr_kill () from /lib/libc.so.7
#1 0x0000000801435636 in raise () from /lib/libc.so.7
#2 0x00000008014355b9 in abort () from /lib/libc.so.7
#3 0x0000000000a80739 in in_error_recursion_trouble () at elog.c:199
#4 0x0000000000abb748 in pfree (pointer=0x801e90868) at mcxt.c:1016
#5 0x000000000053330c in freeScanStackEntry (so=0x801e8d358,
stackEntry=0x801e935d8) at spgscan.c:47
#6 0x0000000000532cdb in spgWalk (index=0x801f1c588, so=0x801e8d358,
scanWholeIndex=1 '\001', storeRes=0x532d10 <storeBitm
...

+ if (in->allTheSame)

Most of the things happening before this check is not used in there.
Shouldn't we move this to the top of the function?

yep, fixed

+ out->nodeNumbers = (int *) palloc(sizeof(int) * in->nNodes);

This could go before allTheSame block.

yep, fixed

I've attached all patches again. Thank you very much!

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

Attachments:

q4d-4.patch.gzapplication/x-gzip; name=q4d-4.patch.gzDownload
traversalValue-2.patch.gzapplication/x-gzip; name=traversalValue-2.patch.gzDownload
range-1.patch.gzapplication/x-gzip; name=range-1.patch.gzDownload
#21Emre Hasegeli
emre@hasegeli.com
In reply to: Teodor Sigaev (#20)
#22Teodor Sigaev
teodor@sigaev.ru
In reply to: Emre Hasegeli (#21)
#23Emre Hasegeli
emre@hasegeli.com
In reply to: Teodor Sigaev (#22)
#24Bruce Momjian
bruce@momjian.us
In reply to: Emre Hasegeli (#23)