KNN-GiST with recheck
Hackers!
This patch was split from thread:
/messages/by-id/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
I've split it to separate thead, because it's related to partial sort only
conceptually not technically. Also I renamed it to "knn-gist-recheck" from
"partial-knn" as more appropriate name. In the attached version docs are
updated. Possible weak point of this patch design is that it fetches heap
tuple from GiST scan. However, I didn't receive any notes about its design,
so, I'm going to put it to commitfest.
Here goes a desription of this patch same as in original thread.
KNN-GiST provides ability to get ordered results from index, but this order
is based only on index information. For instance, GiST index contains
bounding rectangles for polygons, and we can't get exact distance to
polygon from index (similar situation is in PostGIS). In attached patch,
GiST distance method can set recheck flag (similar to consistent method).
This flag means that distance method returned lower bound of distance and
we should recheck it from heap.
See an example.
create table test as (select id, polygon(3+(random()*10)::int,
circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
generate_series(1,1000000) id);
create index test_idx on test using gist (p);
We can get results ordered by distance from polygon to point.
postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
point(0.5,0.5) limit 10;
id | ?column?
--------+----------------------
755611 | 0.000405855808916853
807562 | 0.000464123777564343
437778 | 0.000738524708741959
947860 | 0.00076250998760724
389843 | 0.000886362723569568
17586 | 0.000981960100555216
411329 | 0.00145338112316853
894191 | 0.00149399559703506
391907 | 0.0016647896049741
235381 | 0.00167554614889509
(10 rows)
It's fast using just index scan.
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
rows=10 loops=1)
-> Index Scan using test_idx on test (cost=0.29..157672.29
rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
Order By: (p <-> '(0.5,0.5)'::point)
Total runtime: 0.305 ms
(4 rows)
------
With best regards,
Alexander Korotkov.
Attachments:
On 01/13/2014 07:17 PM, Alexander Korotkov wrote:
Here goes a desription of this patch same as in original thread.
KNN-GiST provides ability to get ordered results from index, but this order
is based only on index information. For instance, GiST index contains
bounding rectangles for polygons, and we can't get exact distance to
polygon from index (similar situation is in PostGIS). In attached patch,
GiST distance method can set recheck flag (similar to consistent method).
This flag means that distance method returned lower bound of distance and
we should recheck it from heap.See an example.
create table test as (select id, polygon(3+(random()*10)::int,
circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
generate_series(1,1000000) id);
create index test_idx on test using gist (p);We can get results ordered by distance from polygon to point.
postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
point(0.5,0.5) limit 10;
id | ?column?
--------+----------------------
755611 | 0.000405855808916853
807562 | 0.000464123777564343
437778 | 0.000738524708741959
947860 | 0.00076250998760724
389843 | 0.000886362723569568
17586 | 0.000981960100555216
411329 | 0.00145338112316853
894191 | 0.00149399559703506
391907 | 0.0016647896049741
235381 | 0.00167554614889509
(10 rows)It's fast using just index scan.
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
rows=10 loops=1)
-> Index Scan using test_idx on test (cost=0.29..157672.29
rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
Order By: (p <-> '(0.5,0.5)'::point)
Total runtime: 0.305 ms
(4 rows)
Nice! Some thoughts:
1. This patch introduces a new "polygon <-> point" operator. That seems
useful on its own, with or without this patch.
2. I wonder how useful it really is to allow mixing exact and non-exact
return values from the distance function. The distance function included
in the patch always returns recheck=true. I have a feeling that all
other distance function will also always return either true or false.
3. A binary heap would be a better data structure to buffer the
rechecked values. A Red-Black tree allows random insertions and
deletions, but in this case you need to insert arbitrary values but only
remove the minimum item. That's exactly what a binary heap excels at. We
have a nice binary heap implementation in the backend that you can use,
see src/backend/lib/binaryheap.c.
4. (as you mentioned in the other thread: ) It's a modularity violation
that you peek into the heap tuple from gist. I think the proper way to
do this would be to extend the IndexScan executor node to perform the
re-shuffling of tuples that come from the index in wrong order, or
perhaps add a new node type for it.
Of course that's exactly what your partial sort patch does :-). I
haven't looked at that in detail, but I don't think the approach the
partial sort patch takes will work here as is. In the KNN-GiST case, the
index is returning tuples roughly in the right order, but a tuple that
it returns might in reality belong somewhere later in the ordering. In
the partial sort patch, the "input stream" of tuples is divided into
non-overlapping groups, so that the tuples within the group are not
sorted, but the groups are. I think the partial sort case is a special
case of the KNN-GiST case, if you consider the lower bound of each tuple
to be the leading keys that you don't need to sort.
BTW, this capability might also be highly useful for the min/max indexes
as well. A min/max index cannot return an exact ordering of tuples, but
it can also give a lower bound for a group of tuples.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Jan 28, 2014 at 5:54 PM, Heikki Linnakangas <hlinnakangas@vmware.com
wrote:
On 01/13/2014 07:17 PM, Alexander Korotkov wrote:
Here goes a desription of this patch same as in original thread.
KNN-GiST provides ability to get ordered results from index, but this
order
is based only on index information. For instance, GiST index contains
bounding rectangles for polygons, and we can't get exact distance to
polygon from index (similar situation is in PostGIS). In attached patch,
GiST distance method can set recheck flag (similar to consistent method).
This flag means that distance method returned lower bound of distance and
we should recheck it from heap.See an example.
create table test as (select id, polygon(3+(random()*10)::int,
circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
generate_series(1,1000000) id);
create index test_idx on test using gist (p);We can get results ordered by distance from polygon to point.
postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
point(0.5,0.5) limit 10;
id | ?column?
--------+----------------------
755611 | 0.000405855808916853
807562 | 0.000464123777564343
437778 | 0.000738524708741959
947860 | 0.00076250998760724
389843 | 0.000886362723569568
17586 | 0.000981960100555216
411329 | 0.00145338112316853
894191 | 0.00149399559703506
391907 | 0.0016647896049741
235381 | 0.00167554614889509
(10 rows)It's fast using just index scan.
QUERY PLAN
------------------------------------------------------------
----------------------------------------------------------------------
Limit (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
rows=10 loops=1)
-> Index Scan using test_idx on test (cost=0.29..157672.29
rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
Order By: (p <-> '(0.5,0.5)'::point)
Total runtime: 0.305 ms
(4 rows)Nice! Some thoughts:
1. This patch introduces a new "polygon <-> point" operator. That seems
useful on its own, with or without this patch.
Yeah, but exact-knn cant come with no one implementation. But it would
better come in a separate patch.
2. I wonder how useful it really is to allow mixing exact and non-exact
return values from the distance function. The distance function included in
the patch always returns recheck=true. I have a feeling that all other
distance function will also always return either true or false.
For geometrical datatypes recheck variations in consistent methods are also
very rare (I can't remember any). But imagine opclass for arrays where keys
have different representation depending on array length. For such opclass
and knn on similarity recheck flag could be useful.
3. A binary heap would be a better data structure to buffer the rechecked
values. A Red-Black tree allows random insertions and deletions, but in
this case you need to insert arbitrary values but only remove the minimum
item. That's exactly what a binary heap excels at. We have a nice binary
heap implementation in the backend that you can use, see
src/backend/lib/binaryheap.c.
Hmm. For me binary heap would be a better data structure for KNN-GiST at
all :-)
4. (as you mentioned in the other thread: ) It's a modularity violation
that you peek into the heap tuple from gist. I think the proper way to do
this would be to extend the IndexScan executor node to perform the
re-shuffling of tuples that come from the index in wrong order, or perhaps
add a new node type for it.Of course that's exactly what your partial sort patch does :-). I haven't
looked at that in detail, but I don't think the approach the partial sort
patch takes will work here as is. In the KNN-GiST case, the index is
returning tuples roughly in the right order, but a tuple that it returns
might in reality belong somewhere later in the ordering. In the partial
sort patch, the "input stream" of tuples is divided into non-overlapping
groups, so that the tuples within the group are not sorted, but the groups
are. I think the partial sort case is a special case of the KNN-GiST case,
if you consider the lower bound of each tuple to be the leading keys that
you don't need to sort.
Yes. But, for instance btree accesses heap for unique checking. Is really
it so crimilal? :-)
This is not only question of a new node or extending existing node. We need
to teach planner/executor access method can return value of some expression
which is lower bound of another expression. AFICS now access method can
return only original indexed datums and TIDs. So, I afraid that enormous
infrastructure changes are required. And I can hardly imagine what they
should look like.
------
With best regards,
Alexander Korotkov.
On 01/28/2014 04:12 PM, Alexander Korotkov wrote:
On Tue, Jan 28, 2014 at 5:54 PM, Heikki Linnakangas <hlinnakangas@vmware.com
wrote:
4. (as you mentioned in the other thread: ) It's a modularity violation
that you peek into the heap tuple from gist. I think the proper way to do
this would be to extend the IndexScan executor node to perform the
re-shuffling of tuples that come from the index in wrong order, or perhaps
add a new node type for it.Of course that's exactly what your partial sort patch does :-). I haven't
looked at that in detail, but I don't think the approach the partial sort
patch takes will work here as is. In the KNN-GiST case, the index is
returning tuples roughly in the right order, but a tuple that it returns
might in reality belong somewhere later in the ordering. In the partial
sort patch, the "input stream" of tuples is divided into non-overlapping
groups, so that the tuples within the group are not sorted, but the groups
are. I think the partial sort case is a special case of the KNN-GiST case,
if you consider the lower bound of each tuple to be the leading keys that
you don't need to sort.Yes. But, for instance btree accesses heap for unique checking. Is really
it so crimilal? :-)
Well, it is generally considered an ugly hack in b-tree too. I'm not
100% opposed to doing such a hack in GiST, but would very much prefer
not to.
This is not only question of a new node or extending existing node. We need
to teach planner/executor access method can return value of some expression
which is lower bound of another expression. AFICS now access method can
return only original indexed datums and TIDs. So, I afraid that enormous
infrastructure changes are required. And I can hardly imagine what they
should look like.
Yeah, I'm not sure either. Maybe a new field in IndexScanDesc, along
with xs_itup. Or as an attribute of xs_itup itself.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Jan 28, 2014 at 9:32 PM, Heikki Linnakangas <hlinnakangas@vmware.com
wrote:
On 01/28/2014 04:12 PM, Alexander Korotkov wrote:
On Tue, Jan 28, 2014 at 5:54 PM, Heikki Linnakangas <
hlinnakangas@vmware.comwrote:
4. (as you mentioned in the other thread: ) It's a modularity violation
that you peek into the heap tuple from gist. I think the proper way to do
this would be to extend the IndexScan executor node to perform the
re-shuffling of tuples that come from the index in wrong order, or
perhaps
add a new node type for it.Of course that's exactly what your partial sort patch does :-). I haven't
looked at that in detail, but I don't think the approach the partial sort
patch takes will work here as is. In the KNN-GiST case, the index is
returning tuples roughly in the right order, but a tuple that it returns
might in reality belong somewhere later in the ordering. In the partial
sort patch, the "input stream" of tuples is divided into non-overlapping
groups, so that the tuples within the group are not sorted, but the
groups
are. I think the partial sort case is a special case of the KNN-GiST
case,
if you consider the lower bound of each tuple to be the leading keys that
you don't need to sort.Yes. But, for instance btree accesses heap for unique checking. Is really
it so crimilal? :-)Well, it is generally considered an ugly hack in b-tree too. I'm not 100%
opposed to doing such a hack in GiST, but would very much prefer not to.This is not only question of a new node or extending existing node. We
need
to teach planner/executor access method can return value of some
expression
which is lower bound of another expression. AFICS now access method can
return only original indexed datums and TIDs. So, I afraid that enormous
infrastructure changes are required. And I can hardly imagine what they
should look like.Yeah, I'm not sure either. Maybe a new field in IndexScanDesc, along with
xs_itup. Or as an attribute of xs_itup itself.
This shouldn't look like a hack too. Otherwise I see no point of it: it's
better to have some isolated hack in access method than hack in
planner/executor. So I see following changes to be needed to implement this
right way:
1) Implement new relation between operators: operator1 is lower bound of
operator2.
2) Extend am interface to let it return values of operators.
3) Implement new node for knn-sorting.
However, it requires a lot of changes in PostgreSQL infrastructure and can
appear to be not enough general too (we don't know until we have another
application).
------
With best regards,
Alexander Korotkov.
1. This patch introduces a new "polygon <-> point" operator. That seems
useful on its own, with or without this patch.Yeah, but exact-knn cant come with no one implementation. But it would
better come in a separate patch.
I tried to split them. Separated patches are attached. I changed
the order of the arguments as point <-> polygon, because point was
the first one on all the others. Its commutator was required for
the index, so I added it on the second patch. I also added tests
for the operator. I think it is ready for committer as a separate
patch. We can add it to the open CommitFest.
I have made some cosmetic changes on the patches. I hope they are
useful.
I added support to point <-> circle operator with the same GiST
distance function you added for polygon. I can change it, if it is not
the right way.
2. I wonder how useful it really is to allow mixing exact and non-exact
return values from the distance function. The distance function included in
the patch always returns recheck=true. I have a feeling that all other
distance function will also always return either true or false.For geometrical datatypes recheck variations in consistent methods are also
very rare (I can't remember any). But imagine opclass for arrays where keys
have different representation depending on array length. For such opclass
and knn on similarity recheck flag could be useful.
I also wonder how useful it is. Your example is convincing, but maybe
setting it index-wide will make the decisions on the framework easier.
For example, how hard would it be to decide if further sorting is
required or not on the planner?
4. (as you mentioned in the other thread: ) It's a modularity violation
that you peek into the heap tuple from gist. I think the proper way to do
this would be to extend the IndexScan executor node to perform the
re-shuffling of tuples that come from the index in wrong order, or perhaps
add a new node type for it.Of course that's exactly what your partial sort patch does :-). I haven't
looked at that in detail, but I don't think the approach the partial sort
patch takes will work here as is. In the KNN-GiST case, the index is
returning tuples roughly in the right order, but a tuple that it returns
might in reality belong somewhere later in the ordering. In the partial
sort patch, the "input stream" of tuples is divided into non-overlapping
groups, so that the tuples within the group are not sorted, but the groups
are. I think the partial sort case is a special case of the KNN-GiST case,
if you consider the lower bound of each tuple to be the leading keys that
you don't need to sort.Yes. But, for instance btree accesses heap for unique checking. Is really
it so crimilal? :-)
This is not only question of a new node or extending existing node. We need
to teach planner/executor access method can return value of some expression
which is lower bound of another expression. AFICS now access method can
return only original indexed datums and TIDs. So, I afraid that enormous
infrastructure changes are required. And I can hardly imagine what they
should look like.
Unfortunately, I am not experienced enough to judge your implementation.
As far as I understand the problem is partially sorting rows on
the index scan node. It can lead the planner to choose non-optimal
plans, because of not taking into account the cost of sorting.
On Sun, Aug 3, 2014 at 5:48 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
1. This patch introduces a new "polygon <-> point" operator. That seems
useful on its own, with or without this patch.Yeah, but exact-knn cant come with no one implementation. But it would
better come in a separate patch.I tried to split them. Separated patches are attached. I changed
the order of the arguments as point <-> polygon, because point was
the first one on all the others. Its commutator was required for
the index, so I added it on the second patch. I also added tests
for the operator. I think it is ready for committer as a separate
patch. We can add it to the open CommitFest.I have made some cosmetic changes on the patches. I hope they are
useful.I added support to point <-> circle operator with the same GiST
distance function you added for polygon. I can change it, if it is not
the right way.
Great, thanks!
2. I wonder how useful it really is to allow mixing exact and non-exact
return values from the distance function. The distance function
included in
the patch always returns recheck=true. I have a feeling that all other
distance function will also always return either true or false.For geometrical datatypes recheck variations in consistent methods are
also
very rare (I can't remember any). But imagine opclass for arrays where
keys
have different representation depending on array length. For such opclass
and knn on similarity recheck flag could be useful.I also wonder how useful it is. Your example is convincing, but maybe
setting it index-wide will make the decisions on the framework easier.
For example, how hard would it be to decide if further sorting is
required or not on the planner?
I think that for most use cases just some operators require further sorting
and some of them not. But it could appear one day that some index gives
part of its knn answers exact and part of them inexact. Same happen to
recheck of regular operators. Initially recheck flag was defined in
opclass. But later recheck became runtime flag.
4. (as you mentioned in the other thread: ) It's a modularity violation
that you peek into the heap tuple from gist. I think the proper way to
do
this would be to extend the IndexScan executor node to perform the
re-shuffling of tuples that come from the index in wrong order, orperhaps
add a new node type for it.
Of course that's exactly what your partial sort patch does :-). I
haven't
looked at that in detail, but I don't think the approach the partial
sort
patch takes will work here as is. In the KNN-GiST case, the index is
returning tuples roughly in the right order, but a tuple that itreturns
might in reality belong somewhere later in the ordering. In the partial
sort patch, the "input stream" of tuples is divided intonon-overlapping
groups, so that the tuples within the group are not sorted, but the
groups
are. I think the partial sort case is a special case of the KNN-GiST
case,
if you consider the lower bound of each tuple to be the leading keys
that
you don't need to sort.
Yes. But, for instance btree accesses heap for unique checking. Is really
it so crimilal? :-)
This is not only question of a new node or extending existing node. Weneed
to teach planner/executor access method can return value of some
expression
which is lower bound of another expression. AFICS now access method can
return only original indexed datums and TIDs. So, I afraid that enormous
infrastructure changes are required. And I can hardly imagine what they
should look like.Unfortunately, I am not experienced enough to judge your implementation.
As far as I understand the problem is partially sorting rows on
the index scan node. It can lead the planner to choose non-optimal
plans, because of not taking into account the cost of sorting.
Cost estimation of GiST is a big problem anyway. It doesn't care (and
can't) about amount of recheck for regular operators. In this patch, same
would be for knn recheck. The problem is that touching heap from access
method breaks incapsulation. One idea about this is to do sorting in
another nodes. However, I wonder if it would be an overengineering and
overhead. In attached patch I propose a different approach: put code
touching heap into separate index_get_heap_values function. Also new
version of patch includes regression tests and some cleanup.
------
With best regards,
Alexander Korotkov.
Attachments:
knn-gist-recheck-3.patchapplication/octet-stream; name=knn-gist-recheck-3.patchDownload+472-132
I added the point to polygon distance operator patch to the open
CommitFest as ready for committer and added myself as reviewer to
both of the patches.
I think that for most use cases just some operators require further sorting
and some of them not. But it could appear one day that some index gives
part of its knn answers exact and part of them inexact. Same happen to
recheck of regular operators. Initially recheck flag was defined in
opclass. But later recheck became runtime flag.
I cannot think of an use case, but it makes sense to add the flag to
the distance function just like the consistent function if we will go
with this implementation.
Cost estimation of GiST is a big problem anyway. It doesn't care (and
can't) about amount of recheck for regular operators. In this patch, same
would be for knn recheck. The problem is that touching heap from access
method breaks incapsulation. One idea about this is to do sorting in
another nodes. However, I wonder if it would be an overengineering and
overhead. In attached patch I propose a different approach: put code
touching heap into separate index_get_heap_values function. Also new
version of patch includes regression tests and some cleanup.
While looking it at I found a bug. It returns the second column
in wrong order when both of the distance functions return recheck = true.
Test script attached to run on the regression database. I tried to
fix but could not. searchTreeItemDistanceRecheck function is not
very easy to follow. I think it deserves more comments.
Attachments:
On Sun, Sep 14, 2014 at 10:09 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
I added the point to polygon distance operator patch to the open
CommitFest as ready for committer and added myself as reviewer to
both of the patches.
Thanks.
Cost estimation of GiST is a big problem anyway. It doesn't care (and
can't) about amount of recheck for regular operators. In this patch, same
would be for knn recheck. The problem is that touching heap from access
method breaks incapsulation. One idea about this is to do sorting in
another nodes. However, I wonder if it would be an overengineering and
overhead. In attached patch I propose a different approach: put code
touching heap into separate index_get_heap_values function. Also new
version of patch includes regression tests and some cleanup.While looking it at I found a bug. It returns the second column
in wrong order when both of the distance functions return recheck = true.
Test script attached to run on the regression database. I tried to
fix but could not. searchTreeItemDistanceRecheck function is not
very easy to follow. I think it deserves more comments.
Fixed, thanks. It was logical error in comparison function implementation.
------
With best regards,
Alexander Korotkov.
Attachments:
knn-gist-recheck-4.patchapplication/octet-stream; name=knn-gist-recheck-4.patchDownload+476-136
While looking it at I found a bug. It returns the second column
in wrong order when both of the distance functions return recheck = true.
Test script attached to run on the regression database. I tried to
fix but could not. searchTreeItemDistanceRecheck function is not
very easy to follow. I think it deserves more comments.Fixed, thanks. It was logical error in comparison function implementation.
I managed to break it again by ordering rows only by the second column
of the index. Test script attached.
Attachments:
I managed to break it again by ordering rows only by the second column
of the index. Test script attached.
I was confused. It is undefined behavior. Sorry for the noise.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 17, 2014 at 12:30 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
I managed to break it again by ordering rows only by the second column
of the index. Test script attached.I was confused. It is undefined behavior. Sorry for the noise.
No problem. Thanks a lot for testing.
------
With best regards,
Alexander Korotkov.
Fixed, thanks.
Here are my questions and comments about the code.
doc/src/sgml/gist.sgml:812:
be rechecked from heap tuple before tuple is returned. If
<literal>recheck</> flag isn't set then it's true by default for
compatibility reasons. The <literal>recheck</> flag can be used only
Recheck flag is set to false on gistget.c so I think it should say
"false by default". On the other hand, it is true by default on
the consistent function. It is written as "the safest assumption"
on the code comments. I don't know why the safest is chosen over
the backwards compatible for the consistent function.
src/backend/access/gist/gistget.c:505:
/* Recheck distance from heap tuple if needed */
if (GISTSearchItemIsHeap(*item) &&
searchTreeItemNeedDistanceRecheck(scan, so->curTreeItem))
{
searchTreeItemDistanceRecheck(scan, so->curTreeItem, item);
continue;
}
Why so->curTreeItem is passed to these functions? They can use
scan->opaque->curTreeItem.
src/backend/access/gist/gistscan.c:49:
/*
* When all distance values are the same, items without recheck
* can be immediately returned. So they are placed first.
*/
if (recheckCmp == 0 && distance_a.recheck != distance_b.recheck)
recheckCmp = distance_a.recheck ? 1 : -1;
I don't understand why items without recheck can be immediately
returned. Do you think it will work correctly when there is
an operator class which will return recheck true and false for
the items under the same page?
src/backend/access/index/indexam.c:258:
/* Prepare data structures for getting original indexed values from heap */
scan->indexInfo = BuildIndexInfo(scan->indexRelation);
scan->estate = CreateExecutorState();
scan->slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation));
With the changes in indexam.c, heap access become legal for all index
access methods. I think it is better than the previous version but
I am leaving the judgement to someone experienced. I will try to
summarize the pros and cons of sorting the rows in the GiST access
method, as far as I understand.
Pros:
* It does not require another queue. It should be effective to sort
the rows inside the queue the GiST access method already has.
* It does not complicate index access method infrastructure.
Cons:
* It could be done without additional heap access.
* Other access methods could make use of the sorting infrastructure
one day.
* It could be more transparent to the users. Sorting information
could be shown on the explain output.
* A more suitable data structure like binary heap could be used
for the queue to sort the rows.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 25, 2014 at 9:00 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
Fixed, thanks.
Here are my questions and comments about the code.
doc/src/sgml/gist.sgml:812:
be rechecked from heap tuple before tuple is returned. If
<literal>recheck</> flag isn't set then it's true by default for
compatibility reasons. The <literal>recheck</> flag can be usedonly
Recheck flag is set to false on gistget.c so I think it should say
"false by default". On the other hand, it is true by default on
the consistent function. It is written as "the safest assumption"
on the code comments. I don't know why the safest is chosen over
the backwards compatible for the consistent function.
Agree. It should be clarified in docs.
src/backend/access/gist/gistget.c:505:
/* Recheck distance from heap tuple if needed */
if (GISTSearchItemIsHeap(*item) &&
searchTreeItemNeedDistanceRecheck(scan,so->curTreeItem))
{
searchTreeItemDistanceRecheck(scan,so->curTreeItem, item);
continue;
}Why so->curTreeItem is passed to these functions? They can use
scan->opaque->curTreeItem.
I didn't get the difference. Few lines before:
GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
src/backend/access/gist/gistscan.c:49:
/*
* When all distance values are the same, items withoutrecheck
* can be immediately returned. So they are placed first.
*/
if (recheckCmp == 0 && distance_a.recheck !=distance_b.recheck)
recheckCmp = distance_a.recheck ? 1 : -1;
I don't understand why items without recheck can be immediately
returned. Do you think it will work correctly when there is
an operator class which will return recheck true and false for
the items under the same page?
Yes, I believe so. Item with recheck can't decrease it's distance, it can
only increase it. In the corner case item can have same distance after
recheck as it was before. Then anyway items which distances are the same
can be returned in any order.
src/backend/access/index/indexam.c:258:
/* Prepare data structures for getting original indexed values
from heap */
scan->indexInfo = BuildIndexInfo(scan->indexRelation);
scan->estate = CreateExecutorState();
scan->slot =MakeSingleTupleTableSlot(RelationGetDescr(heapRelation));
With the changes in indexam.c, heap access become legal for all index
access methods. I think it is better than the previous version but
I am leaving the judgement to someone experienced. I will try to
summarize the pros and cons of sorting the rows in the GiST access
method, as far as I understand.Pros:
* It does not require another queue. It should be effective to sort
the rows inside the queue the GiST access method already has.
* It does not complicate index access method infrastructure.Cons:
* It could be done without additional heap access.
* Other access methods could make use of the sorting infrastructure
one day.
* It could be more transparent to the users. Sorting information
could be shown on the explain output.
It would be also nice to show some information about KNN itself.
* A more suitable data structure like binary heap could be used
for the queue to sort the rows.
Binary heap seems to be better data structure for whole KNN-GiST. But it's
a subject for a separate patch: replace RB-tree to heap in KNN-GiST. It's
not related to recheck stuff.
------
With best regards,
Alexander Korotkov.
On Sun, Sep 14, 2014 at 11:34:26PM +0400, Alexander Korotkov wrote:
Cost estimation of GiST is a big problem anyway. It doesn't care (and
can't) about amount of recheck for regular operators. In this patch, same
would be for knn recheck. The problem is that touching heap from access
This is very important work. While our existing KNN-GiST index code
works fine for scalar values and point-to-point distance ordering, it
doesn't work well for 2-dimensional objects because they are only
indexed by their bounding boxes (a rectangle around the object). The
indexed bounding box can't produce accurate distances to other objects.
As an example, see this PostGIS blog post showing how to use LIMIT in a
CTE to filter results and then compute the closest object (search for
"LIMIT 50"):
http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html
This patch fixes our code for distances from a point to indexed 2-D
objects.
Does this also fix the identical PostGIS problem or is there something
PostGIS needs to do?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 26, 2014 at 5:18 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Sun, Sep 14, 2014 at 11:34:26PM +0400, Alexander Korotkov wrote:
Cost estimation of GiST is a big problem anyway. It doesn't care
(and
can't) about amount of recheck for regular operators. In this
patch, same
would be for knn recheck. The problem is that touching heap from
access
This is very important work. While our existing KNN-GiST index code
works fine for scalar values and point-to-point distance ordering, it
doesn't work well for 2-dimensional objects because they are only
indexed by their bounding boxes (a rectangle around the object). The
indexed bounding box can't produce accurate distances to other objects.As an example, see this PostGIS blog post showing how to use LIMIT in a
CTE to filter results and then compute the closest object (search for
"LIMIT 50"):http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html
This patch fixes our code for distances from a point to indexed 2-D
objects.Does this also fix the identical PostGIS problem or is there something
PostGIS needs to do?
This patch provides general infrastructure for recheck in KNN-GiST. PostGIS
need corresponding change in its GiST opclass. Since PostGIS already define
<-> and <#> operators as distance to bounding box border and bounding box
center, it can't change their behaviour.
it has to support new operator "exact distance" in opclass.
------
With best regards,
Alexander Korotkov.
On Fri, Sep 26, 2014 at 10:49:42AM +0400, Alexander Korotkov wrote:
Does this also fix the identical PostGIS problem or is there something
PostGIS needs to do?This patch provides general infrastructure for recheck in KNN-GiST. PostGIS
need corresponding change in its GiST opclass. Since PostGIS already define <->
and <#> operators as distance to bounding box border and bounding box center,
it can't change their behaviour.
it has to support new operator "exact distance" in opclass.�
Ah, OK, so they just need something that can be used for the recheck. I
think they currently use ST_Distance() for that. Does it have to be an
operator? If they defined an operator for ST_Distance(), would
ST_Distance() work too for KNN-GiST?
In summary, you still create a normal GiST index on the column:
http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html
CREATE INDEX planet_osm_line_ref_index ON planet_osm_line(ref);
which indexes by the bounding box. The new code will allow ordered
index hits to be filtered by something like ST_Distance(), rather than
having to a LIMIT 50 in a CTE, then call ST_Distance(), like this:
EXPLAIN ANALYZE WITH distance AS (
SELECT way AS road, ref AS route
FROM planet_osm_line
WHERE highway = 'secondary'
ORDER BY ST_GeomFromText('POLYGON((14239931.42 3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08 3053984.84,14239931.42 3054117.72))', 900913) <#> way
LIMIT 50
)
SELECT ST_Distance(ST_GeomFromText('POLYGON((14239931.42 3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08 3053984.84,14239931.42 3054117.72))', 900913), road) AS true_distance, route
FROM distance
ORDER BY true_distance
LIMIT 1;
Notice the CTE uses <#> (bounding box center), and then the outer query
uses ST_Distance and LIMIT 1 to find the closest item.
Excellent!
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Sep 29, 2014 at 6:16 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Fri, Sep 26, 2014 at 10:49:42AM +0400, Alexander Korotkov wrote:
Does this also fix the identical PostGIS problem or is there
something
PostGIS needs to do?
This patch provides general infrastructure for recheck in KNN-GiST.
PostGIS
need corresponding change in its GiST opclass. Since PostGIS already
define <->
and <#> operators as distance to bounding box border and bounding box
center,
it can't change their behaviour.
it has to support new operator "exact distance" in opclass.Ah, OK, so they just need something that can be used for the recheck. I
think they currently use ST_Distance() for that. Does it have to be an
operator? If they defined an operator for ST_Distance(), would
ST_Distance() work too for KNN-GiST?
Currently, ST_Distance is a function, but it's no problem to make it an
operator with KNN-GiST support.
In summary, you still create a normal GiST index on the column:
http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html
CREATE INDEX planet_osm_line_ref_index ON planet_osm_line(ref);
which indexes by the bounding box. The new code will allow ordered
index hits to be filtered by something like ST_Distance(), rather than
having to a LIMIT 50 in a CTE, then call ST_Distance(), like this:EXPLAIN ANALYZE WITH distance AS (
SELECT way AS road, ref AS route
FROM planet_osm_line
WHERE highway = 'secondary'
ORDER BY ST_GeomFromText('POLYGON((14239931.42
3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08
3053984.84,14239931.42 3054117.72))', 900913) <#> way
LIMIT 50
)
SELECT ST_Distance(ST_GeomFromText('POLYGON((14239931.42
3054117.72,14239990.49 3054224.25,14240230.15 3054091.38,14240171.08
3053984.84,14239931.42 3054117.72))', 900913), road) AS true_distance, route
FROM distance
ORDER BY true_distance
LIMIT 1;
Yeah. It this query 50 is pure empirical value. It could be both too low or
too high. Too low value can cause wrong query answers. Too high value can
cause lower performance. With patch simple KNN query will work like this
query with always right value in LIMIT clause.
Notice the CTE uses <#> (bounding box center), and then the outer query
uses ST_Distance and LIMIT 1 to find the closest item.Excellent!
Thanks. The main question now is design of this patch. Currently, it does
all the work inside access method. We already have some discussion of pro
and cons of this method. I would like to clarify alternatives now. I can
see following way:
1. Implement new executor node which performs sorting by priority queue.
Let's call it "Priority queue". I think it should be separate node from
"Sort" node. Despite "Priority queue" and "Sort" are essentially similar
from user view, they would be completely different in implementation.
2. Implement some interface to transfer distance values from access
method to "Priority queue" node.
3. Somehow tell the planner that it could use "Priority queue" in
corresponding cases. I see two ways of doing this:
- Add flag to operator in opclass indicating that index can only
order by lower bound of "col op value", not by "col op value" itself.
- Define new relation between operators. Value of one operator could
be lower bound for value of another operator. So, planner can
put "Priority
queue" node when lower bound ordering is possible from index. Also "ALTER
OPERATOR" command would be reasonable, so extensions could upgrade.
Besides overhead, this way makes significant infrastructural changes. So,
it may be over-engineering. However, it's probably more clean and beautiful
solution.
I would like to get some feedback from people familiar with KNN-GiST like
Heikki or Tom. What do you think about this? Any other ideas?
------
With best regards,
Alexander Korotkov.
Thanks. The main question now is design of this patch. Currently, it does
all the work inside access method. We already have some discussion of pro
and cons of this method. I would like to clarify alternatives now. I can
see following way:1. Implement new executor node which performs sorting by priority queue.
Let's call it "Priority queue". I think it should be separate node from
"Sort" node. Despite "Priority queue" and "Sort" are essentially similar
from user view, they would be completely different in implementation.
2. Implement some interface to transfer distance values from access
method to "Priority queue" node.
If we assume that all of them need recheck, maybe it can be done
without passing distance values.
3. Somehow tell the planner that it could use "Priority queue" in
corresponding cases. I see two ways of doing this:
- Add flag to operator in opclass indicating that index can only
order by lower bound of "col op value", not by "col op value" itself.
- Define new relation between operators. Value of one operator could
be lower bound for value of another operator. So, planner can
put "Priority
queue" node when lower bound ordering is possible from index. Also "ALTER
OPERATOR" command would be reasonable, so extensions could upgrade.
I think, it would be better to make it a property of the operator
class. We can add a column to pg_amop or define another value for
amoppurpose on pg_amop. Syntax can be something like this:
CREATE OPERATOR CLASS circle_ops DEFAULT
FOR TYPE circle USING gist AS
OPERATOR 15 <->(circle, point) FOR ORDER BY pg_catalog.float_ops LOWER BOUND;
While looking at it, I realize that current version of the patch does
not use the sort operator family defined with the operator class. It
assumes that the distance function will return values compatible with
the operator. Operator class definition makes me think that there is
not such an assumption.
Besides overhead, this way makes significant infrastructural changes. So,
it may be over-engineering. However, it's probably more clean and beautiful
solution.
I would like to get some feedback from people familiar with KNN-GiST like
Heikki or Tom. What do you think about this? Any other ideas?
I would be happy to test and review the changes. I think it is nice
to solve the problem in a generalized way improving the access method
infrastructure. Definitely, we should have a consensus on the design
before working on the infrastructure changes.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Jan 13, 2014 at 9:17 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
This patch was split from thread:
/messages/by-id/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.comI've split it to separate thead, because it's related to partial sort only
conceptually not technically. Also I renamed it to "knn-gist-recheck" from
"partial-knn" as more appropriate name. In the attached version docs are
updated. Possible weak point of this patch design is that it fetches heap
tuple from GiST scan. However, I didn't receive any notes about its design,
so, I'm going to put it to commitfest.
The partial sort thing is not in the current 2014-10 commitfest
(although this patch is). Is that intentional?
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers