pgsql: Allow GiST distance function to return merely a lower-bound.
Allow GiST distance function to return merely a lower-bound.
The distance function can now set *recheck = false, like index quals. The
executor will then re-check the ORDER BY expressions, and use a queue to
reorder the results on the fly.
This makes it possible to do kNN-searches on polygons and circles, which
don't store the exact value in the index, but just a bounding box.
Alexander Korotkov and me
Branch
------
master
Details
-------
http://git.postgresql.org/pg/commitdiff/35fcb1b3d038a501f3f4c87c05630095abaaadab
Modified Files
--------------
doc/src/sgml/gist.sgml | 35 ++-
src/backend/access/gist/gistget.c | 30 ++-
src/backend/access/gist/gistproc.c | 37 +++
src/backend/access/gist/gistscan.c | 5 +
src/backend/executor/nodeIndexscan.c | 379 +++++++++++++++++++++++++++-
src/backend/optimizer/plan/createplan.c | 73 ++++--
src/backend/utils/adt/geo_ops.c | 27 ++
src/include/access/genam.h | 3 +
src/include/access/relscan.h | 9 +
src/include/catalog/catversion.h | 2 +-
src/include/catalog/pg_amop.h | 2 +
src/include/catalog/pg_amproc.h | 2 +
src/include/catalog/pg_operator.h | 8 +-
src/include/catalog/pg_proc.h | 4 +
src/include/nodes/execnodes.h | 20 ++
src/include/nodes/plannodes.h | 10 +-
src/include/utils/geo_decls.h | 3 +
src/test/regress/expected/create_index.out | 78 ++++++
src/test/regress/sql/create_index.sql | 12 +
19 files changed, 699 insertions(+), 40 deletions(-)
--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers
On Fri, May 15, 2015 at 8:27 PM, Heikki Linnakangas
<heikki.linnakangas@iki.fi> wrote:
Allow GiST distance function to return merely a lower-bound.
The distance function can now set *recheck = false, like index quals. The
executor will then re-check the ORDER BY expressions, and use a queue to
reorder the results on the fly.This makes it possible to do kNN-searches on polygons and circles, which
don't store the exact value in the index, but just a bounding box.Alexander Korotkov and me
Branch
------
masterDetails
-------
http://git.postgresql.org/pg/commitdiff/35fcb1b3d038a501f3f4c87c05630095abaaadabModified Files
--------------
doc/src/sgml/gist.sgml | 35 ++-
src/backend/access/gist/gistget.c | 30 ++-
src/backend/access/gist/gistproc.c | 37 +++
src/backend/access/gist/gistscan.c | 5 +
src/backend/executor/nodeIndexscan.c | 379 +++++++++++++++++++++++++++-
src/backend/optimizer/plan/createplan.c | 73 ++++--
src/backend/utils/adt/geo_ops.c | 27 ++
src/include/access/genam.h | 3 +
src/include/access/relscan.h | 9 +
src/include/catalog/catversion.h | 2 +-
src/include/catalog/pg_amop.h | 2 +
src/include/catalog/pg_amproc.h | 2 +
src/include/catalog/pg_operator.h | 8 +-
src/include/catalog/pg_proc.h | 4 +
src/include/nodes/execnodes.h | 20 ++
src/include/nodes/plannodes.h | 10 +-
src/include/utils/geo_decls.h | 3 +
src/test/regress/expected/create_index.out | 78 ++++++
src/test/regress/sql/create_index.sql | 12 +
19 files changed, 699 insertions(+), 40 deletions(-)
Seems this patch causes the regression test of pg_trgm fail.
The regression diff that I got is:
*** /home/postgres/pgsql/head/contrib/pg_trgm/expected/pg_trgm.out
2013-07-23 16:46:22.212488785 +0900
--- /home/postgres/pgsql/head/contrib/pg_trgm/results/pg_trgm.out
2015-05-15 20:59:16.574926732 +0900
***************
*** 2332,2343 ****
(3 rows)
select t <-> 'q0987wertyu0988', t from test_trgm order by t <->
'q0987wertyu0988' limit 2;
! ?column? | t
! ----------+-------------
! 0.411765 | qwertyu0988
! 0.5 | qwertyu0987
! (2 rows)
!
drop index trgm_idx;
create index trgm_idx on test_trgm using gin (t gin_trgm_ops);
set enable_seqscan=off;
--- 2332,2338 ----
(3 rows)
select t <-> 'q0987wertyu0988', t from test_trgm order by t <->
'q0987wertyu0988' limit 2;
! ERROR: index returned tuples in wrong order
drop index trgm_idx;
create index trgm_idx on test_trgm using gin (t gin_trgm_ops);
set enable_seqscan=off;
Regards,
--
Fujii Masao
--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers
On 05/15/2015 03:05 PM, Fujii Masao wrote:
Seems this patch causes the regression test of pg_trgm fail.
The regression diff that I got is:*** /home/postgres/pgsql/head/contrib/pg_trgm/expected/pg_trgm.out 2013-07-23 16:46:22.212488785 +0900 --- /home/postgres/pgsql/head/contrib/pg_trgm/results/pg_trgm.out 2015-05-15 20:59:16.574926732 +0900 *************** *** 2332,2343 **** (3 rows)select t <-> 'q0987wertyu0988', t from test_trgm order by t <-> 'q0987wertyu0988' limit 2; ! ?column? | t ! ----------+------------- ! 0.411765 | qwertyu0988 ! 0.5 | qwertyu0987 ! (2 rows) ! drop index trgm_idx; create index trgm_idx on test_trgm using gin (t gin_trgm_ops); set enable_seqscan=off; --- 2332,2338 ---- (3 rows)select t <-> 'q0987wertyu0988', t from test_trgm order by t <->
'q0987wertyu0988' limit 2;
! ERROR: index returned tuples in wrong order
drop index trgm_idx;
create index trgm_idx on test_trgm using gin (t gin_trgm_ops);
set enable_seqscan=off;
Hmm, OK. pg_trgm works for me, but I'll take a look. (rover_firefly also
went red, due to rounding differences in the regression test)
- Heikki
--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers
On 05/15/2015 03:17 PM, Heikki Linnakangas wrote:
On 05/15/2015 03:05 PM, Fujii Masao wrote:
Seems this patch causes the regression test of pg_trgm fail.
The regression diff that I got is:*** /home/postgres/pgsql/head/contrib/pg_trgm/expected/pg_trgm.out 2013-07-23 16:46:22.212488785 +0900 --- /home/postgres/pgsql/head/contrib/pg_trgm/results/pg_trgm.out 2015-05-15 20:59:16.574926732 +0900 *************** *** 2332,2343 **** (3 rows)select t <-> 'q0987wertyu0988', t from test_trgm order by t <-> 'q0987wertyu0988' limit 2; ! ?column? | t ! ----------+------------- ! 0.411765 | qwertyu0988 ! 0.5 | qwertyu0987 ! (2 rows) ! drop index trgm_idx; create index trgm_idx on test_trgm using gin (t gin_trgm_ops); set enable_seqscan=off; --- 2332,2338 ---- (3 rows)select t <-> 'q0987wertyu0988', t from test_trgm order by t <->
'q0987wertyu0988' limit 2;
! ERROR: index returned tuples in wrong order
drop index trgm_idx;
create index trgm_idx on test_trgm using gin (t gin_trgm_ops);
set enable_seqscan=off;Hmm, OK. pg_trgm works for me, but I'll take a look. (rover_firefly also
went red, due to rounding differences in the regression test)
There are two issues here:
1. I forgot to initialize the "recheck" variable before calling the
distance function. pg_trgm's distance function is never lossy, and it
doesn't set recheck to anything. If 'recheck' happens to be true,
because it's uninitialized, the executor node will try to reorder the
tuples.
2. There is confusion on the datatype of the distance. pg_trgm's <->
operator returns float4, but the GIST code and pg_trgm's GIST distance
function always returns a float8. Gist thus returns the distance as a
float8, but the executor node re-calculates it as a float4, and then
tries to compare the two using float8 < operator.
The immediate problem goes away if we fix 1. and initialize the
variable, as the executor won't try to re-calculate or compare the ORDER
BY expressions if the index never returns a lossy tuple, but the
confusion on the datatypes is still real.
It's not very nice that gist has a hardcoded assumption that the
distance is always measured as a float8. It would be better to support
whatever datatype the distance function returns, and use the type's
comparison functions.
But as a quick fix, I think we should add a special case for float4.
Gist should check if the datatype of the original ordering operator is
float4, and convert the float8 used internally to float4 before
returning it. If the datatype is anything other than float4 or float8,
throw an error.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Heikki Linnakangas <heikki.linnakangas@iki.fi> writes:
Allow GiST distance function to return merely a lower-bound.
I wondered how come this patch did not touch nodeIndexonlyscan.c.
After some investigation, it seems the only reason that this patch even
appears to work is that none of the built-in or contrib opclasses support
both lossy ordering operators and GIST fetch functions. Otherwise we
would do an index-only scan and the executor would simply ignore the
recheck flag.
I doubt we can ship this in this state; even if the core code doesn't
exercise the problematic combination, surely third-party opclasses
will want to?
I thought about hacking the planner to not select index-only scans,
but there's no way for it to know whether the opclass might return
recheck = true at runtime. I think the only real fix is to actually
propagate all the changes in nodeIndexscan.c into nodeIndexonlyscan.c.
Testing it would be problematic without a suitable opclass, though :-(
A short-term hack might be to throw a "not implemented" error in
nodeIndexonlyscan.c if it sees the distance-recheck flag set.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
I wrote:
Heikki Linnakangas <heikki.linnakangas@iki.fi> writes:
Allow GiST distance function to return merely a lower-bound.
I wondered how come this patch did not touch nodeIndexonlyscan.c.
After some investigation, it seems the only reason that this patch even
appears to work is that none of the built-in or contrib opclasses support
both lossy ordering operators and GIST fetch functions. Otherwise we
would do an index-only scan and the executor would simply ignore the
recheck flag.
I doubt we can ship this in this state; even if the core code doesn't
exercise the problematic combination, surely third-party opclasses
will want to?
On further thought, maybe it's OK: we'd only be using an index-only scan
if the index can return the exact value of the indexed column (as the
circle and polygon GIST opclasses cannot). If it can do that, then it
should be able to compute exact distances too --- worst case, it could
reconstitute the column value and apply the distance operator itself.
So maybe we don't need to add a pile of code for that.
We do need a not-implemented error check though, so I added one.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers