ToDo: KNN Search should to support DISTINCT clasuse?
Hello
I should to search distinct values based on similarity
postgres=# explain select nazobce, nazobce <-> 'Benešov' from obce
order by nazobce <-> 'Benešov' limit 10
;
QUERY PLAN
-------------------------------------------------------------------------------------------
Limit (cost=0.00..0.86 rows=10 width=10)
-> Index Scan using obce_nazobce_idx on obce (cost=0.00..1433.14
rows=16644 width=10)
Order By: (nazobce <-> 'Benešov'::text)
(3 rows)
Time: 0.576 ms
but using DISTINCT breaks KNN searching optimization
postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from
obce order by nazobce <-> 'Benešov' limit 10
;
QUERY PLAN
-----------------------------------------------------------------------------
Limit (cost=600.45..600.47 rows=10 width=10)
-> Sort (cost=600.45..613.80 rows=5341 width=10)
Sort Key: ((nazobce <-> 'Benešov'::text))
-> HashAggregate (cost=418.27..485.03 rows=5341 width=10)
-> Seq Scan on obce (cost=0.00..335.05 rows=16644 width=10)
(5 rows)
Regards
Pavel Stehule
Pavel Stehule <pavel.stehule@gmail.com> writes:
but using DISTINCT breaks KNN searching optimization
postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from
obce order by nazobce <-> 'Benešov' limit 10
Don't hold your breath. There are two ways the system could implement
the DISTINCT clause: either sort and uniq, or hashaggregate.
hashaggregate will destroy any input ordering, so there's no value in
using the index as input. sort and uniq requires the input to be sorted
by *all* the columns being distinct'ed, not just one, so again this
index isn't useful. You could get a plan using the index if you only
wanted the <-> output column, eg
contrib_regression=# explain select distinct t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (cost=0.00..0.87 rows=10 width=12)
-> Unique (cost=0.00..86.75 rows=1000 width=12)
-> Index Scan using ti on test_trgm (cost=0.00..84.25 rows=1000 width=12)
Order By: (t <-> 'foo'::text)
(4 rows)
Perhaps it would be close enough to what you want to use DISTINCT ON:
contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (cost=0.00..0.87 rows=10 width=12)
-> Unique (cost=0.00..86.75 rows=1000 width=12)
-> Index Scan using ti on test_trgm (cost=0.00..84.25 rows=1000 width=12)
Order By: (t <-> 'foo'::text)
(4 rows)
regards, tom lane
2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:
Pavel Stehule <pavel.stehule@gmail.com> writes:
but using DISTINCT breaks KNN searching optimization
postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from
obce order by nazobce <-> 'Benešov' limit 10Don't hold your breath. There are two ways the system could implement
the DISTINCT clause: either sort and uniq, or hashaggregate.
hashaggregate will destroy any input ordering, so there's no value in
using the index as input. sort and uniq requires the input to be sorted
by *all* the columns being distinct'ed, not just one, so again this
index isn't useful. You could get a plan using the index if you only
wanted the <-> output column, egcontrib_regression=# explain select distinct t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (cost=0.00..0.87 rows=10 width=12)
-> Unique (cost=0.00..86.75 rows=1000 width=12)
-> Index Scan using ti on test_trgm (cost=0.00..84.25 rows=1000 width=12)
Order By: (t <-> 'foo'::text)
(4 rows)Perhaps it would be close enough to what you want to use DISTINCT ON:
contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (cost=0.00..0.87 rows=10 width=12)
-> Unique (cost=0.00..86.75 rows=1000 width=12)
-> Index Scan using ti on test_trgm (cost=0.00..84.25 rows=1000 width=12)
Order By: (t <-> 'foo'::text)
(4 rows)regards, tom lane
good tip - it's working
thank you
Regards
Pavel
On Mon, Oct 22, 2012 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Pavel Stehule <pavel.stehule@gmail.com> writes:
but using DISTINCT breaks KNN searching optimization
postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from
obce order by nazobce <-> 'Benešov' limit 10Don't hold your breath. There are two ways the system could implement
the DISTINCT clause: either sort and uniq, or hashaggregate.
hashaggregate will destroy any input ordering, so there's no value in
using the index as input.
Isn't that an implementation limitation though, rather than a
fundamental limitation?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Oct 22, 2012 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Don't hold your breath. There are two ways the system could implement
the DISTINCT clause: either sort and uniq, or hashaggregate.
hashaggregate will destroy any input ordering, so there's no value in
using the index as input.
Isn't that an implementation limitation though, rather than a
fundamental limitation?
Perhaps, but it's not a simple one to surmount, and I'm dubious about
putting the amount of work that'd be required into such a corner case.
regards, tom lane
2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Oct 22, 2012 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Don't hold your breath. There are two ways the system could implement
the DISTINCT clause: either sort and uniq, or hashaggregate.
hashaggregate will destroy any input ordering, so there's no value in
using the index as input.Isn't that an implementation limitation though, rather than a
fundamental limitation?Perhaps, but it's not a simple one to surmount, and I'm dubious about
putting the amount of work that'd be required into such a corner case.
I don't think so this use case is too special - but workaround working well
Regards
Pavel
Show quoted text
regards, tom lane
On Mon, Oct 22, 2012 at 4:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Don't hold your breath. There are two ways the system could implement
the DISTINCT clause: either sort and uniq, or hashaggregate.
hashaggregate will destroy any input ordering, so there's no value in
using the index as input. sort and uniq requires the input to be sorted
by *all* the columns being distinct'ed, not just one, so again this
index isn't useful.
We already have some bits that understand functional dependencies for
distinct/group by don't we? Do we detect that col <-> 'foo' is
functionally dependent on col? If so is it possible to construct a
multicolumn index that can produce an ordering like [col <-> 'foo',
col] which could be used to get distinct values of col in the knn
order (since the first column is functionally dependent on the second
it can be ignored for grouping purposes).
Not that we can do this now but I wonder whether a lot of the pieces
are already there and just need to be hooked up together.
--
greg
On Mon, Oct 22, 2012 at 7:09 PM, Greg Stark <stark@mit.edu> wrote:
On Mon, Oct 22, 2012 at 4:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Don't hold your breath. There are two ways the system could implement
the DISTINCT clause: either sort and uniq, or hashaggregate.
hashaggregate will destroy any input ordering, so there's no value in
using the index as input. sort and uniq requires the input to be sorted
by *all* the columns being distinct'ed, not just one, so again this
index isn't useful.We already have some bits that understand functional dependencies for
distinct/group by don't we? Do we detect that col <-> 'foo' is
functionally dependent on col? If so is it possible to construct a
multicolumn index that can produce an ordering like [col <-> 'foo',
col] which could be used to get distinct values of col in the knn
order (since the first column is functionally dependent on the second
it can be ignored for grouping purposes).Not that we can do this now but I wonder whether a lot of the pieces
are already there and just need to be hooked up together.
I think the real issue is that a hash aggregate doesn't emit any rows
until the entire input is read, so it doesn't play nice with LIMIT.
In general there's no other possible strategy, because you could get
another row belonging to an existing group at any time right up
through the end of the input. However, when the HashAgg node is only
implementing DISTINCT (ON), you can emit each new row as soon as you
see it, and just make the hash table entry to be certain you don't
emit it again. I think someone recently submitted a patch along these
lines and we rejected it because the use case was too thin ... but
this example is making me think that the use case might not be all
that thin after all.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Oct 23, 2012 at 7:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
through the end of the input. However, when the HashAgg node is only
implementing DISTINCT (ON), you can emit each new row as soon as you
see it, and just make the hash table entry to be certain you don't
emit it again. I think someone recently submitted a patch along these
lines and we rejected it because the use case was too thin ... but
this example is making me think that the use case might not be all
that thin after all.
If anyone wants to play around, the initial patch is available here:
It looks to me like it should work for the distinct on KNN search case
out of the box. However it needs some planner adjustment for more
complex plans and maybe some refactoring to lose duplication in the
executor to be worth considering committing.
Regards,
Ants Aasma
Pavel Stehule wrote:
2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:
Perhaps it would be close enough to what you want to use DISTINCT ON:
contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;
good tip - it's working
If two or more values happen to be at exactly the same distance,
wouldn't you just get one of them?
-Kevin
Import Notes
Resolved by subject fallback
"Kevin Grittner" <kgrittn@mail.com> writes:
Pavel Stehule wrote:
2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:
Perhaps it would be close enough to what you want to use DISTINCT ON:
contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;
good tip - it's working
If two or more values happen to be at exactly the same distance,
wouldn't you just get one of them?
Yeah, that is a hazard. I'm not sure whether <->'s results are
sufficiently quantized to make that a big problem in practice.
regards, tom lane
Tom Lane wrote:
"Kevin Grittner" <kgrittn@mail.com> writes:
Pavel Stehule wrote:
2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:
Perhaps it would be close enough to what you want to use DISTINCT ON:
contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;good tip - it's working
If two or more values happen to be at exactly the same distance,
wouldn't you just get one of them?Yeah, that is a hazard. I'm not sure whether <->'s results are
sufficiently quantized to make that a big problem in practice.
It doesn't seem too far-fetched for trigram queries:
test=# select nm, nm <-> 'anders' from (values ('anderson'),('andersen'),('andersly')) x(nm);
nm | ?column?
----------+----------
anderson | 0.4
andersen | 0.4
andersly | 0.4
(3 rows)
test=# select distinct on (nm <-> 'anders') nm, nm <-> 'anders' from (values ('anderson'),('andersen'),('andersly')) x(nm) order by nm <-> 'anders' limit 3;
nm | ?column?
----------+----------
anderson | 0.4
(1 row)
-Kevin
Import Notes
Resolved by subject fallback
2012/10/25 Kevin Grittner <kgrittn@mail.com>:
Tom Lane wrote:
"Kevin Grittner" <kgrittn@mail.com> writes:
Pavel Stehule wrote:
2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:
Perhaps it would be close enough to what you want to use DISTINCT ON:
contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;good tip - it's working
If two or more values happen to be at exactly the same distance,
wouldn't you just get one of them?Yeah, that is a hazard. I'm not sure whether <->'s results are
sufficiently quantized to make that a big problem in practice.It doesn't seem too far-fetched for trigram queries:
test=# select nm, nm <-> 'anders' from (values ('anderson'),('andersen'),('andersly')) x(nm);
nm | ?column?
----------+----------
anderson | 0.4
andersen | 0.4
andersly | 0.4
(3 rows)test=# select distinct on (nm <-> 'anders') nm, nm <-> 'anders' from (values ('anderson'),('andersen'),('andersly')) x(nm) order by nm <-> 'anders' limit 3;
nm | ?column?
----------+----------
anderson | 0.4
(1 row)
yes it is issue - but I am thinking about simple "fuzzy" searching, so
exact result is not strongly expected. On second hand if SELECT
DISTINCT * will be supported it should be nice.
Pavel
Show quoted text
-Kevin