ToDo: KNN Search should to support DISTINCT clasuse?

Started by Pavel Stehuleover 13 years ago13 messageshackers

pavel.stehule@gmail.com

over 13 years ago

Hello

I should to search distinct values based on similarity

postgres=# explain select nazobce, nazobce <-> 'Benešov' from obce
order by nazobce <-> 'Benešov' limit 10
;
QUERY PLAN
-------------------------------------------------------------------------------------------
Limit (cost=0.00..0.86 rows=10 width=10)
-> Index Scan using obce_nazobce_idx on obce (cost=0.00..1433.14
rows=16644 width=10)
Order By: (nazobce <-> 'Benešov'::text)
(3 rows)

Time: 0.576 ms

but using DISTINCT breaks KNN searching optimization

postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from
obce order by nazobce <-> 'Benešov' limit 10
;
QUERY PLAN
-----------------------------------------------------------------------------
Limit (cost=600.45..600.47 rows=10 width=10)
-> Sort (cost=600.45..613.80 rows=5341 width=10)
Sort Key: ((nazobce <-> 'Benešov'::text))
-> HashAggregate (cost=418.27..485.03 rows=5341 width=10)
-> Seq Scan on obce (cost=0.00..335.05 rows=16644 width=10)
(5 rows)

Regards

Pavel Stehule

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Pavel Stehule (#1)

Re: ToDo: KNN Search should to support DISTINCT clasuse?

Pavel Stehule <pavel.stehule@gmail.com> writes:

but using DISTINCT breaks KNN searching optimization

postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from
obce order by nazobce <-> 'Benešov' limit 10

Don't hold your breath. There are two ways the system could implement
the DISTINCT clause: either sort and uniq, or hashaggregate.
hashaggregate will destroy any input ordering, so there's no value in
using the index as input. sort and uniq requires the input to be sorted
by *all* the columns being distinct'ed, not just one, so again this
index isn't useful. You could get a plan using the index if you only
wanted the <-> output column, eg

contrib_regression=# explain select distinct t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (cost=0.00..0.87 rows=10 width=12)
-> Unique (cost=0.00..86.75 rows=1000 width=12)
-> Index Scan using ti on test_trgm (cost=0.00..84.25 rows=1000 width=12)
Order By: (t <-> 'foo'::text)
(4 rows)

Perhaps it would be close enough to what you want to use DISTINCT ON:

contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (cost=0.00..0.87 rows=10 width=12)
-> Unique (cost=0.00..86.75 rows=1000 width=12)
-> Index Scan using ti on test_trgm (cost=0.00..84.25 rows=1000 width=12)
Order By: (t <-> 'foo'::text)
(4 rows)

regards, tom lane

pavel.stehule@gmail.com

over 13 years ago

In reply to: Tom Lane (#2)

Re: ToDo: KNN Search should to support DISTINCT clasuse?

2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:

Pavel Stehule <pavel.stehule@gmail.com> writes:

but using DISTINCT breaks KNN searching optimization

postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from
obce order by nazobce <-> 'Benešov' limit 10

Don't hold your breath. There are two ways the system could implement
the DISTINCT clause: either sort and uniq, or hashaggregate.
hashaggregate will destroy any input ordering, so there's no value in
using the index as input. sort and uniq requires the input to be sorted
by *all* the columns being distinct'ed, not just one, so again this
index isn't useful. You could get a plan using the index if you only
wanted the <-> output column, eg

contrib_regression=# explain select distinct t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (cost=0.00..0.87 rows=10 width=12)
-> Unique (cost=0.00..86.75 rows=1000 width=12)
-> Index Scan using ti on test_trgm (cost=0.00..84.25 rows=1000 width=12)
Order By: (t <-> 'foo'::text)
(4 rows)

Perhaps it would be close enough to what you want to use DISTINCT ON:

contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (cost=0.00..0.87 rows=10 width=12)
-> Unique (cost=0.00..86.75 rows=1000 width=12)
-> Index Scan using ti on test_trgm (cost=0.00..84.25 rows=1000 width=12)
Order By: (t <-> 'foo'::text)
(4 rows)

regards, tom lane

good tip - it's working

thank you

Regards

Pavel

robertmhaas@gmail.com

over 13 years ago

In reply to: Tom Lane (#2)

Re: ToDo: KNN Search should to support DISTINCT clasuse?

On Mon, Oct 22, 2012 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Pavel Stehule <pavel.stehule@gmail.com> writes:

but using DISTINCT breaks KNN searching optimization

postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from
obce order by nazobce <-> 'Benešov' limit 10

Don't hold your breath. There are two ways the system could implement
the DISTINCT clause: either sort and uniq, or hashaggregate.
hashaggregate will destroy any input ordering, so there's no value in
using the index as input.

Isn't that an implementation limitation though, rather than a
fundamental limitation?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Robert Haas (#4)

Re: ToDo: KNN Search should to support DISTINCT clasuse?

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Oct 22, 2012 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Don't hold your breath. There are two ways the system could implement
the DISTINCT clause: either sort and uniq, or hashaggregate.
hashaggregate will destroy any input ordering, so there's no value in
using the index as input.

Isn't that an implementation limitation though, rather than a
fundamental limitation?

Perhaps, but it's not a simple one to surmount, and I'm dubious about
putting the amount of work that'd be required into such a corner case.

regards, tom lane

pavel.stehule@gmail.com

over 13 years ago

In reply to: Tom Lane (#5)

Re: ToDo: KNN Search should to support DISTINCT clasuse?

2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Oct 22, 2012 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Don't hold your breath. There are two ways the system could implement
the DISTINCT clause: either sort and uniq, or hashaggregate.
hashaggregate will destroy any input ordering, so there's no value in
using the index as input.

Isn't that an implementation limitation though, rather than a
fundamental limitation?

Perhaps, but it's not a simple one to surmount, and I'm dubious about
putting the amount of work that'd be required into such a corner case.

I don't think so this use case is too special - but workaround working well

Regards

Pavel

Show quoted text

regards, tom lane

bruce@momjian.us

over 13 years ago

In reply to: Tom Lane (#2)

Re: ToDo: KNN Search should to support DISTINCT clasuse?

On Mon, Oct 22, 2012 at 4:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Don't hold your breath. There are two ways the system could implement
the DISTINCT clause: either sort and uniq, or hashaggregate.
hashaggregate will destroy any input ordering, so there's no value in
using the index as input. sort and uniq requires the input to be sorted
by *all* the columns being distinct'ed, not just one, so again this
index isn't useful.

We already have some bits that understand functional dependencies for
distinct/group by don't we? Do we detect that col <-> 'foo' is
functionally dependent on col? If so is it possible to construct a
multicolumn index that can produce an ordering like [col <-> 'foo',
col] which could be used to get distinct values of col in the knn
order (since the first column is functionally dependent on the second
it can be ignored for grouping purposes).

Not that we can do this now but I wonder whether a lot of the pieces
are already there and just need to be hooked up together.

--
greg

robertmhaas@gmail.com

over 13 years ago

In reply to: Bruce Momjian (#7)

Re: ToDo: KNN Search should to support DISTINCT clasuse?

On Mon, Oct 22, 2012 at 7:09 PM, Greg Stark <stark@mit.edu> wrote:

On Mon, Oct 22, 2012 at 4:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Don't hold your breath. There are two ways the system could implement
the DISTINCT clause: either sort and uniq, or hashaggregate.
hashaggregate will destroy any input ordering, so there's no value in
using the index as input. sort and uniq requires the input to be sorted
by *all* the columns being distinct'ed, not just one, so again this
index isn't useful.

We already have some bits that understand functional dependencies for
distinct/group by don't we? Do we detect that col <-> 'foo' is
functionally dependent on col? If so is it possible to construct a
multicolumn index that can produce an ordering like [col <-> 'foo',
col] which could be used to get distinct values of col in the knn
order (since the first column is functionally dependent on the second
it can be ignored for grouping purposes).

Not that we can do this now but I wonder whether a lot of the pieces
are already there and just need to be hooked up together.

I think the real issue is that a hash aggregate doesn't emit any rows
until the entire input is read, so it doesn't play nice with LIMIT.
In general there's no other possible strategy, because you could get
another row belonging to an existing group at any time right up
through the end of the input. However, when the HashAgg node is only
implementing DISTINCT (ON), you can emit each new row as soon as you
see it, and just make the hash table entry to be certain you don't
emit it again. I think someone recently submitted a patch along these
lines and we rejected it because the use case was too thin ... but
this example is making me think that the use case might not be all
that thin after all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

ants.aasma@cybertec.at

over 13 years ago

In reply to: Robert Haas (#8)

Re: ToDo: KNN Search should to support DISTINCT clasuse?

On Tue, Oct 23, 2012 at 7:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:

through the end of the input. However, when the HashAgg node is only
implementing DISTINCT (ON), you can emit each new row as soon as you
see it, and just make the hash table entry to be certain you don't
emit it again. I think someone recently submitted a patch along these
lines and we rejected it because the use case was too thin ... but
this example is making me think that the use case might not be all
that thin after all.

If anyone wants to play around, the initial patch is available here:

http://archives.postgresql.org/message-id/CA%2BCSw_uE-RCyQd_bXJNe%3DusrXkq%2BkeFrQrahkc%2B8ou%2BWs4Y%3DVw%40mail.gmail.com

It looks to me like it should work for the distinct on KNN search case
out of the box. However it needs some planner adjustment for more
complex plans and maybe some refactoring to lose duplication in the
executor to be worth considering committing.

Regards,
Ants Aasma

Kevin.Grittner@wicourts.gov

over 13 years ago

In reply to: Ants Aasma (#9)

Re: ToDo: KNN Search should to support DISTINCT clasuse?

Pavel Stehule wrote:

2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:

Perhaps it would be close enough to what you want to use DISTINCT ON:

contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;

good tip - it's working

If two or more values happen to be at exactly the same distance,
wouldn't you just get one of them?

-Kevin

Import Notes

Resolved by subject fallback

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Kevin Grittner (#10)

Re: ToDo: KNN Search should to support DISTINCT clasuse?

"Kevin Grittner" <kgrittn@mail.com> writes:

Pavel Stehule wrote:

2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:

Perhaps it would be close enough to what you want to use DISTINCT ON:
contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;

good tip - it's working

If two or more values happen to be at exactly the same distance,
wouldn't you just get one of them?

Yeah, that is a hazard. I'm not sure whether <->'s results are
sufficiently quantized to make that a big problem in practice.

regards, tom lane

Kevin.Grittner@wicourts.gov

over 13 years ago

In reply to: Tom Lane (#11)

Re: ToDo: KNN Search should to support DISTINCT clasuse?

Tom Lane wrote:

"Kevin Grittner" <kgrittn@mail.com> writes:

Pavel Stehule wrote:

2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:

Perhaps it would be close enough to what you want to use DISTINCT ON:
contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;

good tip - it's working

If two or more values happen to be at exactly the same distance,
wouldn't you just get one of them?

Yeah, that is a hazard. I'm not sure whether <->'s results are
sufficiently quantized to make that a big problem in practice.

It doesn't seem too far-fetched for trigram queries:

test=# select nm, nm <-> 'anders' from (values ('anderson'),('andersen'),('andersly')) x(nm);
nm | ?column?
----------+----------
anderson | 0.4
andersen | 0.4
andersly | 0.4
(3 rows)

test=# select distinct on (nm <-> 'anders') nm, nm <-> 'anders' from (values ('anderson'),('andersen'),('andersly')) x(nm) order by nm <-> 'anders' limit 3;
nm | ?column?
----------+----------
anderson | 0.4
(1 row)

-Kevin

Import Notes

Resolved by subject fallback

pavel.stehule@gmail.com

over 13 years ago

In reply to: Kevin Grittner (#12)

Re: ToDo: KNN Search should to support DISTINCT clasuse?

2012/10/25 Kevin Grittner <kgrittn@mail.com>:

Tom Lane wrote:

"Kevin Grittner" <kgrittn@mail.com> writes:

Pavel Stehule wrote:

2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:

Perhaps it would be close enough to what you want to use DISTINCT ON:
contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;

good tip - it's working

If two or more values happen to be at exactly the same distance,
wouldn't you just get one of them?

Yeah, that is a hazard. I'm not sure whether <->'s results are
sufficiently quantized to make that a big problem in practice.

It doesn't seem too far-fetched for trigram queries:

test=# select nm, nm <-> 'anders' from (values ('anderson'),('andersen'),('andersly')) x(nm);
nm | ?column?
----------+----------
anderson | 0.4
andersen | 0.4
andersly | 0.4
(3 rows)

test=# select distinct on (nm <-> 'anders') nm, nm <-> 'anders' from (values ('anderson'),('andersen'),('andersly')) x(nm) order by nm <-> 'anders' limit 3;
nm | ?column?
----------+----------
anderson | 0.4
(1 row)

yes it is issue - but I am thinking about simple "fuzzy" searching, so
exact result is not strongly expected. On second hand if SELECT
DISTINCT * will be supported it should be nice.

Pavel

Show quoted text

-Kevin