9.6 phrase search distance specification

Started by Bruce Momjian over 9 years ago, 8 messages
#1Bruce Momjian
bruce@momjian.us

Does anyone know why the phrase distance "<3>" was changed from "at most
three tokens away" to "exactly three tokens away"? I looked at the
thread at:

/messages/by-id/33828354.WrrSMviC7Y@abook

and didn't see the answer. I assume if you are looking for "<3>" you
would want "<2>" matches and "<1>" matches as well.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#1)
Re: 9.6 phrase search distance specification

Bruce Momjian <bruce@momjian.us> writes:

Does anyone know why the phrase distance "<3>" was changed from "at most
three tokens away" to "exactly three tokens away"?

So that it would correctly support phraseto_tsquery's use of the operator
to represent omitted words (stopwords) in a phrase.

I think there's probably some use in also providing an operator that does
"at most this many tokens away", but Oleg/Teodor were evidently less
excited, because they didn't take the time to do it.

The thread where this change was discussed is

/messages/by-id/c19fcfec308e6ccd952cdde9e648b505@mail.gmail.com

see particularly

/messages/by-id/11252.1465422251@sss.pgh.pa.us

regards, tom lane
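
To illustrate the distinction discussed above, here is a minimal sketch in Python. It is a toy model over token positions, not PostgreSQL's implementation; the function names and sample documents are hypothetical. It assumes a stopword is dropped from the query but still occupies a position in the document, which is why an exact distance is needed to represent the omitted word.

```python
# Toy model of phrase-distance matching over token positions.
# Not PostgreSQL code: all names and sample data here are hypothetical.

def positions(tokens, word):
    """Return the 1-based positions at which `word` occurs in `tokens`."""
    return [i for i, tok in enumerate(tokens, start=1) if tok == word]

def match_exact(tokens, left, right, distance):
    """True if `right` occurs exactly `distance` positions after `left`."""
    right_pos = set(positions(tokens, right))
    return any(p + distance in right_pos for p in positions(tokens, left))

def match_at_most(tokens, left, right, distance):
    """True if `right` occurs between 1 and `distance` positions after `left`."""
    return any(match_exact(tokens, left, right, d)
               for d in range(1, distance + 1))

# In "cat ate the rat" the stopword "the" sits between "ate" and "rat",
# so the two remaining words are exactly 2 positions apart.
doc_with_stopword = ["cat", "ate", "the", "rat"]
doc_without = ["cat", "ate", "rat"]

print(match_exact(doc_with_stopword, "ate", "rat", 2))  # True
print(match_exact(doc_without, "ate", "rat", 2))        # False
print(match_at_most(doc_without, "ate", "rat", 2))      # True
```

Under the "exactly" reading, the phrase with the omitted stopword matches only documents that have one word in the gap; under the "at most" reading it would also match documents with no word there.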


#3Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#2)
Re: 9.6 phrase search distance specification

On Tue, Aug 9, 2016 at 01:58:25PM -0400, Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Does anyone know why the phrase distance "<3>" was changed from "at most
three tokens away" to "exactly three tokens away"?

So that it would correctly support phraseto_tsquery's use of the operator
to represent omitted words (stopwords) in a phrase.

I think there's probably some use in also providing an operator that does
"at most this many tokens away", but Oleg/Teodor were evidently less
excited, because they didn't take the time to do it.

The thread where this change was discussed is

/messages/by-id/c19fcfec308e6ccd952cdde9e648b505@mail.gmail.com

see particularly

/messages/by-id/11252.1465422251@sss.pgh.pa.us

Ah, I know it was discussed somewhere. Thanks, the phraseto_tsquery
tie-in was what I forgot.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +


#4Ryan Pedela
rpedela@datalanche.com
In reply to: Tom Lane (#2)
Re: 9.6 phrase search distance specification

Thanks,

Ryan Pedela
Datalanche CEO, founder
www.datalanche.com

On Tue, Aug 9, 2016 at 11:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Bruce Momjian <bruce@momjian.us> writes:

Does anyone know why the phrase distance "<3>" was changed from "at most
three tokens away" to "exactly three tokens away"?

So that it would correctly support phraseto_tsquery's use of the operator
to represent omitted words (stopwords) in a phrase.

I think there's probably some use in also providing an operator that does
"at most this many tokens away", but Oleg/Teodor were evidently less
excited, because they didn't take the time to do it.

The thread where this change was discussed is

/messages/by-id/c19fcfec308e6ccd952cdde9e648b505@mail.gmail.com

see particularly

/messages/by-id/11252.1465422251@sss.pgh.pa.us

I would say that it is worth it to have a "phrase slop" operator (Apache
Lucene terminology). Proximity search is extremely useful for improving
relevance and phrase slop is one of the tools to achieve that.

#5Ryan Pedela
rpedela@datalanche.com
In reply to: Ryan Pedela (#4)
Re: 9.6 phrase search distance specification

On Tue, Aug 9, 2016 at 12:59 PM, Ryan Pedela <rpedela@datalanche.com> wrote:

Thanks,

Ryan Pedela
Datalanche CEO, founder
www.datalanche.com

On Tue, Aug 9, 2016 at 11:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Bruce Momjian <bruce@momjian.us> writes:

Does anyone know why the phrase distance "<3>" was changed from "at most
three tokens away" to "exactly three tokens away"?

So that it would correctly support phraseto_tsquery's use of the operator
to represent omitted words (stopwords) in a phrase.

I think there's probably some use in also providing an operator that does
"at most this many tokens away", but Oleg/Teodor were evidently less
excited, because they didn't take the time to do it.

The thread where this change was discussed is

/messages/by-id/c19fcfec308e6ccd952cdde9e648b505@mail.gmail.com

see particularly

/messages/by-id/11252.1465422251@sss.pgh.pa.us

I would say that it is worth it to have a "phrase slop" operator (Apache
Lucene terminology). Proximity search is extremely useful for improving
relevance and phrase slop is one of the tools to achieve that.

Sorry for the position of my signature....

Ryan

#6Oleg Bartunov
obartunov@gmail.com
In reply to: Ryan Pedela (#4)
Re: 9.6 phrase search distance specification

On Tue, Aug 9, 2016 at 9:59 PM, Ryan Pedela <rpedela@datalanche.com> wrote:

I would say that it is worth it to have a "phrase slop" operator (Apache
Lucene terminology). Proximity search is extremely useful for improving
relevance and phrase slop is one of the tools to achieve that.

It'd be great if you explain what is "phrase slop". I assume it's not
about search, but about relevance.


#7Ryan Pedela
rpedela@datalanche.com
In reply to: Oleg Bartunov (#6)
Re: 9.6 phrase search distance specification

On Thu, Aug 11, 2016 at 9:27 AM, Oleg Bartunov <obartunov@gmail.com> wrote:

On Tue, Aug 9, 2016 at 9:59 PM, Ryan Pedela <rpedela@datalanche.com>
wrote:

I would say that it is worth it to have a "phrase slop" operator (Apache
Lucene terminology). Proximity search is extremely useful for improving
relevance and phrase slop is one of the tools to achieve that.

It'd be great if you explain what is "phrase slop". I assume it's not
about search, but about relevance.

Sure. An exact phrase query has slop = 0 which means find all terms in the
exact positions relative to each other. Phrase query with slop > 0 means
find all terms within <slop> positions relative to each other. If slop =
10, find all terms within 10 positions of each other. Here is a concrete
example from my current work searching SEC filings.

Bill Gates' full legal name is William H. Gates, III. In the SEC database
[1], his name is GATES WILLIAM H III. If you are searching the records of
people within the SEC database and you want to find Bill Gates, most users
will type "bill gates". Since there are many people with the first name
Bill (William) and the last name Gates, Bill Gates most likely won't be the
first result with a standard keyword query. Likewise an exact phrase query
(slop = 0) will not find him either because the first and last names are
transposed. What you need is a phrase query with a slop = 2 which will
match "William Gates", "William H Gates", "Gates William", etc. There is
still the issue of Bill vs William, but that can be solved with synonyms
and is a different topic.

1. https://www.sec.gov/cgi-bin/browse-edgar?CIK=902012&owner=exclude&action=getcompany&Find=Search

Thanks,
Ryan
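
The proximity idea described above can be sketched in Python as follows. This is a simplified, order-insensitive "within N positions" check, not Lucene's actual slop computation (Lucene defines slop as an edit distance over term positions, which this sketch does not reproduce). The function names and the normalization step are assumptions made for illustration.

```python
# Simplified proximity match: two terms within `max_gap` positions of each
# other, in either order. Hypothetical helper names; not Lucene or PostgreSQL.

def normalize(text):
    """Lowercase, strip periods and commas, and split on whitespace."""
    return text.lower().replace(".", "").replace(",", "").split()

def within(tokens, a, b, max_gap):
    """True if `a` and `b` occur at most `max_gap` positions apart."""
    pos_a = [i for i, tok in enumerate(tokens) if tok == a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == b]
    return any(0 < abs(i - j) <= max_gap for i in pos_a for j in pos_b)

names = [
    "GATES WILLIAM H III",        # transposed, as stored in the SEC database
    "William H. Gates, III",      # middle initial between the two terms
    "William Henry James Gates",  # made-up name: terms are too far apart
]

for name in names:
    print(name, "->", within(normalize(name), "william", "gates", 2))
# GATES WILLIAM H III -> True
# William H. Gates, III -> True
# William Henry James Gates -> False
```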

#8Ryan Pedela
rpedela@datalanche.com
In reply to: Ryan Pedela (#7)
Re: 9.6 phrase search distance specification

On Thu, Aug 11, 2016 at 10:42 AM, Ryan Pedela <rpedela@datalanche.com>
wrote:

On Thu, Aug 11, 2016 at 9:27 AM, Oleg Bartunov <obartunov@gmail.com>
wrote:

On Tue, Aug 9, 2016 at 9:59 PM, Ryan Pedela <rpedela@datalanche.com>
wrote:

I would say that it is worth it to have a "phrase slop" operator (Apache
Lucene terminology). Proximity search is extremely useful for improving
relevance and phrase slop is one of the tools to achieve that.

It'd be great if you explain what is "phrase slop". I assume it's not
about search, but about relevance.

Sure. An exact phrase query has slop = 0 which means find all terms in the
exact positions relative to each other. Phrase query with slop > 0 means
find all terms within <slop> positions relative to each other. If slop =
10, find all terms within 10 positions of each other. Here is a concrete
example from my current work searching SEC filings.

Bill Gates' full legal name is William H. Gates, III. In the SEC database
[1], his name is GATES WILLIAM H III. If you are searching the records of
people within the SEC database and you want to find Bill Gates, most users
will type "bill gates". Since there are many people with the first name
Bill (William) and the last name Gates, Bill Gates most likely won't be the
first result with a standard keyword query. Likewise an exact phrase query
(slop = 0) will not find him either because the first and last names are
transposed. What you need is a phrase query with a slop = 2 which will
match "William Gates", "William H Gates", "Gates William", etc. There is
still the issue of Bill vs William, but that can be solved with synonyms
and is a different topic.

1. https://www.sec.gov/cgi-bin/browse-edgar?CIK=902012&owner=exclude&action=getcompany&Find=Search

One more thing. In that trivial example, an AND query would probably do a
great job too. However if you are searching for Bill Gates in large text
documents rather than a list of names, an AND query will not give you very
good results because the words "bill" and "gates" are so common.
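
As a rough illustration of that last point, the following Python sketch contrasts an AND match with a proximity match on longer text. The sample sentences are invented and the helper names are hypothetical; it only shows why co-occurrence alone can be a weak signal when both words are common.

```python
# Contrast between "both words appear somewhere" and "both words appear
# close together". Invented sample text; hypothetical helper names.

def and_match(tokens, terms):
    """True if every term occurs anywhere in the document."""
    return all(term in tokens for term in terms)

def near_match(tokens, a, b, max_gap):
    """True if `a` and `b` occur at most `max_gap` positions apart."""
    pos_a = [i for i, tok in enumerate(tokens) if tok == a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == b]
    return any(0 < abs(i - j) <= max_gap for i in pos_a for j in pos_b)

unrelated = ("the bill passed after the committee opened "
             "the gates to public comment").split()
relevant = "a letter signed by bill gates was filed".split()

print(and_match(unrelated, ["bill", "gates"]))     # True
print(near_match(unrelated, "bill", "gates", 2))   # False
print(and_match(relevant, ["bill", "gates"]))      # True
print(near_match(relevant, "bill", "gates", 2))    # True
```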