Full text search on partial URLs
Hi,
I have Postgres full text search set up for my application and it's been
working great! However, my users would like their searches to turn up
parts of URLs. For example, they would like a search for "foobar" to
turn up a document that contains the string
"http://example.com/foobar/blah" (and similarly for queries like
"example" and "blah). With the default dictionaries for host, url, and
url_path, the search query would have to contain the complete host or
url path.
What is the best way to accomplish this? Should I be looking at
building a custom dictionary that breaks down hosts and urls or is there
something simpler I can do?
Thanks,
Zev
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Wed, Nov 6, 2013 at 10:45 AM, Zev Benjamin
<zev-pgsql@strangersgate.com>wrote:
Hi,
I have Postgres full text search set up for my application and it's been
working great! However, my users would like their searches to turn up
parts of URLs. For example, they would like a search for "foobar" to turn
up a document that contains the string "http://example.com/foobar/blah"
(and similarly for queries like "example" and "blah). With the default
dictionaries for host, url, and url_path, the search query would have to
contain the complete host or url path.What is the best way to accomplish this? Should I be looking at building
a custom dictionary that breaks down hosts and urls or is there something
simpler I can do?
Have you looked into trigrams?
http://www.postgresql.org/docs/current/static/pgtrgm.html
On 11/06/2013 01:47 PM, bricklen wrote:
On Wed, Nov 6, 2013 at 10:45 AM, Zev Benjamin
<zev-pgsql@strangersgate.com <mailto:zev-pgsql@strangersgate.com>> wrote:Hi,
I have Postgres full text search set up for my application and it's
been working great! However, my users would like their searches to
turn up parts of URLs. For example, they would like a search for
"foobar" to turn up a document that contains the string
"http://example.com/foobar/__blah <http://example.com/foobar/blah>"
(and similarly for queries like "example" and "blah). With the
default dictionaries for host, url, and url_path, the search query
would have to contain the complete host or url path.What is the best way to accomplish this? Should I be looking at
building a custom dictionary that breaks down hosts and urls or is
there something simpler I can do?Have you looked into trigrams?
http://www.postgresql.org/docs/current/static/pgtrgm.html
I've looked at it in the context of adding fuzzy search. But my
understanding is that doing a fuzzy search here would only work if the
query were a significant fraction of, say, the url path. For example, I
would expect a fuzzy search of "foobar" on "/foobar/x" to return a high
similarity, but a fuzzy search of "foobar" on
"/foobar/some/very/long/path/x" to have a low similarity.
Or are you suggesting using trigrams in a different way?
Zev
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Wed, Nov 6, 2013 at 10:53 AM, Zev Benjamin
<zev-pgsql@strangersgate.com>wrote:
On 11/06/2013 01:47 PM, bricklen wrote:
On Wed, Nov 6, 2013 at 10:45 AM, Zev Benjamin
<zev-pgsql@strangersgate.com <mailto:zev-pgsql@strangersgate.com>> wrote:Hi,
I have Postgres full text search set up for my application and it's
been working great! However, my users would like their searches to
turn up parts of URLs. For example, they would like a search for
"foobar" to turn up a document that contains the string
"http://example.com/foobar/__blah <http://example.com/foobar/blah>"
(and similarly for queries like "example" and "blah). With the
default dictionaries for host, url, and url_path, the search query
would have to contain the complete host or url path.What is the best way to accomplish this? Should I be looking at
building a custom dictionary that breaks down hosts and urls or is
there something simpler I can do?Have you looked into trigrams?
http://www.postgresql.org/docs/current/static/pgtrgm.htmlI've looked at it in the context of adding fuzzy search. But my
understanding is that doing a fuzzy search here would only work if the
query were a significant fraction of, say, the url path. For example, I
would expect a fuzzy search of "foobar" on "/foobar/x" to return a high
similarity, but a fuzzy search of "foobar" on "/foobar/some/very/long/path/x"
to have a low similarity.Or are you suggesting using trigrams in a different way?
Yeah, I was thinking more along the lines of allowing wildcard searching,
not similarity.
Eg.
CREATE INDEX yourtable_yourcol_gist_fbi ON yourtable using GIST ( yourcol
gist_trgm_ops );
select * from yourtable where yourcol ~~ '%foobar%';
On 11/06/2013 02:04 PM, bricklen wrote:
On Wed, Nov 6, 2013 at 10:53 AM, Zev Benjamin
<zev-pgsql@strangersgate.com <mailto:zev-pgsql@strangersgate.com>> wrote:On 11/06/2013 01:47 PM, bricklen wrote:
On Wed, Nov 6, 2013 at 10:45 AM, Zev Benjamin
<zev-pgsql@strangersgate.com
<mailto:zev-pgsql@strangersgate.com>
<mailto:zev-pgsql@__strangersgate.com
<mailto:zev-pgsql@strangersgate.com>>> wrote:Hi,
I have Postgres full text search set up for my application
and it's
been working great! However, my users would like their
searches to
turn up parts of URLs. For example, they would like a
search for
"foobar" to turn up a document that contains the string
"http://example.com/foobar/____blah
<http://example.com/foobar/__blah>
<http://example.com/foobar/__blah <http://example.com/foobar/blah>>"
(and similarly for queries like "example" and "blah). With the
default dictionaries for host, url, and url_path, the
search query
would have to contain the complete host or url path.What is the best way to accomplish this? Should I be
looking at
building a custom dictionary that breaks down hosts and
urls or is
there something simpler I can do?Have you looked into trigrams?
http://www.postgresql.org/__docs/current/static/pgtrgm.__html
<http://www.postgresql.org/docs/current/static/pgtrgm.html>I've looked at it in the context of adding fuzzy search. But my
understanding is that doing a fuzzy search here would only work if
the query were a significant fraction of, say, the url path. For
example, I would expect a fuzzy search of "foobar" on "/foobar/x" to
return a high similarity, but a fuzzy search of "foobar" on
"/foobar/some/very/long/path/__x" to have a low similarity.Or are you suggesting using trigrams in a different way?
Yeah, I was thinking more along the lines of allowing wildcard
searching, not similarity.Eg.
CREATE INDEX yourtable_yourcol_gist_fbi ON yourtable using GIST (
yourcol gist_trgm_ops );
select * from yourtable where yourcol ~~ '%foobar%';
Hrm. That might work. So the application-level search functionality
would be the union of tsearch and trigram wildcard matching.
If anyone else has other ideas, I'd be interested in hearing them as well.
Thanks,
Zev
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 11/15/2013 07:40 PM, Zev Benjamin wrote:
One problem that I've run into here is that I would also like to
highlight matched text in my application. For my existing search
solution, I do this with ts_headline. For partial matches, it's
unfortunately not just a matter of searching for the text and adding the
appropriate markup because my documents are HTML (the FTS lexer
helpfully pulls out all the HTML tags so it hasn't been a problem so
far) and we don't want to accidentally "highlight" some of the
attributes of the markup.One way to solve this would be if there were a way to turn a tsvector
and tsquery pair into a list of the offsets and lengths of the lexemes
that match. The highlighting could then be done at the application
level rather than the database level while still leveraging Postgres's
FTS functionality.
I've written C functions to implement this and attached them to this
email. The support files necessary for making a module are available at
https://github.com/zbenjamin/tsearch_extras. I'm new to the PostgreSQL
code base so any feedback or comments would be greatly appreciated.
Would these be appropriate to submit as patches to PostgreSQL?
Thanks,
Zev
Attachments:
tsearch_extras.ctext/x-csrc; name=tsearch_extras.cDownload
Import Notes
Reply to msg id not found: 5286BEEA.1010900@strangersgate.com
On 11/15/2013 07:40 PM, Zev Benjamin wrote:
One problem that I've run into here is that I would also like to
highlight matched text in my application. For my existing search
solution, I do this with ts_headline. For partial matches, it's
unfortunately not just a matter of searching for the text and adding the
appropriate markup because my documents are HTML (the FTS lexer
helpfully pulls out all the HTML tags so it hasn't been a problem so
far) and we don't want to accidentally "highlight" some of the
attributes of the markup.One way to solve this would be if there were a way to turn a tsvector
and tsquery pair into a list of the offsets and lengths of the lexemes
that match. The highlighting could then be done at the application
level rather than the database level while still leveraging Postgres's
FTS functionality.
I've written C functions to implement this and attached them to this
email. The support files necessary for making a module are available at
https://github.com/zbenjamin/tsearch_extras. I'm new to the PostgreSQL
code base so any feedback or comments would be greatly appreciated.
Would these be appropriate to submit as patches to PostgreSQL?
Thanks,
Zev
Attachments:
tsearch_extras.ctext/x-csrc; name=tsearch_extras.cDownload
Import Notes
Reply to msg id not found: 5286BEEA.1010900@strangersgate.com