Simplifying Text Search
Something Tom Dunstan just mentioned has made me ask the question "Why
does our full text search feature look so strange?". It's the
operator-laden syntax that causes the problem.
By any stretch, this query is difficult for most people to understand:
SELECT * FROM text_table
WHERE to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
Wouldn't it be much simpler to just have a function, so we can write
this query like this?
SELECT * FROM text_table
WHERE text_search('haystack needle haystack', 'needle');
We then explain to people that while the above is nice, it will presume
that both the function inputs are Text, which isn't any good for complex
searches, indexing and dictionaries etc.., so then we move to:
SELECT * FROM text_table
WHERE text_search('haystack needle haystack'::tsvector,
'needle'::tsquery);
or perhaps
SELECT * FROM text_table
WHERE full_text_search('haystack needle haystack', 'needle & hay');
which would automatically do the conversions to tsvector and tsquery for
us. No more tedious casting, easy to read.
[*text_search() functions would return bool]
So we end up with a normal sounding function that is overloaded to
provide all of the various goodies. We can include the text_search(text,
text) version of the function in the normal chapter on functions, with a
pointer to the more complex stuff elsewhere.
Sound good?
We can then explain everything without having to use @@ operators. They
can then be introduced as an option.
The side benefit of this is that we can then allow our wonderful new
functionality to be more easily usable by things like Hibernate. We just
tell them we have this new function and thats all they need to know.
I know that under the covers the @@ operator is necessary because we
hang various pieces of optimizer information from it. Each function
signature gets an operator with matching signature, so there's a 1:1
correspondence in most use cases. So to make this all hang together,
there'd need to be a some smarts that says: if there is only one
operator on a function then use the operator's optimizer information
when you see just the function. That information can be assessed at DDL
time, so we can keep accurate track of operator counts in pgproc.
An alternative approach might be to make the first operator created on a
function the "primary" operator. All other operators would then be
secondary operators, so that adding operators would not change the
inference mechanism.
I've not got sufficient knowledge to say how hard the
function-to-operator inference is, but it would be dang useful in making
text search and many other programs readable and easy to interface to.
In the end that is going to mean wider usage of that functionality, with
more people feeling like they can dip their toes into the water.
I must confess I have insufficient time to do this myself right now, not
least me discovering exactly how. I'm spending time on this now because
I'm the one that has to explain this stuff to people and things like
this can make a huge difference in their understanding and eventual
uptake.
Thoughts?
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
On Mon, Nov 12, 2007 at 03:00:36PM +0000, Simon Riggs wrote:
Something Tom Dunstan just mentioned has made me ask the question "Why
does our full text search feature look so strange?". It's the
operator-laden syntax that causes the problem.By any stretch, this query is difficult for most people to understand:
SELECT * FROM text_table
WHERE to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');Wouldn't it be much simpler to just have a function, so we can write
this query like this?SELECT * FROM text_table
WHERE text_search('haystack needle haystack', 'needle');
Can't you do this with an SQL function that gets expanded inline?
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy
On Mon, 2007-11-12 at 16:28 +0100, Martijn van Oosterhout wrote:
On Mon, Nov 12, 2007 at 03:00:36PM +0000, Simon Riggs wrote:
Something Tom Dunstan just mentioned has made me ask the question "Why
does our full text search feature look so strange?". It's the
operator-laden syntax that causes the problem.By any stretch, this query is difficult for most people to understand:
SELECT * FROM text_table
WHERE to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');Wouldn't it be much simpler to just have a function, so we can write
this query like this?SELECT * FROM text_table
WHERE text_search('haystack needle haystack', 'needle');Can't you do this with an SQL function that gets expanded inline?
Yep, we can. Good thinking. So the change is fairly trivial.
What do you think of the proposal to make text search work this way and
to document this more easily readable form?
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
On Mon, Nov 12, 2007 at 03:48:20PM +0000, Simon Riggs wrote:
On Mon, 2007-11-12 at 16:28 +0100, Martijn van Oosterhout wrote:
On Mon, Nov 12, 2007 at 03:00:36PM +0000, Simon Riggs wrote:
Something Tom Dunstan just mentioned has made me ask the question "Why
does our full text search feature look so strange?". It's the
operator-laden syntax that causes the problem.By any stretch, this query is difficult for most people to understand:
SELECT * FROM text_table
WHERE to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');Wouldn't it be much simpler to just have a function, so we can write
this query like this?SELECT * FROM text_table
WHERE text_search('haystack needle haystack', 'needle');Can't you do this with an SQL function that gets expanded inline?
Yep, we can. Good thinking. So the change is fairly trivial.
What do you think of the proposal to make text search work this way
and to document this more easily readable form?
+1 for adding this.
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Simon Riggs wrote:
Something Tom Dunstan just mentioned has made me ask the question "Why
does our full text search feature look so strange?". It's the
operator-laden syntax that causes the problem.By any stretch, this query is difficult for most people to understand:
SELECT * FROM text_table
WHERE to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');Wouldn't it be much simpler to just have a function, so we can write
this query like this?SELECT * FROM text_table
WHERE text_search('haystack needle haystack', 'needle');We then explain to people that while the above is nice, it will presume
that both the function inputs are Text, which isn't any good for complex
searches, indexing and dictionaries etc.., so then we move to:SELECT * FROM text_table
WHERE text_search('haystack needle haystack'::tsvector,
'needle'::tsquery);or perhaps
SELECT * FROM text_table
WHERE full_text_search('haystack needle haystack', 'needle & hay');which would automatically do the conversions to tsvector and tsquery for
us. No more tedious casting, easy to read.
There's a text @@ text operator, so you can write just:
SELECT * FROM tstable where data @@ 'needle';
No need to cast.
Unfortunately, that form can't use a GIN index, I think. But that's
another issue, which I don't think your proposal would fix...
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes:
So we end up with a normal sounding function that is overloaded to
provide all of the various goodies.
As best I can tell, @@ does exactly this already. This is just a
different spelling of the same capability, and I don't actually
find it better. Why is "text_search(x,y)" better than "x @@ y"?
We don't recommend that people write "texteq(x,y)" instead of
"x = y".
Sound good?
It's not an improvement, it's not compatible with what existing tsearch2
users are accustomed to, and it's several months too late...
regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
Simon Riggs <simon@2ndquadrant.com> writes:
So we end up with a normal sounding function that is overloaded to
provide all of the various goodies.As best I can tell, @@ does exactly this already. This is just a
different spelling of the same capability, and I don't actually
find it better. Why is "text_search(x,y)" better than "x @@ y"?
We don't recommend that people write "texteq(x,y)" instead of
"x = y".
I agree, I find it odd to suggest that a function would be more natural than
an operator. The main reason the non-core version of tsearch felt so much like
an add-on was precisely that it had to use functions to interface with
objects. That Postgres supports creating new operators is a strength which
allows a lot more extensibility.
And yet I agree that there's something awkward about the tsearch syntax. I'm
not sure where the core of it comes from though, but I don't think it comes
from the use of operators.
Part of it is that "@@" isn't a familiar operator. I'm not even sure what to
read it as. "Matches"? "Satisfies"?
Perhaps we should think (at some point in the future) about some way of
allowing alphabetic characters in operator names. Then you could write
something like:
col ~satisfies~ '1 & 2'
(That exact syntax wouldn't work without removing ~ from the characters in
normal operators so something with more finesse would be needed.)
The other part of tsearch that seems somewhat awkward is just the very concept
and syntax of tsqueries. But that seems pretty integral to the functionality
and I don't see any way to avoid it. It's not entirely unlike the idea of
regexps which I'm sure would seem unnatural if we were just meeting them with
no background.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
On Mon, 2007-11-12 at 11:56 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
So we end up with a normal sounding function that is overloaded to
provide all of the various goodies.As best I can tell, @@ does exactly this already. This is just a
different spelling of the same capability, and I don't actually
find it better. Why is "text_search(x,y)" better than "x @@ y"?
We don't recommend that people write "texteq(x,y)" instead of
"x = y".
Most people don't understand those differences. x = y means "make sure
they are the same" to most people. They don't see what you (and I) see:
function and operator interchangeability. So text_search() is better
than @@ and = is better than texteq(). Life ain't neat...
Right now, Full Text Search SQL looks like complete gibberish and it
dissuades many people from using what is an awesome set of features. I
just want to add a little sugar to help people get started.
Sound good?
It's not an improvement
That is the very point of debate
it's not compatible with what existing tsearch2
users are accustomed to
@@ would still exist, so no problems. These additions are for new users,
not old ones.
it's several months too late...
True. I wish I'd thought of it before. I've waded through the syntax
without thinking how to make it more easily readable and explainable.
Damn.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote:
On Mon, 2007-11-12 at 11:56 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
So we end up with a normal sounding function that is overloaded to
provide all of the various goodies.As best I can tell, @@ does exactly this already. This is just a
different spelling of the same capability, and I don't actually
find it better. Why is "text_search(x,y)" better than "x @@ y"?
We don't recommend that people write "texteq(x,y)" instead of
"x = y".Most people don't understand those differences. x = y means "make sure
they are the same" to most people. They don't see what you (and I) see:
function and operator interchangeability. So text_search() is better
than @@ and = is better than texteq(). Life ain't neat...Right now, Full Text Search SQL looks like complete gibberish and it
dissuades many people from using what is an awesome set of features. I
just want to add a little sugar to help people get started.
Granted, @@ is a bit awkward until you get used to it. "x LIKE y" would
read out better, but unfortunately that's already taken ;-).
In any case, it's way too late.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, Nov 12, 2007 at 08:09:48PM +0000, Simon Riggs wrote:
@@ would still exist, so no problems. These additions are for new users,
not old ones.
Given that this is all sugar on top of tsearch anyway, why not put it in
pgfoundry as the tsearch_sugar project? Then packagers could include a
standard set of such sugar if they wanted.
A
--
Andrew Sullivan
Old sigs will return after re-constitution of blue smoke
Heikki Linnakangas wrote:
Granted, @@ is a bit awkward until you get used to it. "x LIKE y" would
read out better, but unfortunately that's already taken ;-).
Actually LIKE does not make much sense when you have 'hay & needle'.
Probably MATCHES would be a better term ... but then, MySQL defines a
strange thing called MATCH/AGAINST; so apparently you use "MATCH (column
list) AGAINST (pattern spec)"
None of this is standard though ...
--
Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4
"Everybody understands Mickey Mouse. Few understand Hermann Hesse.
Hardly anybody understands Einstein. And nobody understands Emperor Norton."
* Heikki Linnakangas <heikki@enterprisedb.com> [071112 15:18]:
Simon Riggs wrote:
Right now, Full Text Search SQL looks like complete gibberish and it
dissuades many people from using what is an awesome set of features. I
just want to add a little sugar to help people get started.
Granted, @@ is a bit awkward until you get used to it. "x LIKE y" would
read out better, but unfortunately that's already taken ;-).
Can LIKE be easily overloaded in the parser? So:
text LIKE text
works in it's current form, and
tsvector LIKE tsquery
also works like the @@? Or have I gotten all the ts* types all mixed up
again...
But it doesn't buy anything except avoiding the "@@" that people seem to not
grok easily, and it might actually cause more grief, because of people
confusing the 2 forms of LIKE.
a.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
On Mon, 2007-11-12 at 20:17 +0000, Heikki Linnakangas wrote:
Granted, @@ is a bit awkward until you get used to it. "x LIKE y" would
read out better, but unfortunately that's already taken ;-).
Remember, I'm not suggesting we get rid of @@
In any case, it's way too late.
I'm suggesting we add a couple of simple SQL functions that will help
text search docs be more easily understood.
It's beta and its valid to respond to usability issues just as we would
respond to code bugs. Otherwise, why have beta? Late, but not too late.
SQLServer, Oracle and MySQL all use functions, not operators. My
observation would be that we have the hardest and most difficult to
understand full text search capability. The Contains() function seems
like a better name than I gave earlier also.
I love what we've done; I just want more people be able to use it.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Hello
look to standard, please. SQL/MM has part - full text.
SELECT docno
FROM information
WHERE document.CONTAINS
('STEMMED FORM OF "standard"
IN SAME PARAGRAPH AS
SOUNDS LIKE "sequel"') = 1
it's little bit baroque, It's sample of method.
So,it can be:
SELECT ..
FROM x.contains(y);
It's well readable and elegant too.
Regards
Pavel Stehule
Simon Riggs wrote:
On Mon, 2007-11-12 at 20:17 +0000, Heikki Linnakangas wrote:
Granted, @@ is a bit awkward until you get used to it. "x LIKE y" would
read out better, but unfortunately that's already taken ;-).Remember, I'm not suggesting we get rid of @@
In any case, it's way too late.
I'm suggesting we add a couple of simple SQL functions that will help
text search docs be more easily understood.It's beta and its valid to respond to usability issues just as we would
respond to code bugs. Otherwise, why have beta? Late, but not too late.SQLServer, Oracle and MySQL all use functions, not operators. My
observation would be that we have the hardest and most difficult to
understand full text search capability. The Contains() function seems
like a better name than I gave earlier also.I love what we've done; I just want more people be able to use it.
Hmmm, my choices are:
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat &
rat'::tsquery;
Or:
SELECT ts_match('a fat cat sat on a mat and ate a fat rat','cat & rat');
This seems a little too much like the "duh" department to ignore. A set
of SQL functions would certainly be appropriate here.
Sincerely,
Joshua D. Drake
On Mon, Nov 12, 2007 at 03:44:18PM -0500, Aidan Van Dyk wrote:
Can LIKE be easily overloaded in the parser? So:
text LIKE text
works in it's current form, and
tsvector LIKE tsquery
also works like the @@? Or have I gotten all the ts* types all mixed up
again...
AIUI LIKE is mashed into an operator at parse time, so yes, if you
create the operator with the right name it will just work.
Or not (I havn't tested it).
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy
On Mon, 2007-11-12 at 21:59 +0100, Pavel Stehule wrote:
SELECT docno
FROM information
WHERE document.CONTAINS
('STEMMED FORM OF "standard"
IN SAME PARAGRAPH AS
SOUNDS LIKE "sequel"') = 1it's little bit baroque, It's sample of method.
Seems thats the way Oracle does it too.
The SQLServer syntax is
WHERE contains(text_column, search_query)
which seems marginally better.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Pavel Stehule escribi�:
Hello
look to standard, please. SQL/MM has part - full text.
Huh, what version of the standard is this? My copy (the typical 2003
draft) doesn't have SQL/MM AFAICS.
--
Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4
"La espina, desde que nace, ya pincha" (Proverbio africano)
On 12/11/2007, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Pavel Stehule escribió:
Hello
look to standard, please. SQL/MM has part - full text.
Huh, what version of the standard is this? My copy (the typical 2003
draft) doesn't have SQL/MM AFAICS.
I found
http://jtc1sc32.org/doc/N0751-0800/32N0771T.pdf
http://www.sigmod.org/record/issues/0112/standards.pdf
http://dbs.uni-leipzig.de/file/kap5.pdf
Pavel
Show quoted text
--
Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4
"La espina, desde que nace, ya pincha" (Proverbio africano)
Simon Riggs wrote:
On Mon, 2007-11-12 at 11:56 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
So we end up with a normal sounding function that is overloaded to
provide all of the various goodies.As best I can tell, @@ does exactly this already. This is just a
different spelling of the same capability, and I don't actually
find it better. Why is "text_search(x,y)" better than "x @@ y"?
We don't recommend that people write "texteq(x,y)" instead of
"x = y".Most people don't understand those differences. x = y means "make sure
they are the same" to most people. They don't see what you (and I) see:
function and operator interchangeability. So text_search() is better
than @@ and = is better than texteq(). Life ain't neat...Right now, Full Text Search SQL looks like complete gibberish and it
dissuades many people from using what is an awesome set of features. I
just want to add a little sugar to help people get started.
I realized this when editing the documentation but not clearly. I
noticed that:
http://momjian.us/main/writings/pgsql/sgml/textsearch-intro.html#TEXTSEARCH-MATCHING
tsvector @@ tsquery
tsquery @@ tsvector
text @@ tsquery
text @@ text
The first two of these we saw already. The form text @@ tsquery is
equivalent to to_tsvector(x) @@ y. The form text @@ text is equivalent
to to_tsvector(x) @@ plainto_tsquery(y).
was quite odd, especially the "text @@ text" case, and in fact it makes
casting almost required unless you can remember which one is a query and
which is a vector (hint, the vector is first). What really adds to the
confusion is that the operator is two _identical_ characters, meaning
the operator is symetric, and it behave symetric if you cast one side,
but as vector @@ query if you don't.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +