new function for tsquery creartion
Dear all,
Now Postgres has a few functions to create tsqueries for full text
search. The main one is the to_tsquery function that allows to make
query with any operation. But to make correct query all of the operators
should be specified explicitly. In order to make it easier postgres has
functions like plainto_tsquery and phraseto_tsquery which allow to make
tsqueries from strings. But they are not flexible enough.
Let me introduce new function for full text search query creation(which
is called 'queryto_tsquery'). It takes 'google like' query string and
translates it to tsquery.
The main features are the following:
All the text inside double quotes would be treated as a phrase("a b c"
-> 'a <-> b <-> c')
New operator AROUND(N). It matches if the distance between words(or
maybe phrases) is less than or equal to N.
Alias for !('-rat' is the same as '!rat')
Alias for |('dog OR cat' is the same as 'dog | cat')
As a plainto_tsquery and phraseto_tsquery it will fill operators by
itself, but already placed operations won't be ignored. It allows to
combine two approaches.
In the attachment you can find patch with the new features, tests and
documentation for it.
What do you think about it?
Thank you very much for the attention!
--
------
Victor Drobny
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
queryto_tsquery.patchtext/x-diff; name=queryto_tsquery.patchDownload+449-40
On Wed, Jul 19, 2017 at 12:43 PM, Victor Drobny <v.drobny@postgrespro.ru> wrote:
Let me introduce new function for full text search query creation(which is
called 'queryto_tsquery'). It takes 'google like' query string and
translates it to tsquery.
I haven't looked at the code, but that sounds like a neat idea.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Jul 20, 2017 at 4:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jul 19, 2017 at 12:43 PM, Victor Drobny <v.drobny@postgrespro.ru> wrote:
Let me introduce new function for full text search query creation(which is
called 'queryto_tsquery'). It takes 'google like' query string and
translates it to tsquery.I haven't looked at the code, but that sounds like a neat idea.
+1
This is a very cool feature making tsquery much more accessible. Many
people know that sort of defacto search engine query language that
many websites accept using quotes, AND, OR, - etc.
Calling this search syntax just "query" seems too general and
overloaded. "Simple search", "simple query", "web search", "web
syntax", "web query", "Google-style query", "Poogle" (kidding!) ...
well I'm not sure, but I feel like it deserves a proper name.
websearch_to_tsquery()?
I see that your AROUND(n) is an undocumented Google search syntax.
That's a good trick to know.
Please send a rebased version of the patch for people to review and
test as that one has bit-rotted.
--
Thomas Munro
http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-09-09 06:03, Thomas Munro wrote:
Please send a rebased version of the patch for people to review and
test as that one has bit-rotted.
Hello,
Thank you for interest. In the attachment you can find rebased
version(based on 69835bc8988812c960f4ed5aeee86b62ac73602a commit)
--
Victor Drobny
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
queryto_tsquery2.patchtext/x-diff; name=queryto_tsquery2.patchDownload+449-40
Hi all,
I am extending phrase operator <n> is such way that it will have <n,m>
syntax that means from n to m words, so I will use such syntax (<n,m>)
further. I found that a AROUND(N) b is exactly the same as a <-N,N> b
and it can be replaced while parsing. So, what do you think of such
idea? In this patch I have noticed some unobvious behavior.
# select to_tsvector('Hello, cat world!') @@ queryto_tsquery('cat
AROUND(1) cat') as match;
match
-------
t
cat AROUND(1) cat is the same is "cat <1> cat || cat <0> cat" and:
# select to_tsvector('Hello, cat world!') @@ to_tsquery('cat <0> cat');
?column?
-------
t
It seems to be a proper logic behavior but it is a possible pitfall,
maybe it should be documented?
But more important question is how AROUND() operator should handle stop
words? Now it works as:
# select queryto_tsquery('cat <2> (a AROUND(10) rat)');
queryto_tsquery
------------------
'cat' <12> 'rat'
(1 row)
# select queryto_tsquery('cat <2> a AROUND(10) rat');
queryto_tsquery
------------------------
'cat' AROUND(12) 'rat'
(1 row)
In my opinion it should be like:
cat <2> (a AROUND(10) rat) == cat <2,2> (a <-10,10> rat) == cat <-8,12>
rat
cat <2> a AROUND(10) rat == cat <2,2> a <-10,10> rat = cat <-8, 12>
rat
Now <n,m> operator can be replaced with combination of phrase
operator <n>, AROUND(), and logical operators, but with <n,m> operator
it will be much painless. Correct me, please, if I am wrong.
--
Alexey Chernyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-10-13 16:37, Alexey Chernyshov wrote:
Hi all,
I am extending phrase operator <n> is such way that it will have <n,m>
syntax that means from n to m words, so I will use such syntax (<n,m>)
further. I found that a AROUND(N) b is exactly the same as a <-N,N> b
and it can be replaced while parsing. So, what do you think of such
idea? In this patch I have noticed some unobvious behavior.
Thank you for the interest and review!
# select to_tsvector('Hello, cat world!') @@ queryto_tsquery('cat
AROUND(1) cat') as match;
match
-------
tcat AROUND(1) cat is the same is "cat <1> cat || cat <0> cat" and:
# select to_tsvector('Hello, cat world!') @@ to_tsquery('cat <0> cat');
?column?
-------
tIt seems to be a proper logic behavior but it is a possible pitfall,
maybe it should be documented?
It is a tricky question. I think that this interpretation is confusing,
so
better to make it as <-N, -1> and <1, N>.
But more important question is how AROUND() operator should handle stop
words? Now it works as:# select queryto_tsquery('cat <2> (a AROUND(10) rat)');
queryto_tsquery
------------------
'cat' <12> 'rat'
(1 row)# select queryto_tsquery('cat <2> a AROUND(10) rat');
queryto_tsquery
------------------------
'cat' AROUND(12) 'rat'
(1 row)In my opinion it should be like:
cat <2> (a AROUND(10) rat) == cat <2,2> (a <-10,10> rat) == cat <-8,12>
rat
I think that correct version is:
cat <2> (a AROUND(10) rat) == cat <2,2> (a <-10,10> rat) == cat <-2,12>
rat.
cat <2> a AROUND(10) rat == cat <2,2> a <-10,10> rat = cat <-8, 12>
rat
It is a problem indeed. I did not catch it during implementation. Thank
you
for pointing it out.
Now <n,m> operator can be replaced with combination of phrase
operator <n>, AROUND(), and logical operators, but with <n,m> operator
it will be much painless. Correct me, please, if I am wrong.
I think that <n,m> operator is more general than around(n) so the last
one
should be based on yours. However, i think, that taking negative
parameters
is not the best idea because it is confusing. On top of that it is not
so
necessary and i think it won`t be popular among users.
It seems to me that AROUND operator can be easily implemented with
<n,m>,
also, it helps to avoid problems, that you showed above.
--
Victor Drobny
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
On 09/13/2017 10:57 AM, Victor Drobny wrote:
On 2017-09-09 06:03, Thomas Munro wrote:
Please send a rebased version of the patch for people to review and
test as that one has bit-rotted.Hello,
Thank you for interest. In the attachment you can find rebased
version(based on 69835bc8988812c960f4ed5aeee86b62ac73602a commit)
I did a quick review of the patch today. The patch unfortunately no
longer applies, so I had to use an older commit from September. Please
rebase to current master.
I've only looked on the diff at this point, will do more testing once
the rebase happens.
Some comments:
1) This seems to mix multiple improvements into one patch. There's the
support for alternative query syntax, and then there are the new
operators (AROUND and <m,n>). I propose to split the patch into two or
more parts, each addressing one of those bits.
I guess there will be two or three parts - first adding the syntax,
second adding <m,n> and third adding the AROUND(n). Seems reasonable?
2) I don't think we should mention Google in the docs explicitly. Not
that I'm somehow anti-google, but this syntax was certainly not invented
by Google - I vividly remember using something like that on Altavista
(yeah, I'm old). And it's used by pretty much every other web search
engine out there ...
3) In the SGML docs, please use <literal></literal> instead of just
quoting the values. So it should be <literal>|</literal> instead of '|'
etc. Just like in the parts describing plainto_tsquery, for example.
4) Also, I recommend adding a brief explanation what the examples do.
Right now there's just a bunch of queryto_tsquery, and the reader is
expected to understand the output. I suggest adding a sentence or two,
explaining what's happening (just like for plainto_tsquery examples).
5) I'm not sure about negative values in the <n,m> operator. I don't
find it particularly confusing - once you understand that (a <n,m> b)
means "there are 'k' words between 'a' and 'b' (n <= k <= m)", then
negative values seem like a fairly straightforward extension.
But I guess the main question is "Is there really a demand for the new
<n,m> operator, or have we just invented if because we can?"
6) There seem to be some new constants defined, with names not following
the naming convention. I mean this
#define WAITOPERAND 1
#define WAITOPERATOR 2
#define WAITFIRSTOPERAND 3
#define WAITSINGLEOPERAND 4
#define INSIDEQUOTES 5 <-- the new one
and
#define TSPO_L_ONLY 0x01
#define TSPO_R_ONLY 0x02
#define TSPO_BOTH 0x04
#define TS_NOT_EXAC 0x08 <-- the new one
Perhaps that's OK, but it seems a bit inconsistent.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-11-19 04:30, Tomas Vondra wrote:
Hello,
Hi,
On 09/13/2017 10:57 AM, Victor Drobny wrote:
On 2017-09-09 06:03, Thomas Munro wrote:
Please send a rebased version of the patch for people to review and
test as that one has bit-rotted.Hello,
Thank you for interest. In the attachment you can find rebased
version(based on 69835bc8988812c960f4ed5aeee86b62ac73602a commit)I did a quick review of the patch today. The patch unfortunately no
longer applies, so I had to use an older commit from September. Please
rebase to current master.
Thank you for your time. In the attachment you can find rebased version.
(based on e842791b0 commit)
I've only looked on the diff at this point, will do more testing once
the rebase happens.Some comments:
1) This seems to mix multiple improvements into one patch. There's the
support for alternative query syntax, and then there are the new
operators (AROUND and <m,n>). I propose to split the patch into two or
more parts, each addressing one of those bits.
I agree. I have split it in 3 parts: support for around operator,
queryto_tsquery function and documentation.
I guess there will be two or three parts - first adding the syntax,
second adding <m,n> and third adding the AROUND(n). Seems reasonable?2) I don't think we should mention Google in the docs explicitly. Not
that I'm somehow anti-google, but this syntax was certainly not
invented
by Google - I vividly remember using something like that on Altavista
(yeah, I'm old). And it's used by pretty much every other web search
engine out there ...
Yes, those syntax is not introduced by google, but, as for me, it was
the
easiest way to give a brief description of it. Of cause it can be
changed,
I just don't know how. Any suggestions are welcomed! ;)
3) In the SGML docs, please use <literal></literal> instead of just
quoting the values. So it should be <literal>|</literal> instead of '|'
etc. Just like in the parts describing plainto_tsquery, for example.
Fixed. I hope that i didn't miss anything.
4) Also, I recommend adding a brief explanation what the examples do.
Right now there's just a bunch of queryto_tsquery, and the reader is
expected to understand the output. I suggest adding a sentence or two,
explaining what's happening (just like for plainto_tsquery examples).5) I'm not sure about negative values in the <n,m> operator. I don't
find it particularly confusing - once you understand that (a <n,m> b)
means "there are 'k' words between 'a' and 'b' (n <= k <= m)", then
negative values seem like a fairly straightforward extension.But I guess the main question is "Is there really a demand for the new
<n,m> operator, or have we just invented if because we can?"
The operator <n,m> is not introduced yet. It's just a concept. It were
our
thoughts about implementation AROUND operator through <n,m> in future.
6) There seem to be some new constants defined, with names not
following
the naming convention. I mean this#define WAITOPERAND 1
#define WAITOPERATOR 2
#define WAITFIRSTOPERAND 3
#define WAITSINGLEOPERAND 4
#define INSIDEQUOTES 5 <-- the new oneand
#define TSPO_L_ONLY 0x01
#define TSPO_R_ONLY 0x02
#define TSPO_BOTH 0x04
#define TS_NOT_EXAC 0x08 <-- the new onePerhaps that's OK, but it seems a bit inconsistent.
I agree. I have fixed it.
regards
--
Victor Drobny
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
queryto_tsquery3_01_around.patchtext/x-diff; name=queryto_tsquery3_01_around.patchDownload+127-24
queryto_tsquery3_02_queryto_tsquery.patchtext/x-diff; name=queryto_tsquery3_02_queryto_tsquery.patchDownload+258-20
queryto_tsquery3_03_documentation.patchtext/x-diff; name=queryto_tsquery3_03_documentation.patchDownload+82-4
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed
Hi Victor,
I like the idea and I think it's a great patch. However in current shape it
requires some amount of reworking to meet PostgreSQL standards of code quality.
Particularly:
1. Many new procedures don't have a comment with a brief description. Ideally
every procedure should have not only a brief description but also a description
of every argument, return value and changes of global state if applicable.
2. I believe you could implement the has_prefix procedure just as a wrapper of
strstr().
3. I suggest to use snprintf instead of sprintf in a new code whenever
possible, especially if you are using %s - just to be on a safe side.
4. I noticed that your code affects the catalog. Did you check that your
changes will not cause problems during the migration from the older version of
PostgreSQL to the never one?
5. Tests for queryto_tsquery use only ASCII strings. I suggest to add a few
test that use non-ASCII characters as well, and a few corner cases like empty
string, string that contains only the stop-words, etc.
6. `make check-world` doesn't pass:
```
***************
*** 1672,1678 ****
(1 row)
set enable_seqscan = on;
-
--test queryto_tsquery function
select queryto_tsquery('My brand new smartphone');
queryto_tsquery
--- 1672,1677 ----
***************
*** 1784,1786 ****
--- 1783,1786 ----
---------------------------
'fat-rat' & 'fat' & 'rat'
(1 row)
+
```
Hi Victor,
I like the idea and I think it's a great patch. However in current shape it
requires some amount of reworking to meet PostgreSQL standards of code quality.
Also I would like to add that I agree with Thomas Munro:
Calling this search syntax just "query" seems too general and
overloaded. "Simple search", "simple query", "web search", "web
syntax", "web query", "Google-style query", "Poogle" (kidding!) ...
well I'm not sure, but I feel like it deserves a proper name.
websearch_to_tsquery()?
websearch_to_tsquery() sounds much better than query_to_tsquery().
Also I agree Tomas Vondra in regard that:
2) I don't think we should mention Google in the docs explicitly. Not
that I'm somehow anti-google, but this syntax was certainly not invented
by Google - I vividly remember using something like that on Altavista
(yeah, I'm old). And it's used by pretty much every other web search
engine out there ...
I suggest to rephrase:
```
+ about its input. <function>queryto_tsquery</function> provides a
+ different, Google like syntax to create tsquery.
```
.. to something more like "provides a different syntax, similar to one
used in web search engines, to create tsqeury". And maybe give a few
examples right in the next sentence.
--
Best regards,
Aleksander Alekseev
On Tue, Nov 28, 2017 at 11:57 PM, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:
I like the idea and I think it's a great patch. However in current shape it
requires some amount of reworking to meet PostgreSQL standards of code quality.Also I would like to add that I agree with Thomas Munro:
Calling this search syntax just "query" seems too general and
overloaded. "Simple search", "simple query", "web search", "web
syntax", "web query", "Google-style query", "Poogle" (kidding!) ...
well I'm not sure, but I feel like it deserves a proper name.
websearch_to_tsquery()?websearch_to_tsquery() sounds much better than query_to_tsquery().
Also I agree Tomas Vondra in regard that:
2) I don't think we should mention Google in the docs explicitly. Not
that I'm somehow anti-google, but this syntax was certainly not invented
by Google - I vividly remember using something like that on Altavista
(yeah, I'm old). And it's used by pretty much every other web search
engine out there ...I suggest to rephrase:
``` + about its input. <function>queryto_tsquery</function> provides a + different, Google like syntax to create tsquery. ```.. to something more like "provides a different syntax, similar to one
used in web search engines, to create tsqeury". And maybe give a few
examples right in the next sentence.
The patch got a review less than 1 day ago, so I am moving it to next
CF with the same status, waiting on author.
--
Michael
On 2017-11-28 17:57, Aleksander Alekseev wrote:
Hi Aleksander,
Thank you for review. I have tried to fix all of your comments.
However i want to mention that the absence of comments for functions
in to_tsany.c is justified by the absence of comments for other
similar functions.
Hi Victor,
I like the idea and I think it's a great patch. However in current
shape it
requires some amount of reworking to meet PostgreSQL standards of code
quality.Also I would like to add that I agree with Thomas Munro:
Calling this search syntax just "query" seems too general and
overloaded. "Simple search", "simple query", "web search", "web
syntax", "web query", "Google-style query", "Poogle" (kidding!) ...
well I'm not sure, but I feel like it deserves a proper name.
websearch_to_tsquery()?websearch_to_tsquery() sounds much better than query_to_tsquery().
Also I agree Tomas Vondra in regard that:
2) I don't think we should mention Google in the docs explicitly. Not
that I'm somehow anti-google, but this syntax was certainly not
invented
by Google - I vividly remember using something like that on Altavista
(yeah, I'm old). And it's used by pretty much every other web search
engine out there ...I suggest to rephrase:
``` + about its input. <function>queryto_tsquery</function> provides a + different, Google like syntax to create tsquery. ```.. to something more like "provides a different syntax, similar to one
used in web search engines, to create tsqeury". And maybe give a few
examples right in the next sentence.
Best,
--
Victor Drobny
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2017-11-29 17:56, Victor Drobny wrote:
Sorry, forgot to attach new version of the patch.
On 2017-11-28 17:57, Aleksander Alekseev wrote:
Hi Aleksander,Thank you for review. I have tried to fix all of your comments.
However i want to mention that the absence of comments for functions
in to_tsany.c is justified by the absence of comments for other
similar functions.Hi Victor,
I like the idea and I think it's a great patch. However in current
shape it
requires some amount of reworking to meet PostgreSQL standards of
code quality.Also I would like to add that I agree with Thomas Munro:
Calling this search syntax just "query" seems too general and
overloaded. "Simple search", "simple query", "web search", "web
syntax", "web query", "Google-style query", "Poogle" (kidding!) ...
well I'm not sure, but I feel like it deserves a proper name.
websearch_to_tsquery()?websearch_to_tsquery() sounds much better than query_to_tsquery().
Also I agree Tomas Vondra in regard that:
2) I don't think we should mention Google in the docs explicitly. Not
that I'm somehow anti-google, but this syntax was certainly not
invented
by Google - I vividly remember using something like that on Altavista
(yeah, I'm old). And it's used by pretty much every other web search
engine out there ...I suggest to rephrase:
``` + about its input. <function>queryto_tsquery</function> provides a + different, Google like syntax to create tsquery. ```.. to something more like "provides a different syntax, similar to one
used in web search engines, to create tsqeury". And maybe give a few
examples right in the next sentence.Best,
--
Victor Drobny
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
queryto_tsquery4_01_around.patchtext/x-diff; name=queryto_tsquery4_01_around.patchDownload+123-24
queryto_tsquery4_02_queryto_tsquery.patchtext/x-diff; name=queryto_tsquery4_02_queryto_tsquery.patchDownload+292-18
queryto_tsquery4_03_documentation.patchtext/x-diff; name=queryto_tsquery4_03_documentation.patchDownload+82-4
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed
Here are a few minor issues:
```
+/*
+ * Checks if 'str' starts with a 'prefix'
+ */
+static bool
+has_prefix(char * str, char * prefix)
+{
+ if (strlen(prefix) > strlen(str))
+ {
+ return false;
+ }
+ return strstr(str, prefix) == str;
+}
```
strlen() check is redundant.
```
+ case OP_AROUND:
+ snprintf(in->cur, 256, " AROUND(%d) %s", distance, nrm.buf);
+ break;
```
Instead of the constant 256 it's better to use sizeof().
Apart from these issues this patch looks not bad.
The new status of this patch is: Ready for Committer
On 2017-11-29 17:56:30 +0300, Victor Drobny wrote:
Thank you for review. I have tried to fix all of your comments.
However i want to mention that the absence of comments for functions
in to_tsany.c is justified by the absence of comments for other
similar functions.
That's not justification. Tsquery related code is notorious for being
badly commented, we do not want to continue that.
Greetings,
Andres Freund
It seems that this patch doesn't apply anymore, see http://commitfest.cputube.org/
The new status of this patch is: Waiting on Author
Hi Victor,
On 3/5/18 7:52 AM, Aleksander Alekseev wrote:
It seems that this patch doesn't apply anymore, see http://commitfest.cputube.org/
The new status of this patch is: Waiting on Author
This patch needs a rebase and should address the comments from
Aleksander and Andres. We are now three weeks into the CF with no new
patch.
Are you planning to provide a new patch? If not, I think it should be
marked as Returned with Feedback and submitted to the next CF once it
has been updated.
Regards,
--
-David
david@pgmasters.net
Hi David,
I'd like to take over from Victor. I'll post a revised version of the
patch in a couple of days.
--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
I am extending phrase operator <n> is such way that it will have <n,m>
syntax that means from n to m words, so I will use such syntax (<n,m>)
further. I found that a AROUND(N) b is exactly the same as a <-N,N> b
and it can be replaced while parsing. So, what do you think of such
idea? In this patch I have noticed some unobvious behavior.
I think new operator should be a subject for separate patch. And I prefer idea
about range phrase operator.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Thu, 22 Mar 2018 16:53:15 +0300
Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:
Hi David,
I'd like to take over from Victor. I'll post a revised version of the
patch in a couple of days.
Hi Dmitry,
Recently I worked with the old version of the patch and found a bug.
So, I think it is better to notify you immediately, so you can fix it in
rebased/revised version.
I noticed, that operator AROUND(N) works only
in case of non-negative operands. If any of the operands is negative, it
behaves as phrase operator <N>. It is caused by lack of TS_NOT_EXAC
flag and AROUND(N) operator check in function TS_phrase_execute in
branches for negated operands.
--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company