pg_trgm word_similarity inconsistencies or bug
Hello all, this is related to postgres 9.6 (9.6.4) and a good description can be found here https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-words
But in summary, word_similarity doesn’t seem to do exactly what the docs say, since it will match trigrams from multiple words rather tan doing a word by word comparison.
Below is a table with output and expected output, thanks to kiln from stackoverflow to provide it.
with data(t) as (
values
('message'),
('message s'),
('message sag'),
('message sag sag'),
('message sag sage')
)
select t, word_similarity('sage', t), my_word_similarity('sage', t)
from data;
t | word_similarity | my_word_similarity
------------------+-----------------+--------------------
message | 0.6 | 0.3
message s | 0.8 | 0.3
message sag | 1 | 0.5
message sag sag | 1 | 0.5
message sag sage | 1 | 1
On Fri, Oct 27, 2017 at 06:48:08PM +0000, Cristiano Coelho wrote:
Hello all, this is related to postgres 9.6 (9.6.4) and a good description can be found here https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-words
But in summary, word_similarity doesn’t seem to do exactly what the docs say, since it will match trigrams from multiple words rather tan doing a word by word comparison.
Below is a table with output and expected output, thanks to kiln from stackoverflow to provide it.
Interesting. An klin's answer from stackoverflow.com is right.
The initial example can be reduced to the next:
=# select word_similarity('sage', 'age sag');
word_similarity
-----------------
1
It computes maximum similarity using closest trigrams not considering order of
'sage' trigrams. It determines that all
trigrams from 'sage' match trigrams from 'age sag'.
Initial order of 'age sag' trigrams:
' a', ' ag', 'age', 'ge ', ' s', ' sa', 'sag', 'ag '
^ ^
|from |to
Sorted 'sage' trigrams (all of them occured within 'age sag' trigrams
continuously):
' s', ' sa', 'age', 'ge ', 'sag'
Maybe the problem should be solved by considering 'sage' trigrams
initial order.
--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On Sat, Oct 28, 2017 at 11:22 AM, Arthur Zakirov <a.zakirov@postgrespro.ru>
wrote:
On Fri, Oct 27, 2017 at 06:48:08PM +0000, Cristiano Coelho wrote:
Hello all, this is related to postgres 9.6 (9.6.4) and a good
description can be found here https://stackoverflow.com/
questions/46966360/postgres-word-similarity-not-comparing-wordsBut in summary, word_similarity doesn’t seem to do exactly what the docs
say, since it will match trigrams from multiple words rather tan doing a
word by word comparison.Below is a table with output and expected output, thanks to kiln from
stackoverflow to provide it.
Interesting. An klin's answer from stackoverflow.com is right.
The initial example can be reduced to the next:
=# select word_similarity('sage', 'age sag');
word_similarity
-----------------
1It computes maximum similarity using closest trigrams not considering
order of
'sage' trigrams. It determines that all
trigrams from 'sage' match trigrams from 'age sag'.Initial order of 'age sag' trigrams:
' a', ' ag', 'age', 'ge ', ' s', ' sa', 'sag', 'ag '
^ ^
|from |to
Sorted 'sage' trigrams (all of them occured within 'age sag' trigrams
continuously):
' s', ' sa', 'age', 'ge ', 'sag'Maybe the problem should be solved by considering 'sage' trigrams
initial order.
We searching for continuous extent of second string trigrams (in original
orders) which has best similarity with first string trigrams.
Possible solution could be forcing this extent boundaries to be at word
boundaries. However, it would become less convenient to search for *part*
of word. And we already have users adopt this feature.
So, I see following solution:
1) Define GUC variable which specifies whether word_similarity() should
force extent boundaries to be at word boundaries,
2) Document both cases of word_similarity() behavior.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
2017-10-30 19:08 GMT+01:00 Alexander Korotkov <a.korotkov@postgrespro.ru>:
On Sat, Oct 28, 2017 at 11:22 AM, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
On Fri, Oct 27, 2017 at 06:48:08PM +0000, Cristiano Coelho wrote:
Hello all, this is related to postgres 9.6 (9.6.4) and a good description can be found here https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-words
But in summary, word_similarity doesn’t seem to do exactly what the docs say, since it will match trigrams from multiple words rather tan doing a word by word comparison.
Below is a table with output and expected output, thanks to kiln from stackoverflow to provide it.
Interesting. An klin's answer from stackoverflow.com is right.
The initial example can be reduced to the next:
=# select word_similarity('sage', 'age sag');
word_similarity
-----------------
1It computes maximum similarity using closest trigrams not considering order of
'sage' trigrams. It determines that all
trigrams from 'sage' match trigrams from 'age sag'.Initial order of 'age sag' trigrams:
' a', ' ag', 'age', 'ge ', ' s', ' sa', 'sag', 'ag '
^ ^
|from |to
Sorted 'sage' trigrams (all of them occured within 'age sag' trigrams
continuously):
' s', ' sa', 'age', 'ge ', 'sag'Maybe the problem should be solved by considering 'sage' trigrams
initial order.We searching for continuous extent of second string trigrams (in original orders) which has best similarity with first string trigrams.
Possible solution could be forcing this extent boundaries to be at word boundaries. However, it would become less convenient to search for *part* of word. And we already have users adopt this feature.
So, I see following solution:
1) Define GUC variable which specifies whether word_similarity() should force extent boundaries to be at word boundaries,
2) Document both cases of word_similarity() behavior.------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Look at the example:
with data(word, string) as (
values
('sage', 'message'),
('sage', 'message s'),
('sage', 'message sa')
)
select
similarity(word, string),
word_similarity (word, string)
from data;
similarity | word_similarity
------------+-----------------
0.3 | 0.6
0.363636 | 0.8
0.454545 | 1
(3 rows)
When searching for a part of a word I would expect that the word
similarity is the same in all three rows. It's really strange that the
context of the second word (sa) makes the similarity equal to 1.
From a user's point of view it's also hard to understand why there is
such a big difference between similarity() and word_similarity(),
especially when comparing just two words (the first row).
I do not think the current function has any practical use.
------
Jan Przemysław Wójcik
2017-10-30 19:08 GMT+01:00 Alexander Korotkov <a.korotkov@postgrespro.ru>:
On Sat, Oct 28, 2017 at 11:22 AM, Arthur Zakirov <a.zakirov@postgrespro.ru>
wrote:On Fri, Oct 27, 2017 at 06:48:08PM +0000, Cristiano Coelho wrote:
Hello all, this is related to postgres 9.6 (9.6.4) and a good
description can be found here
https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-wordsBut in summary, word_similarity doesn’t seem to do exactly what the docs
say, since it will match trigrams from multiple words rather tan doing a
word by word comparison.Below is a table with output and expected output, thanks to kiln from
stackoverflow to provide it.Interesting. An klin's answer from stackoverflow.com is right.
The initial example can be reduced to the next:
=# select word_similarity('sage', 'age sag');
word_similarity
-----------------
1It computes maximum similarity using closest trigrams not considering
order of
'sage' trigrams. It determines that all
trigrams from 'sage' match trigrams from 'age sag'.Initial order of 'age sag' trigrams:
' a', ' ag', 'age', 'ge ', ' s', ' sa', 'sag', 'ag '
^ ^
|from |to
Sorted 'sage' trigrams (all of them occured within 'age sag' trigrams
continuously):
' s', ' sa', 'age', 'ge ', 'sag'Maybe the problem should be solved by considering 'sage' trigrams
initial order.We searching for continuous extent of second string trigrams (in original
orders) which has best similarity with first string trigrams.
Possible solution could be forcing this extent boundaries to be at word
boundaries. However, it would become less convenient to search for *part*
of word. And we already have users adopt this feature.
So, I see following solution:
1) Define GUC variable which specifies whether word_similarity() should
force extent boundaries to be at word boundaries,
2) Document both cases of word_similarity() behavior.------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On Tue, Oct 31, 2017 at 4:02 PM, Jan Przemysław Wójcik <
jan.przemyslaw.wojcik@gmail.com> wrote:
2017-10-30 19:08 GMT+01:00 Alexander Korotkov <a.korotkov@postgrespro.ru>:
On Sat, Oct 28, 2017 at 11:22 AM, Arthur Zakirov <
a.zakirov@postgrespro.ru> wrote:
On Fri, Oct 27, 2017 at 06:48:08PM +0000, Cristiano Coelho wrote:
Hello all, this is related to postgres 9.6 (9.6.4) and a good
description can be found here https://stackoverflow.com/
questions/46966360/postgres-word-similarity-not-comparing-wordsBut in summary, word_similarity doesn’t seem to do exactly what the
docs say, since it will match trigrams from multiple words rather tan doing
a word by word comparison.Below is a table with output and expected output, thanks to kiln from
stackoverflow to provide it.
Interesting. An klin's answer from stackoverflow.com is right.
The initial example can be reduced to the next:
=# select word_similarity('sage', 'age sag');
word_similarity
-----------------
1It computes maximum similarity using closest trigrams not considering
order of
'sage' trigrams. It determines that all
trigrams from 'sage' match trigrams from 'age sag'.Initial order of 'age sag' trigrams:
' a', ' ag', 'age', 'ge ', ' s', ' sa', 'sag', 'ag '
^ ^
|from |to
Sorted 'sage' trigrams (all of them occured within 'age sag' trigrams
continuously):
' s', ' sa', 'age', 'ge ', 'sag'Maybe the problem should be solved by considering 'sage' trigrams
initial order.We searching for continuous extent of second string trigrams (in
original orders) which has best similarity with first string trigrams.
Possible solution could be forcing this extent boundaries to be at word
boundaries. However, it would become less convenient to search for *part*
of word. And we already have users adopt this feature.So, I see following solution:
1) Define GUC variable which specifies whether word_similarity() shouldforce extent boundaries to be at word boundaries,
2) Document both cases of word_similarity() behavior.
Look at the example:
with data(word, string) as (
values
('sage', 'message'),
('sage', 'message s'),
('sage', 'message sa')
)select
similarity(word, string),
word_similarity (word, string)
from data;similarity | word_similarity
------------+-----------------
0.3 | 0.6
0.363636 | 0.8
0.454545 | 1
(3 rows)When searching for a part of a word I would expect that the word
similarity is the same in all three rows. It's really strange that the
context of the second word (sa) makes the similarity equal to 1.From a user's point of view it's also hard to understand why there is
such a big difference between similarity() and word_similarity(),
especially when comparing just two words (the first row).
Probably word_similarity() is not a good name for this function. Initially
it was called substring_similarity() which now seems like better name for
that.
I do not think the current function has any practical use.
It's hard for me to agree or disagree with you. There is no technical
problem to force word_similarity() to search for extent boundaries within
word boundaries. However, we already have customers using this function
(and they are likely satisfied with its currency behavior). It's important
for me that our fix wouldn't affect them. I asked them to join this
discussion. I hope that together we'll find a consensus.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi!
I'd like to forward a feedback from our customer who uses word_similarity()
function.
François finds current behavior of word_similarity() to be useful. Thus, I
think we should preserve it. But documentation correction is needed and
option for alternative behavior would be useful too.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
---------- Forwarded message ----------
From: François CHAHUNEAU <Francois.CHAHUNEAU@numen.fr>
Date: Wed, Nov 1, 2017 at 1:04 AM
Subject: RE: [BUGS] pg_trgm word_similarity inconsistencies or bug
To: Alexander Korotkov <a.korotkov@postgrespro.ru>
Cc: Thierry BOUDIERE <Thierry.BOUDIERE@numen.fr>, "foli@numen.mg" <
foli@numen.mg>
Hello Alexander,
We agree that the current pg_trgm documentation does not correctly reflect
the de facto behavior of word_similarity(), and that something has to be
changed. But to us, it is more a documentation problem than anything else.
What is computed is still « substring_similarity » as was initially
specified between us, but it is influenced by a strong word boundary bias
caused by the way trigrams are padded at word boundaries. This bias was
noticed by early reviewers and you explained that this motivated the name
switch to « word_similarity ». As you will remember, at the time we
discovered this, we were suprised because we considerd this as a slight
misnomer. Indeed, what is currently described in the 9.6 pg_trgm
documentation is inaccurate (although seemingly consistent with this new
name) and has to be amended.
Now, word_similarity() has been out for more than a year and, of course, it
is preferable to avoid any breaking changes… In our case, we consider the
name « unfortunate » and the explanation buggy, not the function itself.
As you may remember from the initial discussion, some other users stressed
the importance to be able to matchsub strings. We tend to agree with what
Jeff Janes wrote in this discussion :
The reason I like the option of not treating word boundaries as
special in this case is that often in scientific vocabulary, and in
catalog part numbers, people are pretty inconsistent about whether
they included spaces. "HEK 293", "HEK293", and "HEK-293" could be all
the same thing. So I like to strip out spaces and punctuation on both
sides of operator. Of course I can't do that if there are invisible
un-removable spaces on the substring side.
But, It doesn't sound like I am going to win that debate. Given that,
I don't think we need a different name for the function. I'm fine with
explaining the word-boundary subtlety in the documentation, and
keeping the function name itself simple.
Now, considering your proposal :
As far as we are concerned, we use <% and %> everyday for efficient fuzzy
matching on large databases. Our typical usage scenario is matching noisy
OCRized text strings against reference databases.
*> 1) Define GUC variable which specifies whether word_similarity() should
force extent boundaries to be at word boundaries,*
Ok for us,* iff* default behavior remains the same as now, for backward
compatibility reasons. We could take advantage, *in some cases*, of the new
« word rounded » behavior controlled by the GUC variable, but this would
not cover all scenarios currently in use.
2*) Document both cases of word_similarity() behavior.*
This is clearly needed anyway.
Best regards,
*François CHAHUNEAU*
Directeur des technologies
NUMEN DIGITAL| 24, rue Marc Seguin
<https://maps.google.com/?q=24,+rue+Marc+Seguin+75018+Paris+France&entry=gmail&source=g>
75018
<https://maps.google.com/?q=24,+rue+Marc+Seguin+75018+Paris+France&entry=gmail&source=g>
Paris
<https://maps.google.com/?q=24,+rue+Marc+Seguin+75018+Paris+France&entry=gmail&source=g>
France
<https://maps.google.com/?q=24,+rue+Marc+Seguin+75018+Paris+France&entry=gmail&source=g>*
| www.numen.fr
<https://numen.letsignit.com/r/0/991c6b92-d8fe-4afa-95f5-7b74d0322fd9>*
Tel +33 1 40 37 95 03 <+33%201%2040%2037%2095%2003> | Mob +33 6 07 85 21 79
<+33%206%2007%2085%2021%2079> | Fax +33 1 40 37 94 94
<+33%201%2040%2037%2094%2094>
<https://numen.letsignit.com/r/15/57dd0ced-dea8-441a-a066-68bf7cedbecd>
<https://numen.letsignit.com/r/3/9be1fd6e-57d8-4963-bcc7-03151b263433> Pensez
vert, n’imprimez que nécessaire. Les informations contenues dans le présent
e-mail sont exclusivement adressées au(x) destinataire(s) de ce message et
peuvent contenir des informations confidentielles, protégées par un secret
professionnel. L’utilisation de ces informations par d’autres personnes que
le(s) destinataire(s) est strictement interdite. Si vous n’êtes pas
destinataire de ce message, la publication, la reproduction, la diffusion
et /ou la distribution de ces informations auprès de tiers n’est pas
autorisée. Si vous avez reçu cet e-mail par erreur, veuillez nous en
informer immédiatement, détruire l'email, ses copies et documents joints et
le supprimer.
*De :* Alexander Korotkov [mailto:a.korotkov@postgrespro.ru]
*Envoyé :* mardi 31 octobre 2017 16:18
*À :* Thierry BOUDIERE <Thierry.BOUDIERE@numen.fr>; François CHAHUNEAU <
Francois.CHAHUNEAU@numen.fr>
*Objet :* Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug
Dear, Thierry and François!
PostgreSQL users found inconsistency between documentation and
implementation of word_similarity().
Possible solution proposed by the reporter is to alter the implementation.
But it's important for me that your interests are not affected but
potential further change of implementation of word_similarity().
Could you please share your opinion on changes proposed by Jan in the
pgsql-bugs mailing list?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Import Notes
Reply to msg id not found: AM4PR02MB1316BF00D42FC486D5A4F6CEF65E0@AM4PR02MB1316.eurprd02.prod.outlook.com
Hi,
my statement about the function usefulness was probably too categorical,
though I had in mind the current name of the function.
I'm afraid that creating a function that implements quite different
algorithms depending on a global parameter seems very hacky and would lead
to misunderstandings. I do understand the need of backward compatibility,
but I'd opt for the lesser evil. Perhaps a good idea would be to change the
name to 'substring_similarity()' and introduce the new function
'word_similarity()' later, for example in the next major version release.
------
Jan Przemysław Wójcik
--
Sent from: http://www.postgresql-archive.org/PostgreSQL-bugs-f2117394.html
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Hi!
On Tue, Nov 7, 2017 at 3:51 PM, Jan Przemysław Wójcik <
jan.przemyslaw.wojcik@gmail.com> wrote:
Hi,
my statement about the function usefulness was probably too categorical,
though I had in mind the current name of the function.I'm afraid that creating a function that implements quite different
algorithms depending on a global parameter seems very hacky and would lead
to misunderstandings. I do understand the need of backward compatibility,
but I'd opt for the lesser evil. Perhaps a good idea would be to change the
name to 'substring_similarity()' and introduce the new function
'word_similarity()' later, for example in the next major version release.
Good point. I've no complaints about that. I'm going to propose
corresponding patch to the next commitfest.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Nov 7, 2017 at 7:24 PM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:
On Tue, Nov 7, 2017 at 3:51 PM, Jan Przemysław Wójcik <
jan.przemyslaw.wojcik@gmail.com> wrote:my statement about the function usefulness was probably too categorical,
though I had in mind the current name of the function.I'm afraid that creating a function that implements quite different
algorithms depending on a global parameter seems very hacky and would lead
to misunderstandings. I do understand the need of backward compatibility,
but I'd opt for the lesser evil. Perhaps a good idea would be to change
the
name to 'substring_similarity()' and introduce the new function
'word_similarity()' later, for example in the next major version release.Good point. I've no complaints about that. I'm going to propose
corresponding patch to the next commitfest.
I've written a draft patch for fixing this inconsistency. Please, find it
in attachment. This patch doesn't contain proper documentation and
comments yet.
I've called existing behavior subset_similarity(). I didn't use name
substring_similarity(), because it doesn't really looking for substring
with appropriate padding, but rather searching for continuous subset of
trigrams. For index search over subset similarity, %>>, <<%, <->>>, <<<->
operators are provided. I've added extra arrow sign to denote these
operators look deeper into string.
Simultaneously, word_similarity() now forces extent bounds to be word
bounds. Now word_similarity() behaves similar to my_word_similarity()
proposed on stackoverlow.
# with data(t) as (
values
('message'),
('message s'),
('message sag'),
('message sag sag'),
('message sag sage')
)
select t, subset_similarity('sage', t), word_similarity('sage', t)
from data;
t | subset_similarity | word_similarity
------------------+-------------------+-----------------
message | 0.6 | 0.3
message s | 0.8 | 0.363636
message sag | 1 | 0.5
message sag sag | 1 | 0.5
message sag sage | 1 | 1
(5 rows)
The difference here is only in 'messsage s' row, because word_similarity()
allows matching one word to two or more while my_word_similarity() doesn't
allow that. In this case word_similarity() returns similarity between
'sage' and 'message s'.
# select similarity('sage', 'message s');
similarity
------------
0.363636
(1 row)
I think behavior of word_similarity() appears better here, because typo can
break word into two.
I also wonder if word_similarity() and subset_similarity() should share
same threshold value for indexed search. subset_similarity() typically
returns higher values than word_similarity(). Thus, it's probably makes
sense to split their threshold values.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
pg-trgm-word-subset-similarity-1.patchapplication/octet-stream; name=pg-trgm-word-subset-similarity-1.patchDownload+733-195
Hello Alexander,
This is fine with us. Yes, separate thresholds seem preferable.
Best Regards
Obtenez Outlook pour iOS<https://aka.ms/o0ukef>
________________________________
From: Alexander Korotkov <a.korotkov@postgrespro.ru>
Sent: Thursday, December 7, 2017 4:38:59 PM
To: Jan Przemysław Wójcik; Cristiano Coelho
Cc: pgsql-bugs@postgresql.org; François CHAHUNEAU; Artur Zakirov; pgsql-hackers
Subject: Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug
On Tue, Nov 7, 2017 at 7:24 PM, Alexander Korotkov <a.korotkov@postgrespro.ru<mailto:a.korotkov@postgrespro.ru>> wrote:
On Tue, Nov 7, 2017 at 3:51 PM, Jan Przemysław Wójcik <jan.przemyslaw.wojcik@gmail.com<mailto:jan.przemyslaw.wojcik@gmail.com>> wrote:
my statement about the function usefulness was probably too categorical,
though I had in mind the current name of the function.
I'm afraid that creating a function that implements quite different
algorithms depending on a global parameter seems very hacky and would lead
to misunderstandings. I do understand the need of backward compatibility,
but I'd opt for the lesser evil. Perhaps a good idea would be to change the
name to 'substring_similarity()' and introduce the new function
'word_similarity()' later, for example in the next major version release.
Good point. I've no complaints about that. I'm going to propose corresponding patch to the next commitfest.
I've written a draft patch for fixing this inconsistency. Please, find it in attachment. This patch doesn't contain proper documentation and comments yet.
I've called existing behavior subset_similarity(). I didn't use name substring_similarity(), because it doesn't really looking for substring with appropriate padding, but rather searching for continuous subset of trigrams. For index search over subset similarity, %>>, <<%, <->>>, <<<-> operators are provided. I've added extra arrow sign to denote these operators look deeper into string.
Simultaneously, word_similarity() now forces extent bounds to be word bounds. Now word_similarity() behaves similar to my_word_similarity() proposed on stackoverlow.
# with data(t) as (
values
('message'),
('message s'),
('message sag'),
('message sag sag'),
('message sag sage')
)
select t, subset_similarity('sage', t), word_similarity('sage', t)
from data;
t | subset_similarity | word_similarity
------------------+-------------------+-----------------
message | 0.6 | 0.3
message s | 0.8 | 0.363636
message sag | 1 | 0.5
message sag sag | 1 | 0.5
message sag sage | 1 | 1
(5 rows)
The difference here is only in 'messsage s' row, because word_similarity() allows matching one word to two or more while my_word_similarity() doesn't allow that. In this case word_similarity() returns similarity between 'sage' and 'message s'.
# select similarity('sage', 'message s');
similarity
------------
0.363636
(1 row)
I think behavior of word_similarity() appears better here, because typo can break word into two.
I also wonder if word_similarity() and subset_similarity() should share same threshold value for indexed search. subset_similarity() typically returns higher values than word_similarity(). Thus, it's probably makes sense to split their threshold values.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com<http://www.postgrespro.com/>
The Russian Postgres Company
On Tue, Nov 7, 2017 at 7:51 AM, Jan Przemysław Wójcik
<jan.przemyslaw.wojcik@gmail.com> wrote:
I'm afraid that creating a function that implements quite different
algorithms depending on a global parameter seems very hacky and would lead
to misunderstandings. I do understand the need of backward compatibility,
but I'd opt for the lesser evil. Perhaps a good idea would be to change the
name to 'substring_similarity()' and introduce the new function
'word_similarity()' later, for example in the next major version release.
That breaks things for everybody using word_similarity() currently.
If the previous discussion of this topic concluded that
word_similarity() was an OK name despite being a slight misnomer, I
don't think we should change our mind now. Instead the new function
can be called something which makes the difference clear, e.g.
strict_word_similarity(), and the old function can remain as it is.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Dec 7, 2017 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 7, 2017 at 7:51 AM, Jan Przemysław Wójcik
<jan.przemyslaw.wojcik@gmail.com> wrote:I'm afraid that creating a function that implements quite different
algorithms depending on a global parameter seems very hacky and wouldlead
to misunderstandings. I do understand the need of backward compatibility,
but I'd opt for the lesser evil. Perhaps a good idea would be to changethe
name to 'substring_similarity()' and introduce the new function
'word_similarity()' later, for example in the next major version release.That breaks things for everybody using word_similarity() currently.
If the previous discussion of this topic concluded that
word_similarity() was an OK name despite being a slight misnomer, I
don't think we should change our mind now. Instead the new function
can be called something which makes the difference clear, e.g.
strict_word_similarity(), and the old function can remain as it is.
+1
Thank you for pointing this. Yes, it would be better not to change
existing names and behavior, but adjust documentation and add alternative
behavior with another name.
Therefore, I'm going to provide patchset of two patches:
1) Improve word_similarity() documentation.
2) Add new function strict_word_similarity() (or whatever better name we
invent).
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Dec 8, 2017 at 2:50 PM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:
On Thu, Dec 7, 2017 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 7, 2017 at 7:51 AM, Jan Przemysław Wójcik
<jan.przemyslaw.wojcik@gmail.com> wrote:I'm afraid that creating a function that implements quite different
algorithms depending on a global parameter seems very hacky and wouldlead
to misunderstandings. I do understand the need of backward
compatibility,
but I'd opt for the lesser evil. Perhaps a good idea would be to change
the
name to 'substring_similarity()' and introduce the new function
'word_similarity()' later, for example in the next major versionrelease.
That breaks things for everybody using word_similarity() currently.
If the previous discussion of this topic concluded that
word_similarity() was an OK name despite being a slight misnomer, I
don't think we should change our mind now. Instead the new function
can be called something which makes the difference clear, e.g.
strict_word_similarity(), and the old function can remain as it is.+1
Thank you for pointing this. Yes, it would be better not to change
existing names and behavior, but adjust documentation and add alternative
behavior with another name.
Therefore, I'm going to provide patchset of two patches:
1) Improve word_similarity() documentation.
2) Add new function strict_word_similarity() (or whatever better name we
invent).
Please, find patchset attached.
0001-pg-trgm-word-similarity-docs-improvement.patch – contains improvement
to documentation of word_similarity() and related operators. I decided to
give formal definition first (what exactly it internally does), and then
example and some more human-understandable description. This patch also
adjusts two comments where lower and upper bounds mess up.
0002-pg-trgm-strict_word-similarity.patch – implementation of
strict_word_similarity() with comments, docs and tests.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
0002-pg-trgm-strict_word-similarity.patch – implementation of
strict_word_similarity() with comments, docs and tests.
The patch looks commmitable, but sometime I get
*** ...pgsql/contrib/pg_trgm/expected/pg_strict_word_trgm.out 2017-12-12
14:16:55.190989000 +0300
--- .../pgsql/contrib/pg_trgm/results/pg_strict_word_trgm.out 2017-12-12
14:17:27.645639000 +0300
***************
*** 153,160 ****
----------+----------------------------------
0 | Kabankala
0.25 | Kabankalan City Public Plaza
- 0.416667 | Kabakala
0.416667 | Abankala
0.538462 | Kabikala
0.625 | Ntombankala School
0.642857 | Nehalla Bankalah Reserved Forest
--- 153,160 ----
----------+----------------------------------
0 | Kabankala
0.25 | Kabankalan City Public Plaza
0.416667 | Abankala
+ 0.416667 | Kabakala
0.538462 | Kabikala
0.625 | Ntombankala School
0.642857 | Nehalla Bankalah Reserved Forest
======================================================================
Seems, some stability order should be added to tests
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
0002-pg-trgm-strict_word-similarity.patch – implementation of
strict_word_similarity() with comments, docs and tests.
After some looking in
1)
repeated piece of code:
+ if (strategy == SimilarityStrategyNumber)
+ nlimit = similarity_threshold;
+ else if (strategy == WordSimilarityStrategyNumber)
+ nlimit = word_similarity_threshold;
+ else /* strategy == StrictWordSimilarityStrategyNumber */
+ nlimit = strict_word_similarity_threshold;
Isn't it better to move that piece to separate function?
2)
calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
bool check_only, bool word_bounds)
Seems, two bools args are replaceble to bitwise-ORed flag. It will simplify
adding new options in future.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
0001-pg-trgm-word-similarity-docs-improvement.patch – contains improvement to
documentation of word_similarity() and related operators. I decided to give
formal definition first (what exactly it internally does), and then example and
some more human-understandable description. This patch also adjusts two
comments where lower and upper bounds mess up.
I'm ready for commit that, but I'd like someone from native English speaker to
check that. Thank you.
And, suppose, this patch should be backpatched to 9.6
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Tue, Dec 12, 2017 at 2:33 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
0002-pg-trgm-strict_word-similarity.patch – implementation of
strict_word_similarity() with comments, docs and tests.
After some looking in
1) repeated piece of code: + if (strategy == SimilarityStrategyNumber) + nlimit = similarity_threshold; + else if (strategy == WordSimilarityStrategyNumber) + nlimit = word_similarity_threshold; + else /* strategy == StrictWordSimilarityStrategyNumber */ + nlimit = strict_word_similarity_threshold; Isn't it better to move that piece to separate function?
Good point. Moved to separate function.
2)
calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
bool check_only, bool word_bounds)Seems, two bools args are replaceble to bitwise-ORed flag. It will
simplify adding new options in future.
Yep. I've introduced flags.
Also, I've adjusted tests to make them stable (found example where TOP-8
distances are unique).
Please, find revised patch in attachment.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
0002-pg-trgm-strict_word-similarity-2.patchapplication/octet-stream; name=0002-pg-trgm-strict_word-similarity-2.patchDownload+1462-61
On Wed, Dec 13, 2017 at 2:13 PM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:
On Tue, Dec 12, 2017 at 2:33 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
0002-pg-trgm-strict_word-similarity.patch – implementation of
strict_word_similarity() with comments, docs and tests.
After some looking in
1) repeated piece of code: + if (strategy == SimilarityStrategyNumber) + nlimit = similarity_threshold; + else if (strategy == WordSimilarityStrategyNumber) + nlimit = word_similarity_threshold; + else /* strategy == StrictWordSimilarityStrategyNumber */ + nlimit = strict_word_similarity_threshold; Isn't it better to move that piece to separate function?Good point. Moved to separate function.
2)
calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
bool check_only, bool word_bounds)Seems, two bools args are replaceble to bitwise-ORed flag. It will
simplify adding new options in future.Yep. I've introduced flags.
Also, I've adjusted tests to make them stable (found example where TOP-8
distances are unique).
Please, find revised patch in attachment.
I just found that patch apply is failed according to commitfest.cputube.org.
I think it's because I sent only second patch from patchset in last message.
Anyway I resend both patches rebased to current master.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi Alexander,
On 1/4/18 4:25 PM, Alexander Korotkov wrote:
I just found that patch apply is failed according to
commitfest.cputube.org <http://commitfest.cputube.org>. I think it's
because I sent only second patch from patchset in last message.
Anyway I resend both patches rebased to current master.
I agree with Teodor (upthread, not quoted here) that the documentation
could use some editing.
I started to do it myself, but quickly realized I have no knowledge of
the content. I'm afraid I would destroy the meaning while updating the
grammar.
Anyone understand the subject matter well enough to review the
documentation?
Thanks,
--
-David
david@pgmasters.net
On Thu, Mar 1, 2018 at 11:05 PM, David Steele <david@pgmasters.net> wrote:
On 1/4/18 4:25 PM, Alexander Korotkov wrote:
I just found that patch apply is failed according to
commitfest.cputube.org <http://commitfest.cputube.org>. I think it's
because I sent only second patch from patchset in last message.
Anyway I resend both patches rebased to current master.I agree with Teodor (upthread, not quoted here) that the documentation
could use some editing.I started to do it myself, but quickly realized I have no knowledge of
the content. I'm afraid I would destroy the meaning while updating the
grammar.
That's to problem. If you're willing to help you can edit the documentation
and let me review that it's correct. Also feel free to ask any questions
and
more explanation from me. Ultimately, we need to have a documentation
that any average user can understand, not to mention you :)
Anyone understand the subject matter well enough to review the
documentation?
I expect it would be hard to find anybody matching this criteria. But it
would be nice to find one though.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company