Can pg_trgm handle non-alphanumeric characters?

Started by MauMauover 13 years ago12 messages
#1MauMau
maumau307@gmail.com

Hello,

This question may be appropriate for pgsql-general, but let me ask here
because the only relevant discussion seems to have been done on
pgsql-hackers:

http://archives.postgresql.org/pgsql-hackers/2011-09/msg00169.php

Can pg_trgm support non-alphanumeric characters by undefining KEEPONLYALNUM
in contrib/pg_trgm/trgm.h? If no, what kind of problems would happen?

Regards
MauMau

#2Fujii Masao
masao.fujii@gmail.com
In reply to: MauMau (#1)
Re: Can pg_trgm handle non-alphanumeric characters?

On Wed, May 9, 2012 at 9:10 PM, MauMau <maumau307@gmail.com> wrote:

Hello,

This question may be appropriate for pgsql-general, but let me ask here
because the only relevant discussion seems to have been done on
pgsql-hackers:

http://archives.postgresql.org/pgsql-hackers/2011-09/msg00169.php

Can pg_trgm support non-alphanumeric characters by undefining KEEPONLYALNUM
in contrib/pg_trgm/trgm.h?

Yes unless I'm missing something.

Regards,

--
Fujii Masao

#3MauMau
maumau307@gmail.com
In reply to: Fujii Masao (#2)
Re: Can pg_trgm handle non-alphanumeric characters?

From: "Fujii Masao" <masao.fujii@gmail.com>

On Wed, May 9, 2012 at 9:10 PM, MauMau <maumau307@gmail.com> wrote:

This question may be appropriate for pgsql-general, but let me ask here
because the only relevant discussion seems to have been done on
pgsql-hackers:

http://archives.postgresql.org/pgsql-hackers/2011-09/msg00169.php

Can pg_trgm support non-alphanumeric characters by undefining
KEEPONLYALNUM
in contrib/pg_trgm/trgm.h?

Yes unless I'm missing something.

Then, does it make sense to remove "#define KEEPONLYALNUM" in 9.1.4? Would
it cause any problems? If no, I wish that, because it eliminates the need to
do the removal every time the users applies minor releases.

Regards
MauMau

#4Euler Taveira
euler@timbira.com
In reply to: MauMau (#3)
Re: Can pg_trgm handle non-alphanumeric characters?

On 09-05-2012 19:17, MauMau wrote:

Then, does it make sense to remove "#define KEEPONLYALNUM" in 9.1.4? Would it
cause any problems? If no, I wish that, because it eliminates the need to do
the removal every time the users applies minor releases.

If you do so, you'll break minor versions. IMHO the default is the desirable
behavior for almost all use cases (you are the first one that complain about
it). Maybe in the future, we should be able to flip this flag without
rebuilding binaries.

--
Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

#5Fujii Masao
masao.fujii@gmail.com
In reply to: Euler Taveira (#4)
Re: Can pg_trgm handle non-alphanumeric characters?

On Thu, May 10, 2012 at 8:18 AM, Euler Taveira <euler@timbira.com> wrote:

On 09-05-2012 19:17, MauMau wrote:

Then, does it make sense to remove "#define KEEPONLYALNUM" in 9.1.4? Would it
cause any problems? If no, I wish that, because it eliminates the need to do
the removal every time the users applies minor releases.

If you do so, you'll break minor versions.

Right. And removing KEEPONLYALNUM is a feature change rather than bug fix,
so that should be proposed during major version development cycle.

IMHO the default is the desirable
behavior for almost all use cases (you are the first one that complain about
it).

Really? I was thinking non-English users (including me) basicaly would not be
satisfied with the default because they cannot use pg_trgm for N-gram full text
search of non-English text. Though I agree some users would prefer the default.

Maybe in the future, we should be able to flip this flag without
rebuilding binaries.

Agreed.

Regards,

--
Fujii Masao

#6MauMau
maumau307@gmail.com
In reply to: Fujii Masao (#5)
Re: Can pg_trgm handle non-alphanumeric characters?

From: "Fujii Masao" <masao.fujii@gmail.com>

On Thu, May 10, 2012 at 8:18 AM, Euler Taveira <euler@timbira.com> wrote:

On 09-05-2012 19:17, MauMau wrote:

Then, does it make sense to remove "#define KEEPONLYALNUM" in 9.1.4?
Would it
cause any problems? If no, I wish that, because it eliminates the need
to do
the removal every time the users applies minor releases.

If you do so, you'll break minor versions.

Right. And removing KEEPONLYALNUM is a feature change rather than bug fix,
so that should be proposed during major version development cycle.

For information, what kind of breakage would occur? Is it performance
degradation, extra index storage consumption, or undesirable query results?
I imagined removing KEEPONLYALNUM would just accept non-alphanumeric
characters and cause no harm to those who use only alphanumeric characters.

Regards
MauMau

#7Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: MauMau (#6)
Re: Can pg_trgm handle non-alphanumeric characters?

"MauMau" <maumau307@gmail.com> wrote:

On 09-05-2012 19:17, MauMau wrote:

Then, does it make sense to remove "#define KEEPONLYALNUM" in
9.1.4? Would it cause any problems?

Yes, it will cause problems.

For information, what kind of breakage would occur?

I imagined removing KEEPONLYALNUM would just accept
non-alphanumeric characters and cause no harm to those who use
only alphanumeric characters.

This would break our current usages because of the handling of
trigrams at the "edges" of groups of qualifying characters. It
would make similarity (and distance) values less useful for our
current name searches using it. To simulate the effect, I used an
'8' in place of a comma instead of recompiling with the suggested
change.

test=# select show_trgm('smith,john');
show_trgm
-----------------------------------------------------------
{" j"," s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "}
(1 row)

test=# select show_trgm('smith8john');
show_trgm
-----------------------------------------------------
{" s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8}
(1 row)

test=# select similarity('smith,john', 'jon smith');
similarity
------------
0.615385
(1 row)

test=# select similarity('smith8john', 'jon smith');
similarity
------------
0.3125
(1 row)

So making the proposed change unconditionally could indeed hurt
current users of the technique. On the other hand, if there was
fine-grained control of this, it might make trigrams useful for
searching statute cites (using all characters) as well as names
(using the current character set); so I wouldn't want it to just be
controlled by a global GUC.

-Kevin

#8MauMau
maumau307@gmail.com
In reply to: Kevin Grittner (#7)
Re: Can pg_trgm handle non-alphanumeric characters?

From: "Kevin Grittner" <Kevin.Grittner@wicourts.gov>

"MauMau" <maumau307@gmail.com> wrote:

For information, what kind of breakage would occur?

I imagined removing KEEPONLYALNUM would just accept
non-alphanumeric characters and cause no harm to those who use
only alphanumeric characters.

This would break our current usages because of the handling of
trigrams at the "edges" of groups of qualifying characters. It
would make similarity (and distance) values less useful for our
current name searches using it. To simulate the effect, I used an
'8' in place of a comma instead of recompiling with the suggested
change.

test=# select show_trgm('smith,john');
show_trgm
-----------------------------------------------------------
{" j"," s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "}
(1 row)

test=# select show_trgm('smith8john');
show_trgm
-----------------------------------------------------
{" s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8}
(1 row)

test=# select similarity('smith,john', 'jon smith');
similarity
------------
0.615385
(1 row)

test=# select similarity('smith8john', 'jon smith');
similarity
------------
0.3125
(1 row)

So making the proposed change unconditionally could indeed hurt
current users of the technique. On the other hand, if there was
fine-grained control of this, it might make trigrams useful for
searching statute cites (using all characters) as well as names
(using the current character set); so I wouldn't want it to just be
controlled by a global GUC.

Thanks for your explanation. Although I haven't understood it well yet, I'll
consider what you taught. And I'll consider if the tentative measure of
removing KEEPONLYALNUM is correct for someone who wants to use pg_trgm
against Japanese text.

Regards
MauMau

#9Fujii Masao
masao.fujii@gmail.com
In reply to: MauMau (#8)
Re: Can pg_trgm handle non-alphanumeric characters?

On Fri, May 11, 2012 at 12:07 AM, MauMau <maumau307@gmail.com> wrote:

Thanks for your explanation. Although I haven't understood it well yet, I'll
consider what you taught. And I'll consider if the tentative measure of
removing KEEPONLYALNUM is correct for someone who wants to use pg_trgm
against Japanese text.

In Japanese, it's common to do a text search with two characters keyword.
But since pg_trgm is 3-gram, you basically would not be able to use index
for such text search. So you might need something like pg_bigm or pg_unigm
for Japanese text search.

Regards,

--
Fujii Masao

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#9)
Re: Can pg_trgm handle non-alphanumeric characters?

Fujii Masao <masao.fujii@gmail.com> writes:

On Fri, May 11, 2012 at 12:07 AM, MauMau <maumau307@gmail.com> wrote:

Thanks for your explanation. Although I haven't understood it well yet, I'll
consider what you taught. And I'll consider if the tentative measure of
removing KEEPONLYALNUM is correct for someone who wants to use pg_trgm
against Japanese text.

In Japanese, it's common to do a text search with two characters keyword.
But since pg_trgm is 3-gram, you basically would not be able to use index
for such text search. So you might need something like pg_bigm or pg_unigm
for Japanese text search.

I believe the trigrams are three *bytes* not three characters. So a
couple of kanji should work just fine for this.

regards, tom lane

#11MauMau
maumau307@gmail.com
In reply to: Kevin Grittner (#7)
Re: Can pg_trgm handle non-alphanumeric characters?

From: "Kevin Grittner" <Kevin.Grittner@wicourts.gov>

"MauMau" <maumau307@gmail.com> wrote:

For information, what kind of breakage would occur?

I imagined removing KEEPONLYALNUM would just accept
non-alphanumeric characters and cause no harm to those who use
only alphanumeric characters.

This would break our current usages because of the handling of
trigrams at the "edges" of groups of qualifying characters. It
would make similarity (and distance) values less useful for our
current name searches using it. To simulate the effect, I used an
'8' in place of a comma instead of recompiling with the suggested
change.

test=# select show_trgm('smith,john');
show_trgm
-----------------------------------------------------------
{" j"," s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "}
(1 row)

test=# select show_trgm('smith8john');
show_trgm
-----------------------------------------------------
{" s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8}
(1 row)

test=# select similarity('smith,john', 'jon smith');
similarity
------------
0.615385
(1 row)

test=# select similarity('smith8john', 'jon smith');
similarity
------------
0.3125
(1 row)

So making the proposed change unconditionally could indeed hurt
current users of the technique. On the other hand, if there was
fine-grained control of this, it might make trigrams useful for
searching statute cites (using all characters) as well as names
(using the current character set); so I wouldn't want it to just be
controlled by a global GUC.

Thank you for your concise examples. I probably got it.

From your examples, I thought KEEPONLYALNUM controls whether
non-alphanumeric characters are included in trigrams, though I haven't read
the code of pg_trgm. So, removing KEEPONLYALNUM definition produces trigrams
unnecessary for users who handle only alphanumeric text. That would lead to
undesirable query results.

Then, I wonder what would be the ideal specification...to add
alphanumeric/non-alphanumeric boolean switch to similarity() function, add
non-alphanumeric version of operators (ex. %* and <->*) and non-alphanumeric
version of operator classes (ex. gin_allchars_trgm_ops)? At least, I
understood the fix is not appropriate for minor releases.

Regards
MauMau

#12Fujii Masao
masao.fujii@gmail.com
In reply to: Tom Lane (#10)
Re: Can pg_trgm handle non-alphanumeric characters?

On Fri, May 11, 2012 at 4:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Fujii Masao <masao.fujii@gmail.com> writes:

On Fri, May 11, 2012 at 12:07 AM, MauMau <maumau307@gmail.com> wrote:

Thanks for your explanation. Although I haven't understood it well yet, I'll
consider what you taught. And I'll consider if the tentative measure of
removing KEEPONLYALNUM is correct for someone who wants to use pg_trgm
against Japanese text.

In Japanese, it's common to do a text search with two characters keyword.
But since pg_trgm is 3-gram, you basically would not be able to use index
for such text search. So you might need something like pg_bigm or pg_unigm
for Japanese text search.

Even if an index can be used for two characters text search, bitmap index scan
picks up all rows, so it's too slow.

I believe the trigrams are three *bytes* not three characters.  So a
couple of kanji should work just fine for this.

Really? As far as I read the code of pg_trgm, the trigram is three characters
and its CRC32 is used as an index key if its size is more than three bytes.

Regards,

--
Fujii Masao