pg_trgm

Started by Tatsuo Ishiialmost 16 years ago40 messageshackers
Jump to latest
#1Tatsuo Ishii
t-ishii@sra.co.jp

Hi,

Anyone working on make contrib/pg_trgm mutibyte encoding aware? If
not, I'm interested in the work.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#2Teodor Sigaev
teodor@sigaev.ru
In reply to: Tatsuo Ishii (#1)
Re: pg_trgm

Anyone working on make contrib/pg_trgm mutibyte encoding aware? If
not, I'm interested in the work.

It's already multibyte safe since 8.4
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#3Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Teodor Sigaev (#2)
Re: pg_trgm

It's already multibyte safe since 8.4

No, it doesn't.

$ psql test
Pager usage is off.
psql (8.4.4)
Type "help" for help.

test=# select similarity('abc', 'abd'); -- OK
similarity
------------
0.333333
(1 row)

test=# select similarity('日本語', '日本後'); -- NG
similarity
------------
NaN
(1 row)

test=# select show_trgm('abc'); -- OK
show_trgm
-------------------------
{" a"," ab",abc,"bc "}
(1 row)

test=# select show_trgm('日本語'); -- NG
show_trgm
-----------
{}
(1 row)

Encoding is EUC_JP, locale is C. Included is the script to reproduce
the problem.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Attachments:

pg_trgm.sqlapplication/octet-streamDownload
#4Andres Freund
andres@anarazel.de
In reply to: Tatsuo Ishii (#3)
Re: pg_trgm

Hi,

On Thursday 27 May 2010 13:53:37 Tatsuo Ishii wrote:

It's already multibyte safe since 8.4

No, it doesn't.
Encoding is EUC_JP, locale is C. Included is the script to reproduce
the problem.

test=# select show_trgm('日本語');
show_trgm
---------------------------------------
{0x8194c0,0x836e53,0x1dc363,0x1e22e9}
(1 row)

Time: 0.443 ms
test=# select similarity('日本語', '日本後');
similarity
------------
0.333333
(1 row)

Time: 0.426 ms

Encoding is UTF-8...

Andres

#5Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Andres Freund (#4)
Re: pg_trgm

No, it doesn't.
Encoding is EUC_JP, locale is C. Included is the script to reproduce
the problem.

test=# select show_trgm('日本語');
show_trgm
---------------------------------------
{0x8194c0,0x836e53,0x1dc363,0x1e22e9}
(1 row)

Time: 0.443 ms
test=# select similarity('日本語', '日本後');
similarity
------------
0.333333
(1 row)

Time: 0.426 ms

Encoding is UTF-8...

What is your locale?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#6Andres Freund
andres@anarazel.de
In reply to: Tatsuo Ishii (#5)
Re: pg_trgm

On Thursday 27 May 2010 14:40:41 Tatsuo Ishii wrote:

No, it doesn't.
Encoding is EUC_JP, locale is C. Included is the script to reproduce
the problem.

test=# select show_trgm('日本語');

show_trgm

---------------------------------------

{0x8194c0,0x836e53,0x1dc363,0x1e22e9}

(1 row)

Time: 0.443 ms
test=# select similarity('日本語', '日本後');

similarity

------------

0.333333

(1 row)

Time: 0.426 ms

Encoding is UTF-8...

What is your locale?

It was en_EN.UTF-8. Interesting. With C it fails...

Andres

#7Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Andres Freund (#6)
Re: pg_trgm

What is your locale?

It was en_EN.UTF-8. Interesting. With C it fails...

Yes, pg_trgm seems to have problems with multibyte + C locale.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#7)
Re: pg_trgm

Tatsuo Ishii <ishii@postgresql.org> writes:

What is your locale?

It was en_EN.UTF-8. Interesting. With C it fails...

Yes, pg_trgm seems to have problems with multibyte + C locale.

It's not a problem, it's just pilot error, or possibly inadequate
documentation. pg_trgm uses the locale's definition of "alpha",
"digit", etc. In C locale only basic ASCII letters and digits will be
recognized as word constituents.

regards, tom lane

#9Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#8)
Re: pg_trgm

Yes, pg_trgm seems to have problems with multibyte + C locale.

It's not a problem, it's just pilot error, or possibly inadequate
documentation. pg_trgm uses the locale's definition of "alpha",
"digit", etc. In C locale only basic ASCII letters and digits will be
recognized as word constituents.

That means there is no chance to make pg_trgm work with multibyte + C
locale? If so, I will leave pg_trgm as it is and provide private
patches for those who need the functionality.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#9)
Re: pg_trgm

Tatsuo Ishii <ishii@postgresql.org> writes:

It's not a problem, it's just pilot error, or possibly inadequate
documentation. pg_trgm uses the locale's definition of "alpha",
"digit", etc. In C locale only basic ASCII letters and digits will be
recognized as word constituents.

That means there is no chance to make pg_trgm work with multibyte + C
locale? If so, I will leave pg_trgm as it is and provide private
patches for those who need the functionality.

Exactly what do you consider to be the missing functionality?
You need a notion of word vs non-word character from somewhere,
and the locale setting is the standard place to get that. The
core text search functionality behaves the same way.

regards, tom lane

#11Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#10)
Re: pg_trgm

Exactly what do you consider to be the missing functionality?
You need a notion of word vs non-word character from somewhere,
and the locale setting is the standard place to get that. The
core text search functionality behaves the same way.

No. Text search works fine with multibyte + C locale.

Anyway locale is completely usesless for finding word vs non-character
an agglutinative language such as Japanese.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#11)
Re: pg_trgm

Tatsuo Ishii <ishii@sraoss.co.jp> writes:

Anyway locale is completely usesless for finding word vs non-character
an agglutinative language such as Japanese.

Well, that doesn't mean that the answer is to use C locale ;-)

However, you could possibly think about making this bit of code
more flexible:

#ifdef KEEPONLYALNUM
#define iswordchr(c) (t_isalpha(c) || t_isdigit(c))
#else
#define iswordchr(c) (!t_isspace(c))
#endif

Currently it seems to be hard-wired to the first case in standard
builds.

regards, tom lane

#13Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#12)
Re: pg_trgm

Well, that doesn't mean that the answer is to use C locale ;-)

Of course it's up to user whether to use C locale or not. I just want
pg_trgm work with C locale as well.

However, you could possibly think about making this bit of code
more flexible:

#ifdef KEEPONLYALNUM
#define iswordchr(c) (t_isalpha(c) || t_isdigit(c))
#else
#define iswordchr(c) (!t_isspace(c))
#endif

Currently it seems to be hard-wired to the first case in standard
builds.

Yup. Here is the patch in my mind:

*** trgm_op.c~	2009-06-11 23:48:51.000000000 +0900
--- trgm_op.c	2010-05-27 23:38:20.000000000 +0900
***************
*** 59,65 ****
  }
  #ifdef KEEPONLYALNUM
! #define iswordchr(c)	(t_isalpha(c) || t_isdigit(c))
  #else
  #define iswordchr(c)	(!t_isspace(c))
  #endif
--- 59,65 ----
  }

#ifdef KEEPONLYALNUM
! #define iswordchr(c) (t_isalpha(c) || t_isdigit(c) || (lc_ctype_is_c() && !t_isspace(c)))
#else
#define iswordchr(c) (!t_isspace(c))
#endif

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#13)
Re: pg_trgm

Tatsuo Ishii <ishii@postgresql.org> writes:

! #define iswordchr(c) (t_isalpha(c) || t_isdigit(c) || (lc_ctype_is_c() && !t_isspace(c)))

This seems entirely arbitrary. It might "fix" things in your view
but it will break the longstanding behavior for other people.

I think a more appropriate type of fix would be to expose the
KEEPONLYALNUM option as a GUC, or some other way of letting the
user decide what he wants.

regards, tom lane

#15Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#14)
Re: pg_trgm

On Thu, May 27, 2010 at 3:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think a more appropriate type of fix would be to expose the
KEEPONLYALNUM option as a GUC, or some other way of letting the
user decide what he wants.

So I think a GUC is broken because pg_tgrm has a index opclasses and
any indexes built using one setting will be broken if the GUC is
changed.

Perhaps we need two sets of functions (which presumably call the same
implementation with a flag to indicate which definition to use). Then
you can define an index using one or the other and the meaning would
be stable.

--
greg

#16Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#11)
Re: pg_trgm

On tor, 2010-05-27 at 23:20 +0900, Tatsuo Ishii wrote:

Anyway locale is completely usesless for finding word vs non-character
an agglutinative language such as Japanese.

I don't know about Japanese, but the locale approach works just fine for
other agglutinative languages. I would rather suspect that it is the
trigram approach that might be rather useless for such languages,
because you are going to get a lot of similarity hits for the affixes.

#17Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#16)
Re: pg_trgm

I don't know about Japanese, but the locale approach works just fine for
other agglutinative languages. I would rather suspect that it is the
trigram approach that might be rather useless for such languages,
because you are going to get a lot of similarity hits for the affixes.

I'm not sure what you mean by "affixes". But I will explain...

A Japanese sentence consists of words. Problem is, each word is not
separated by space (agglutinative). So most text tools such as text
search need preprocess which finds word boundaries by looking up
dictionaries (and smart grammer analysis routine). In the process
"affixes" can be determined and perhaps removed from the target word
group to be used for text search (note that removing affixes is no
relevant to locale). Once we get space separated sentence, it can be
processed by text search or by pg_trgm just same as Engligh. (Note
that these preprocessing are done outside PostgreSQL world). The
difference is just the "word" can be consists of non ASCII letters.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#18Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Bruce Momjian (#15)
Re: pg_trgm

So I think a GUC is broken because pg_tgrm has a index opclasses and
any indexes built using one setting will be broken if the GUC is
changed.

Perhaps we need two sets of functions (which presumably call the same
implementation with a flag to indicate which definition to use). Then
you can define an index using one or the other and the meaning would
be stable.

It's worse. pg_trgm has another compile option "IGNORECASE" which
might affect index opclasses.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#19Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#17)
Re: pg_trgm

On fre, 2010-05-28 at 00:46 +0900, Tatsuo Ishii wrote:

I don't know about Japanese, but the locale approach works just fine for
other agglutinative languages. I would rather suspect that it is the
trigram approach that might be rather useless for such languages,
because you are going to get a lot of similarity hits for the affixes.

I'm not sure what you mean by "affixes". But I will explain...

A Japanese sentence consists of words. Problem is, each word is not
separated by space (agglutinative). So most text tools such as text
search need preprocess which finds word boundaries by looking up
dictionaries (and smart grammer analysis routine). In the process
"affixes" can be determined and perhaps removed from the target word
group to be used for text search (note that removing affixes is no
relevant to locale). Once we get space separated sentence, it can be
processed by text search or by pg_trgm just same as Engligh. (Note
that these preprocessing are done outside PostgreSQL world). The
difference is just the "word" can be consists of non ASCII letters.

I think the problem at hand has nothing at all to do with agglutination
or CJK-specific issues. You will get the same problem with other
languages *if* you set a locale that does not adequately support the
characters in use. E.g., Russian with locale C and encoding UTF8:

select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E
\u043D\u044B');
similarity
────────────
NaN
(1 row)

#20Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#19)
Re: pg_trgm

On Thu, May 27, 2010 at 2:01 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

On fre, 2010-05-28 at 00:46 +0900, Tatsuo Ishii wrote:

I don't know about Japanese, but the locale approach works just fine for
other agglutinative languages.  I would rather suspect that it is the
trigram approach that might be rather useless for such languages,
because you are going to get a lot of similarity hits for the affixes.

I'm not sure what you mean by "affixes".  But I will explain...

A Japanese sentence consists of words. Problem is, each word is not
separated by space (agglutinative). So most text tools such as text
search need preprocess which finds word boundaries by looking up
dictionaries (and smart grammer analysis routine). In the process
"affixes" can be determined and perhaps removed from the target word
group to be used for text search (note that removing affixes is no
relevant to locale). Once we get space separated sentence, it can be
processed by text search or by pg_trgm just same as Engligh. (Note
that these preprocessing are done outside PostgreSQL world). The
difference is just the "word" can be consists of non ASCII letters.

I think the problem at hand has nothing at all to do with agglutination
or CJK-specific issues.  You will get the same problem with other
languages *if* you set a locale that does not adequately support the
characters in use.  E.g., Russian with locale C and encoding UTF8:

select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E
\u043D\u044B');
 similarity
────────────
       NaN
(1 row)

What I can't help wondering as I'm reading this discussion is -
Tatsuo-san said upthread that he has a problem with pg_trgm that he
does not have with full text search. So what is full text search
doing differently than pg_trgm?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#21Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#19)
#22Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Robert Haas (#20)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#22)
#24Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#23)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#21)
#27Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#25)
#28Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#19)
#29Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#26)
#30Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#28)
#31Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tatsuo Ishii (#29)
#32Bruce Momjian
bruce@momjian.us
In reply to: Tatsuo Ishii (#31)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#31)
#34Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#33)
#35Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#30)
#36Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#35)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#34)
#38Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#36)
#39Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#37)
#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#39)