pg_trgm

Started by Tatsuo Ishiiabout 16 years ago40 messageshackers

t-ishii@sra.co.jp

about 16 years ago

Hi,

Anyone working on make contrib/pg_trgm mutibyte encoding aware? If
not, I'm interested in the work.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Teodor Sigaev

teodor@sigaev.ru

about 16 years ago

In reply to: Tatsuo Ishii (#1)

Re: pg_trgm

Anyone working on make contrib/pg_trgm mutibyte encoding aware? If
not, I'm interested in the work.

It's already multibyte safe since 8.4
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

Tatsuo Ishii

t-ishii@sra.co.jp

about 16 years ago

In reply to: Teodor Sigaev (#2)

Re: pg_trgm

It's already multibyte safe since 8.4

No, it doesn't.

$ psql test
Pager usage is off.
psql (8.4.4)
Type "help" for help.

test=# select similarity('abc', 'abd'); -- OK
similarity
------------
0.333333
(1 row)

test=# select similarity('日本語', '日本後'); -- NG
similarity
------------
NaN
(1 row)

test=# select show_trgm('abc'); -- OK
show_trgm
-------------------------
{" a"," ab",abc,"bc "}
(1 row)

test=# select show_trgm('日本語'); -- NG
show_trgm
-----------
{}
(1 row)

Encoding is EUC_JP, locale is C. Included is the script to reproduce
the problem.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Andres Freund

andres@anarazel.de

about 16 years ago

In reply to: Tatsuo Ishii (#3)

Re: pg_trgm

Hi,

On Thursday 27 May 2010 13:53:37 Tatsuo Ishii wrote:

It's already multibyte safe since 8.4

No, it doesn't.
Encoding is EUC_JP, locale is C. Included is the script to reproduce
the problem.

test=# select show_trgm('日本語');
show_trgm
---------------------------------------
{0x8194c0,0x836e53,0x1dc363,0x1e22e9}
(1 row)

Time: 0.443 ms
test=# select similarity('日本語', '日本後');
similarity
------------
0.333333
(1 row)

Time: 0.426 ms

Encoding is UTF-8...

Andres

Tatsuo Ishii

t-ishii@sra.co.jp

about 16 years ago

In reply to: Andres Freund (#4)

Re: pg_trgm

No, it doesn't.
Encoding is EUC_JP, locale is C. Included is the script to reproduce
the problem.

test=# select show_trgm('日本語');
show_trgm
---------------------------------------
{0x8194c0,0x836e53,0x1dc363,0x1e22e9}
(1 row)

Time: 0.443 ms
test=# select similarity('日本語', '日本後');
similarity
------------
0.333333
(1 row)

Time: 0.426 ms

Encoding is UTF-8...

What is your locale?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Andres Freund

andres@anarazel.de

about 16 years ago

In reply to: Tatsuo Ishii (#5)

Re: pg_trgm

On Thursday 27 May 2010 14:40:41 Tatsuo Ishii wrote:

No, it doesn't.
Encoding is EUC_JP, locale is C. Included is the script to reproduce
the problem.

test=# select show_trgm('日本語');

show_trgm

---------------------------------------

{0x8194c0,0x836e53,0x1dc363,0x1e22e9}

(1 row)

Time: 0.443 ms
test=# select similarity('日本語', '日本後');

similarity

------------

0.333333

(1 row)

Time: 0.426 ms

Encoding is UTF-8...

What is your locale?

It was en_EN.UTF-8. Interesting. With C it fails...

Andres

Tatsuo Ishii

t-ishii@sra.co.jp

about 16 years ago

In reply to: Andres Freund (#6)

Re: pg_trgm

What is your locale?

It was en_EN.UTF-8. Interesting. With C it fails...

Yes, pg_trgm seems to have problems with multibyte + C locale.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Tom Lane

tgl@sss.pgh.pa.us

about 16 years ago

In reply to: Tatsuo Ishii (#7)

Re: pg_trgm

Tatsuo Ishii <ishii@postgresql.org> writes:

What is your locale?

It was en_EN.UTF-8. Interesting. With C it fails...

Yes, pg_trgm seems to have problems with multibyte + C locale.

It's not a problem, it's just pilot error, or possibly inadequate
documentation. pg_trgm uses the locale's definition of "alpha",
"digit", etc. In C locale only basic ASCII letters and digits will be
recognized as word constituents.

regards, tom lane

Tatsuo Ishii

t-ishii@sra.co.jp

about 16 years ago

In reply to: Tom Lane (#8)

Re: pg_trgm

Yes, pg_trgm seems to have problems with multibyte + C locale.

It's not a problem, it's just pilot error, or possibly inadequate
documentation. pg_trgm uses the locale's definition of "alpha",
"digit", etc. In C locale only basic ASCII letters and digits will be
recognized as word constituents.

That means there is no chance to make pg_trgm work with multibyte + C
locale? If so, I will leave pg_trgm as it is and provide private
patches for those who need the functionality.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#10

Tom Lane

tgl@sss.pgh.pa.us

about 16 years ago

In reply to: Tatsuo Ishii (#9)

Re: pg_trgm

Tatsuo Ishii <ishii@postgresql.org> writes:

It's not a problem, it's just pilot error, or possibly inadequate
documentation. pg_trgm uses the locale's definition of "alpha",
"digit", etc. In C locale only basic ASCII letters and digits will be
recognized as word constituents.

That means there is no chance to make pg_trgm work with multibyte + C
locale? If so, I will leave pg_trgm as it is and provide private
patches for those who need the functionality.

Exactly what do you consider to be the missing functionality?
You need a notion of word vs non-word character from somewhere,
and the locale setting is the standard place to get that. The
core text search functionality behaves the same way.

regards, tom lane

#11

Tatsuo Ishii

t-ishii@sra.co.jp

about 16 years ago

In reply to: Tom Lane (#10)

Re: pg_trgm

Exactly what do you consider to be the missing functionality?
You need a notion of word vs non-word character from somewhere,
and the locale setting is the standard place to get that. The
core text search functionality behaves the same way.

No. Text search works fine with multibyte + C locale.

Anyway locale is completely usesless for finding word vs non-character
an agglutinative language such as Japanese.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#12

Tom Lane

tgl@sss.pgh.pa.us

about 16 years ago

In reply to: Tatsuo Ishii (#11)

Re: pg_trgm

Tatsuo Ishii <ishii@sraoss.co.jp> writes:

Anyway locale is completely usesless for finding word vs non-character
an agglutinative language such as Japanese.

Well, that doesn't mean that the answer is to use C locale ;-)

However, you could possibly think about making this bit of code
more flexible:

#ifdef KEEPONLYALNUM
#define iswordchr(c) (t_isalpha(c) || t_isdigit(c))
#else
#define iswordchr(c) (!t_isspace(c))
#endif

Currently it seems to be hard-wired to the first case in standard
builds.

regards, tom lane

#13

Tatsuo Ishii

t-ishii@sra.co.jp

about 16 years ago

In reply to: Tom Lane (#12)

Re: pg_trgm

Well, that doesn't mean that the answer is to use C locale ;-)

Of course it's up to user whether to use C locale or not. I just want
pg_trgm work with C locale as well.

However, you could possibly think about making this bit of code
more flexible:

#ifdef KEEPONLYALNUM
#define iswordchr(c) (t_isalpha(c) || t_isdigit(c))
#else
#define iswordchr(c) (!t_isspace(c))
#endif

Currently it seems to be hard-wired to the first case in standard
builds.

Yup. Here is the patch in my mind:

*** trgm_op.c~	2009-06-11 23:48:51.000000000 +0900
--- trgm_op.c	2010-05-27 23:38:20.000000000 +0900
***************
*** 59,65 ****
  }

  #ifdef KEEPONLYALNUM
! #define iswordchr(c)	(t_isalpha(c) || t_isdigit(c))
  #else
  #define iswordchr(c)	(!t_isspace(c))
  #endif
--- 59,65 ----
  }

#ifdef KEEPONLYALNUM
! #define iswordchr(c) (t_isalpha(c) || t_isdigit(c) || (lc_ctype_is_c() && !t_isspace(c)))
#else
#define iswordchr(c) (!t_isspace(c))
#endif

#14

Tom Lane

tgl@sss.pgh.pa.us

about 16 years ago

In reply to: Tatsuo Ishii (#13)

Re: pg_trgm

Tatsuo Ishii <ishii@postgresql.org> writes:

! #define iswordchr(c) (t_isalpha(c) || t_isdigit(c) || (lc_ctype_is_c() && !t_isspace(c)))

This seems entirely arbitrary. It might "fix" things in your view
but it will break the longstanding behavior for other people.

I think a more appropriate type of fix would be to expose the
KEEPONLYALNUM option as a GUC, or some other way of letting the
user decide what he wants.

regards, tom lane

#15

Bruce Momjian

bruce@momjian.us

about 16 years ago

In reply to: Tom Lane (#14)

Re: pg_trgm

On Thu, May 27, 2010 at 3:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think a more appropriate type of fix would be to expose the
KEEPONLYALNUM option as a GUC, or some other way of letting the
user decide what he wants.

So I think a GUC is broken because pg_tgrm has a index opclasses and
any indexes built using one setting will be broken if the GUC is
changed.

Perhaps we need two sets of functions (which presumably call the same
implementation with a flag to indicate which definition to use). Then
you can define an index using one or the other and the meaning would
be stable.

--
greg

#16

Peter Eisentraut

peter_e@gmx.net

about 16 years ago

In reply to: Tatsuo Ishii (#11)

Re: pg_trgm

On tor, 2010-05-27 at 23:20 +0900, Tatsuo Ishii wrote:

Anyway locale is completely usesless for finding word vs non-character
an agglutinative language such as Japanese.

I don't know about Japanese, but the locale approach works just fine for
other agglutinative languages. I would rather suspect that it is the
trigram approach that might be rather useless for such languages,
because you are going to get a lot of similarity hits for the affixes.

#17

Tatsuo Ishii

t-ishii@sra.co.jp

about 16 years ago

In reply to: Peter Eisentraut (#16)

Re: pg_trgm

I don't know about Japanese, but the locale approach works just fine for
other agglutinative languages. I would rather suspect that it is the
trigram approach that might be rather useless for such languages,
because you are going to get a lot of similarity hits for the affixes.

I'm not sure what you mean by "affixes". But I will explain...

A Japanese sentence consists of words. Problem is, each word is not
separated by space (agglutinative). So most text tools such as text
search need preprocess which finds word boundaries by looking up
dictionaries (and smart grammer analysis routine). In the process
"affixes" can be determined and perhaps removed from the target word
group to be used for text search (note that removing affixes is no
relevant to locale). Once we get space separated sentence, it can be
processed by text search or by pg_trgm just same as Engligh. (Note
that these preprocessing are done outside PostgreSQL world). The
difference is just the "word" can be consists of non ASCII letters.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#18

Tatsuo Ishii

t-ishii@sra.co.jp

about 16 years ago

In reply to: Bruce Momjian (#15)

Re: pg_trgm

So I think a GUC is broken because pg_tgrm has a index opclasses and
any indexes built using one setting will be broken if the GUC is
changed.

Perhaps we need two sets of functions (which presumably call the same
implementation with a flag to indicate which definition to use). Then
you can define an index using one or the other and the meaning would
be stable.

It's worse. pg_trgm has another compile option "IGNORECASE" which
might affect index opclasses.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#19

Peter Eisentraut

peter_e@gmx.net

about 16 years ago

In reply to: Tatsuo Ishii (#17)

Re: pg_trgm

On fre, 2010-05-28 at 00:46 +0900, Tatsuo Ishii wrote:

I don't know about Japanese, but the locale approach works just fine for
other agglutinative languages. I would rather suspect that it is the
trigram approach that might be rather useless for such languages,
because you are going to get a lot of similarity hits for the affixes.

I'm not sure what you mean by "affixes". But I will explain...

A Japanese sentence consists of words. Problem is, each word is not
separated by space (agglutinative). So most text tools such as text
search need preprocess which finds word boundaries by looking up
dictionaries (and smart grammer analysis routine). In the process
"affixes" can be determined and perhaps removed from the target word
group to be used for text search (note that removing affixes is no
relevant to locale). Once we get space separated sentence, it can be
processed by text search or by pg_trgm just same as Engligh. (Note
that these preprocessing are done outside PostgreSQL world). The
difference is just the "word" can be consists of non ASCII letters.

I think the problem at hand has nothing at all to do with agglutination
or CJK-specific issues. You will get the same problem with other
languages *if* you set a locale that does not adequately support the
characters in use. E.g., Russian with locale C and encoding UTF8:

select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E
\u043D\u044B');
similarity
────────────
NaN
(1 row)

#20

Robert Haas

robertmhaas@gmail.com

about 16 years ago

In reply to: Peter Eisentraut (#19)

Re: pg_trgm

On Thu, May 27, 2010 at 2:01 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

On fre, 2010-05-28 at 00:46 +0900, Tatsuo Ishii wrote:

I don't know about Japanese, but the locale approach works just fine for
other agglutinative languages. I would rather suspect that it is the
trigram approach that might be rather useless for such languages,
because you are going to get a lot of similarity hits for the affixes.

I'm not sure what you mean by "affixes". But I will explain...

A Japanese sentence consists of words. Problem is, each word is not
separated by space (agglutinative). So most text tools such as text
search need preprocess which finds word boundaries by looking up
dictionaries (and smart grammer analysis routine). In the process
"affixes" can be determined and perhaps removed from the target word
group to be used for text search (note that removing affixes is no
relevant to locale). Once we get space separated sentence, it can be
processed by text search or by pg_trgm just same as Engligh. (Note
that these preprocessing are done outside PostgreSQL world). The
difference is just the "word" can be consists of non ASCII letters.

I think the problem at hand has nothing at all to do with agglutination
or CJK-specific issues. You will get the same problem with other
languages *if* you set a locale that does not adequately support the
characters in use. E.g., Russian with locale C and encoding UTF8:

select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E
\u043D\u044B');
similarity
────────────
NaN
(1 row)

What I can't help wondering as I'm reading this discussion is -
Tatsuo-san said upthread that he has a problem with pg_trgm that he
does not have with full text search. So what is full text search
doing differently than pg_trgm?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#21

Tatsuo Ishii

t-ishii@sra.co.jp

about 16 years ago

In reply to: Peter Eisentraut (#19)

#22

Tatsuo Ishii

t-ishii@sra.co.jp

about 16 years ago

In reply to: Robert Haas (#20)

#23