`pg_trgm` not recognizing Chinese characters in macOS

Started by Haotian Yang · over 7 years ago · 3 messages · bugs
#1 Haotian Yang
yangnw@live.com

Versions: macOS 10.13.6, PostgreSQL 10.5, pg_trgm 1.3.
LC_ALL=en_US.UTF-8

To reproduce:
- Enter psql as an admin user.
- Run `CREATE EXTENSION pg_trgm;`.
- Run `SELECT show_trgm('一个句子');`.

Expected:
- Something like `{0x…,0x…}`

Actual:
- `{}`

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Haotian Yang (#1)
Re: `pg_trgm` not recognizing Chinese characters in macOS

Haotian Yang <yangnw@live.com> writes:

> Versions: macOS 10.13.6, PostgreSQL 10.5, pg_trgm 1.3.
> LC_ALL=en_US.UTF-8

pg_trgm relies on libc's functions (specifically, iswalpha()) to determine
what is a word character or not. Unfortunately, the UTF8 locale support
in macOS is pretty incomplete, and I don't find it too surprising that
it's not recognizing Chinese characters as alphabetic. Now, you could
make a good argument that they *shouldn't* be considered alphabetic in
an en_US locale; but I'm unsure whether switching to a more appropriate
locale will help.

Anyway, I'd first try zh_CN.UTF-8, and if that doesn't fix it, the place
to complain is https://bugreport.apple.com/ ... I'm sure they know about
it already, but the number of reports has an impact on how fast they
fix things.

regards, tom lane
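
One way to see what libc reports on a particular machine is to call iswalpha() directly under each candidate locale. The following is only a minimal sketch, assuming a POSIX system where Python's ctypes can load the C library; `libc_iswalpha` is a helper name made up for this example, and the result for a Chinese character will differ between platforms and locales.

```python
import ctypes
import ctypes.util
import locale


def libc_iswalpha(ch, locale_name):
    """Ask libc's iswalpha() about one character under the given LC_CTYPE.

    Returns True or False, or None if the locale is not available here.
    Hypothetical helper for illustration; assumes a POSIX C library.
    """
    # find_library may return None, in which case CDLL(None) falls back to
    # the symbols already loaded into the process (which include libc).
    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    # Assumption: wint_t is 32 bits wide, as on Linux and macOS.
    libc.iswalpha.argtypes = [ctypes.c_uint]
    libc.iswalpha.restype = ctypes.c_int

    saved = locale.setlocale(locale.LC_CTYPE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
    except locale.Error:
        return None  # locale not installed on this system
    try:
        return bool(libc.iswalpha(ord(ch)))
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore the old setting


if __name__ == "__main__":
    for name in ("C", "en_US.UTF-8", "zh_CN.UTF-8"):
        print(name, libc_iswalpha("一", name))
```

If the character comes back as non-alphabetic under the locale the database uses, an empty `show_trgm` result for purely Chinese input would be consistent with the explanation above.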

#3 周正中 (德歌)
dege.zzz@alibaba-inc.com
In reply to: Tom Lane (#2)
Re: `pg_trgm` not recognizing Chinese characters in macOS

Make sure lc_ctype is not set to C. Compare a database with an en_US.UTF8 ctype against one created with lc_ctype='C':

```
postgres=# \l
                               List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges
-----------+----------+----------+------------+------------+-----------------------
 newdb     | postgres | UTF8     | en_US.UTF8 | en_US.UTF8 |
 postgres  | postgres | UTF8     | en_US.UTF8 | en_US.UTF8 |
 template0 | postgres | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
(4 rows)

postgres=# select show_trgm('hello你好');
                      show_trgm
------------------------------------------------------
 {0xcf7970,0xfe5170,0x114ebf,"  h"," he",ell,hel,llo}
(1 row)

postgres=# create database testdb with template template0 lc_ctype='C';
CREATE DATABASE
postgres=# \c testdb
You are now connected to database "testdb" as user "postgres".
testdb=# create extension pg_trgm;
CREATE EXTENSION
testdb=# select show_trgm('hello你好');
            show_trgm
---------------------------------
 {"  h"," he",ell,hel,llo,"lo "}
(1 row)
```
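
The difference between the two results can be imitated with a simplified trigram extractor whose word-character test is pluggable. This is only a rough sketch of the idea, not pg_trgm's actual C code: it does not hash multibyte trigrams into the `0x…` values shown above, and the two predicates (`ascii_only`, `unicode_aware`) are hypothetical stand-ins for what a C ctype and a ctype that accepts Chinese characters would answer.

```python
def word_trigrams(word):
    """Trigrams of one word, padded with two spaces in front and one behind."""
    padded = "  " + word + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigrams(text, is_word_char):
    """Split text into words using is_word_char, then collect all trigrams."""
    result, word = set(), ""
    for ch in text.lower():
        if is_word_char(ch):
            word += ch
        else:
            if word:
                result |= word_trigrams(word)
            word = ""  # a non-word character ends the current word
    if word:
        result |= word_trigrams(word)
    return result


def ascii_only(ch):
    # Stand-in for a ctype that only classifies ASCII letters and digits.
    return ch.isascii() and ch.isalnum()


def unicode_aware(ch):
    # Stand-in for a ctype that also treats Chinese characters as word characters.
    return ch.isalnum()


if __name__ == "__main__":
    print(sorted(trigrams("hello你好", ascii_only)))
    print(sorted(trigrams("hello你好", unicode_aware)))
    print(sorted(trigrams("一个句子", ascii_only)))
```

With `ascii_only` the Chinese characters act as word separators, so `hello你好` yields only the six trigrams of `hello` and `一个句子` yields nothing, which mirrors the `lc_ctype='C'` output and the `{}` from the original report. With `unicode_aware` the whole string is one word and three more trigrams appear, corresponding to the three hashed entries in the first result.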
------------------------------------------------------------------
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Tuesday, September 11, 2018, 21:20
To: Haotian Yang <yangnw@live.com>
Cc: pgsql-bugs@postgresql.org <pgsql-bugs@postgresql.org>
Subject: Re: `pg_trgm` not recognizing Chinese characters in macOS
