BUG #15014: pg_trgm regexp with wchar not good?

Started by PG Bug reporting formabout 8 years ago2 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 15014
Logged by: Zhou Digoal
Email address: digoal@126.com
PostgreSQL version: 10.1
Operating system: CentOS 7.x.x64
Description:

HI,

when i use pg_trgm's gin index, with wchar search, it's not good for regexp,
but good for like express.

```
postgres=# \l
                                 List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access
privileges   
-----------+----------+----------+------------+------------+-----------------------
 postgres  | postgres | UTF8     | en_US.UTF8 | en_US.UTF8 | 
 template0 | postgres | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/postgres    
     +
           |          |          |            |            |
postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/postgres    
     +

create or replace function gen_hanzi(int) returns text as $$
declare
res text;
begin
if $1 >=1 then
select string_agg(chr(19968+(random()*20901)::int), '') into res from
generate_series(1,$1);
return res;
end if;
return null;
end;
$$ language plpgsql strict;

postgres=# create table test(id int, info text);
CREATE TABLE
postgres=# insert into test select generate_series(1,100000),
gen_hanzi(100);
INSERT 0 100000
postgres=# create index idx_test_1 on test using gin (info gin_trgm_ops);
CREATE INDEX

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test
where info ~ '婐绷乂畳';
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=45261409.28..45261421.30 rows=10
width=36) (actual time=583.810..816.503 rows=1 loops=1)
Output: id, info
Recheck Cond: (test.info ~ '婐绷乂畳'::text)
Rows Removed by Index Recheck: 99999
Heap Blocks: exact=4167
Buffers: shared hit=59783
-> Bitmap Index Scan on idx_test_1 (cost=0.00..45261409.28 rows=10
width=0) (actual time=583.237..583.237 rows=100000 loops=1)
Index Cond: (test.info ~ '婐绷乂畳'::text)
Buffers: shared hit=55616
Planning time: 0.150 ms
Execution time: 816.545 ms
(11 rows)

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test
where info ~ '123';
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=39.40..2897.60 rows=4000 width=36)
(actual time=0.046..0.046 rows=0 loops=1)
Output: id, info
Recheck Cond: (test.info ~ '123'::text)
Buffers: shared hit=4
-> Bitmap Index Scan on idx_test_1 (cost=0.00..38.40 rows=4000 width=0)
(actual time=0.043..0.043 rows=0 loops=1)
Index Cond: (test.info ~ '123'::text)
Buffers: shared hit=4
Planning time: 0.146 ms
Execution time: 0.072 ms
(9 rows)
```

best regards,
digoal

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: PG Bug reporting form (#1)
Re: BUG #15014: pg_trgm regexp with wchar not good?

=?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes:

when i use pg_trgm's gin index, with wchar search, it's not good for regexp,
but good for like express.

pg_trgm is going to ignore characters that it doesn't think are letters or
digits. Don't know if the characters you are working with are considered
letters in en_US locale, but if they aren't, that would likely result in
no usable trigrams in this string. Another issue is that "trigrams" are
three *bytes* not three characters, so the useful information per trigram
is a lot lower when working with many-byte characters; that could also
lead to an index search being much less selective than you'd hope.

You might learn something by looking at the result of show_trgm() for
these strings, but I'm thinking there's no bug here, just design
limitations of the trigram approach.

regards, tom lane