PostgreSQL Asian language support for full text search using ICU (and also updating pg_trgm)

Started by Chanon Sajjamanochai, almost 7 years ago. 2 messages. Lists: hackers, general

chanon.s@gmail.com

Hello,

Currently PostgreSQL doesn't support full text search natively for many
Asian languages such as Chinese, Japanese and others. These languages are
used by a large portion of the population of the world.

The two key modules that could be modified to support Asian languages are
the full text search module (including tsvector) and pg_trgm.

I would like to propose that this support be added to PostgreSQL.

For full text search, PostgreSQL could add a new parser (
https://www.postgresql.org/docs/9.2/textsearch-parsers.html) that
implements ICU word tokenization. This should be a lot easier than
before now that PostgreSQL itself already includes ICU dependencies for
other things.

Then allow the ICU parser to be chosen at run-time (via a run-time config
or an option to to_tsvector). That is all that is needed to support full
text search for many more Asian languages natively in PostgreSQL such as
Chinese, Japanese and Thai.

For example, Elasticsearch implements this using its ICU Tokenizer plugin:
https://www.elastic.co/guide/en/elasticsearch/guide/current/icu-tokenizer.html

Some information about the related ICU APIs is at:
http://userguide.icu-project.org/boundaryanalysis
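
To illustrate why a word-boundary tokenizer matters for text written
without spaces, here is a minimal Python sketch. It is not ICU and does not
call any ICU API: longest_match_tokens is a hypothetical toy segmenter
using greedy longest-match against a tiny hand-written dictionary, shown
only as a stand-in for dictionary-based word breaking. The sample sentence
and dictionary are assumptions made up for the example.

```python
def whitespace_tokens(text):
    """Split on whitespace only, as a space-delimited tokenizer would."""
    return text.split()


def longest_match_tokens(text, dictionary):
    """Toy greedy longest-match segmenter (illustrative, NOT ICU's algorithm).

    At each position, take the longest dictionary entry that matches;
    if nothing matches, emit a single character and move on.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical sample data for the illustration.
SAMPLE = "東京都に住む"
TOY_DICTIONARY = {"東京都", "東京", "に", "住む"}

if __name__ == "__main__":
    print(whitespace_tokens(SAMPLE))
    print(longest_match_tokens(SAMPLE, TOY_DICTIONARY))
```

The whitespace split leaves the whole sentence as one token, so a search
for a single word inside it cannot match. A real implementation would rely
on ICU's break iterator rather than a hand-made dictionary.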

Another simple improvement, which would provide a further option for
searching Asian-language text, is to add a run-time setting for pg_trgm
that tells it not to drop non-ASCII characters. Currently it only indexes
ASCII characters, so all Asian-language characters are dropped.
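
The effect described above can be sketched in Python. This is a simplified
model and not pg_trgm's actual code: the trigrams function and its
keep_non_ascii flag are hypothetical names for this example. It assumes the
documented pg_trgm convention that each word is lowercased and padded with
two leading spaces and one trailing space before trigrams are extracted.

```python
import re


def trigrams(text, keep_non_ascii):
    """Return the set of padded trigrams for each word in text.

    keep_non_ascii is a hypothetical switch standing in for the proposed
    run-time setting: when False, only ASCII letters and digits count as
    word characters, so everything else is dropped.
    """
    # [^\W_] matches any Unicode letter or digit (word chars minus "_").
    pattern = r"[^\W_]+" if keep_non_ascii else r"[A-Za-z0-9]+"
    result = set()
    for word in re.findall(pattern, text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


if __name__ == "__main__":
    print(sorted(trigrams("cat", keep_non_ascii=False)))
    print(sorted(trigrams("東京", keep_non_ascii=False)))
    print(sorted(trigrams("東京", keep_non_ascii=True)))
```

With the ASCII-only rule, the Japanese word yields no trigrams at all, so
there is nothing to index or compare. With non-ASCII characters kept, it
yields three trigrams, just as an ASCII word of the same length would.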

I emphasize 'run-time setting' because when using PostgreSQL via a
Database-as-a-Service provider, most of the time it is not possible
to change the config files, recompile sources, or add any new extensions.

PostgreSQL is an awesome project and probably the best RDBMS right now. I
hope the maintainers consider this suggestion.

Best Regards,
Chanon

Tatsuo Ishii

t-ishii@sra.co.jp

almost 7 years ago

In reply to: Chanon Sajjamanochai (#1)

Re: PostgreSQL Asian language support for full text search using ICU (and also updating pg_trgm)

[redirected to hackers list since I think this topic is related to
adding a new PostgreSQL feature.]

I think there's no doubt that it would be nice if PostgreSQL natively
supported Asian languages. As a first step, I briefly tested the ICU
tokenizer (ubrk_open and other functions) with Japanese, the only
Asian language I understand. The result was a little bit different
from that of the most popular Japanese tokenizer "MeCab" [1], but it
seems I can live with that as long as it's used for full text search
purposes. Of course more tests would be needed though.

In addition to the accuracy of tokenizing, performance is of course
important. This needs more work.

I think the same studies would be needed for other Asian languages. I
hope someone who is familiar with other Asian languages volunteers to
do the task.

[1]: https://taku910.github.io/mecab/

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

From: Chanon Sajjamanochai <chanon.s@gmail.com>
Subject: PostgreSQL Asian language support for full text search using ICU (and also updating pg_trgm)
Date: Wed, 1 May 2019 08:55:50 +0700
Message-ID: <CAEV3FNPU8hU_hi=0+QNAbEkc-uO8-K9PB3aAChdmcCyPfWX6rg@mail.gmail.com>
