pg full text search very slow for Chinese characters

Started by Jimmy Huang, over 6 years ago · 8 messages · general
#1Jimmy Huang
jimmy_huang@live.com

Hi Team,

Can anyone shed some light on why postgres 11 is extremely slow in my case?

I am making a mirror of zh.wikisource.org and I have downloaded 303049 pages and stored them in a postgres 11 database.

My postgres instance is based on docker image postgres:11 and runs on my MacBook Pro i7 16GB.

Database schema is as follows

Table pages(id, url, html, downloaded, inserted_at, updated_at) and books(id, name, info, preface, text, html, url, parent_id, inserted_at, updated_at, info_html, preface_html)

A wikisource web page is downloaded and its HTML text is inserted into table "pages", column "html".
Later, books.{name, info, preface, text, html, info_html, preface_html} are extracted from pages.html. The text column of books is a plain-text version of the content of the html column of table pages.

On average there are 7635 characters (each character is 3 bytes long because of UTF-8 encoding) in the text column of table books, and I want to add full text search to books(text).

I tried pg_trgm and my own customized token parser https://github.com/huangjimmy/pg_cjk_parser

To my surprise, postgres 11 is extremely slow when creating a full text index.

I added a column of tsvector type and tried to create an index on that column. Pg could not finish creating a GIN index for a long time and I had to cancel the execution.
I then tried to create a partial full text index for 500 rows, and it took postgres 2 to 3 minutes to create the index. Based on this estimate, pg will need at least one day to create a full GIN full text search index for 303049 rows of data. I think this is ridiculously slow.
If I try to create an fts index on books(name) or books(info), it takes just 3 minutes to create the index. However, name and info are extremely short compared to books(text).
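For reference, the setup described above would look roughly like this; the column and index names are illustrative, not taken from the actual schema, and 'simple' stands in for whatever text search configuration the custom CJK parser provides:

```sql
-- Add a tsvector column alongside the plain text.
ALTER TABLE books ADD COLUMN text_tsv tsvector;

-- Populate it; 'simple' is a placeholder configuration here.
UPDATE books SET text_tsv = to_tsvector('simple', text);

-- This is the step that ran for a long time and had to be cancelled.
CREATE INDEX books_text_tsv_idx ON books USING gin (text_tsv);
```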

I switched to Elasticsearch and it turned out that Elasticsearch is extremely efficient for my case. It took Elasticsearch 3 hours to index all 303049 rows.

Jimmy Huang
jimmy_huang@live.com

#2Cory Nemelka
cnemelka@gmail.com
In reply to: Jimmy Huang (#1)
Re: pg full text search very slow for Chinese characters

Well, there is a lot of information we would need to diagnose this. How
much tuning have you done, etc.?

My advice is pretty simple. Don't expect performance on a notebook and,
unless you are planning on hosting it on a notebook, use the notebook for
development only. Test performance on a properly configured and tuned
server.

--cnemelka

On Tue, Sep 10, 2019 at 9:53 AM Jimmy Huang <jimmy_huang@live.com> wrote:


#3Michael Lewis
mlewis@entrata.com
In reply to: Jimmy Huang (#1)
Re: pg full text search very slow for Chinese characters

My postgres instance is based on docker image postgres:11 and runs on my MacBook Pro i7 16GB.

How much RAM and such did you give to this VM?

To my surprise, postgres 11 is extremely slow when creating a full text index. I added a column of tsvector type and tried to create an index on that column. Pg could not finish creating a GIN index for a long time and I had to cancel the execution. I then tried to create a partial full text index for 500 rows and it took postgres 2 to 3 minutes to create the index.

Did you customize any config? maintenance_work_mem specifically would be
relevant to the time to create an index, and the default value is only 64MB.
Especially if you are running on a spinning hard drive and not an SSD, this
could be problematic.
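A minimal sketch of checking and raising the setting for the session that builds the index; 1GB is just an example value, to be sized against available RAM:

```sql
SHOW maintenance_work_mem;   -- default is 64MB

-- Raise it for the current session before running CREATE INDEX;
-- '1GB' is an example value, not a recommendation for every machine.
SET maintenance_work_mem = '1GB';
```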

#4Cory Nemelka
cnemelka@gmail.com
In reply to: Michael Lewis (#3)
Re: pg full text search very slow for Chinese characters

On Tue, Sep 10, 2019 at 10:11 AM Michael Lewis <mlewis@entrata.com> wrote:


I apologize for my top post. :D Won't happen again

#5Jimmy Huang
jimmy_huang@live.com
In reply to: Michael Lewis (#3)
Re: pg full text search very slow for Chinese characters

It is all default values.
I just checked maintenance_work_mem and indeed it is 64MB.

At first I gave 2GB ram to docker and later I increased ram to 4GB.
It did not make much difference when ram increased from 2GB to 4GB.

I will try increasing maintenance_work_mem and see if it helps.

From: Michael Lewis <mlewis@entrata.com>
Date: Wednesday, September 11, 2019, 12:11 AM
To: 黄 少君 <jimmy_huang@live.com>
Cc: "pgsql-general@lists.postgresql.org" <pgsql-general@lists.postgresql.org>
Subject: Re: pg full text search very slow for Chinese characters


#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jimmy Huang (#1)
Re: pg full text search very slow for Chinese characters

Jimmy Huang <jimmy_huang@live.com> writes:

I tried pg_trgm and my own customized token parser https://github.com/huangjimmy/pg_cjk_parser

pg_trgm is going to be fairly useless for indexing text that's mostly
multibyte characters, since its unit of indexable data is just 3 bytes
(not characters). I don't know of any comparable issue in the core
tsvector logic, though. The numbers you're quoting do sound quite awful,
but I share Cory's suspicion that it's something about your setup rather
than an inherent Postgres issue.

regards, tom lane
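Tom's point can be observed directly with pg_trgm's show_trgm() function on a short Chinese string; the exact trigrams returned depend on the pg_trgm version and database encoding, so no output is shown here:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Inspect the trigrams pg_trgm would index for mostly-multibyte text.
SELECT show_trgm('中文全文检索');
```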

#7Andreas Joseph Krogh
andreas@visena.com
In reply to: Tom Lane (#6)
Re: pg full text search very slow for Chinese characters

On Tuesday, 10 September 2019 at 18:21:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:

We experienced quite awful performance when we hosted the
DB on virtual servers (~5 years ago) and it turned out we hit the write-cache
limit (then 8GB), which resulted in ~1MB/s IO throughput. Running iozone might
help track down IO problems. --
Andreas Joseph Krogh

#8Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andreas Joseph Krogh (#7)
Re: pg full text search very slow for Chinese characters

Hi.

At Tue, 10 Sep 2019 18:42:26 +0200 (CEST), Andreas Joseph Krogh <andreas@visena.com> wrote in <VisenaEmail.3.8750116fce15432e.16d1c0b2b28@tc7-visena>


Multibyte characters also quickly bloat the index with many, many small
buckets, one for every 3-character combination drawn from thousands of
distinct characters, which makes it useless.

pg_bigm, which is based on bigrams (2-grams), works better for multibyte
characters.

https://pgbigm.osdn.jp/index_en.html
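Assuming pg_bigm is installed, a bigram index on books(text) would be created along these lines; the index name and query string are illustrative:

```sql
CREATE EXTENSION pg_bigm;

-- gin_bigm_ops indexes 2-character grams, which suits CJK text better
-- than 3-character/3-byte trigram units.
CREATE INDEX books_text_bigm_idx ON books USING gin (text gin_bigm_ops);

-- Queries can then use LIKE, which pg_bigm can accelerate:
SELECT id, name FROM books WHERE text LIKE '%示例%';
```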

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center