pg_trgm: unicode string not working

Started by Sushant Sinha over 14 years ago, 9 messages
#1Sushant Sinha
sushant354@gmail.com

I am using pg_trgm for spelling correction as prescribed in the
documentation. But I see that it does not work for Unicode strings. The
database was initialized with utf8 encoding and the C locale.

Here is the table:
\d words
Table "public.words"
Column | Type | Modifiers
--------+---------+-----------
word | text |
ndoc | integer |
nentry | integer |
Indexes:
"words_idx" gin (word gin_trgm_ops)

Query: select word from words where word % 'कतद';

I get an error:

ERROR: GIN indexes do not support whole-index scans

Any idea what is wrong?

-Sushant.

#2Florian Pflug
fgp@phlo.org
In reply to: Sushant Sinha (#1)
Re: pg_trgm: unicode string not working

Hi

Next time, please post questions regarding the usage of postgres
to the -general list, not to -hackers. The purpose of -hackers is
to discuss the development of postgres proper, not the development
of applications using postgres.

On Jun 12, 2011, at 13:33, Sushant Sinha wrote:

I am using pg_trgm for spelling correction as prescribed in the
documentation. But I see that it does not work for Unicode strings. The
database was initialized with utf8 encoding and the C locale.

I think you need to use a locale (more precisely, a CTYPE) in which
'क', 'त', 'द' are considered to be alphanumeric.

You can specify the CTYPE when creating the database with
CREATE DATABASE ... LC_CTYPE = ...
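
As a rough sketch of what that could look like: the database name `words_db` and the locale name `en_US.UTF-8` below are assumptions for illustration only. The locales actually available depend on the server's operating system, and `TEMPLATE template0` is needed when the requested locale differs from that of the template database.

```sql
-- Hypothetical database name; the locale name is an assumption and must
-- exist on the server's operating system (check with `locale -a`).
CREATE DATABASE words_db
    TEMPLATE template0
    ENCODING 'UTF8'
    LC_COLLATE 'en_US.UTF-8'
    LC_CTYPE 'en_US.UTF-8';

-- Confirm which collation and CTYPE each database was created with.
SELECT datname, datcollate, datctype FROM pg_database;
```

The second query is only a sanity check: an existing database keeps the CTYPE it was created with, so the data has to be loaded into the new database.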

Here is the table:
\d words
Table "public.words"
Column | Type | Modifiers
--------+---------+-----------
word | text |
ndoc | integer |
nentry | integer |
Indexes:
"words_idx" gin (word gin_trgm_ops)

Query: select word from words where word % 'कतद';

I get an error:

ERROR: GIN indexes do not support whole-index scans

pg_trgm probably ignores non-alphanumeric characters during
comparison, so you end up with an empty search string, which
translates to a whole-index scan. Postgres up to 9.0 does
not support such scans for GIN indices.
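
One way to check this guess is pg_trgm's `show_trgm` function, which returns the trigrams extracted from a string. A small sketch, assuming the pg_trgm module is installed in the current database:

```sql
-- An empty array here would mean that no trigrams were extracted from
-- the query string under the database's current CTYPE.
SELECT show_trgm('कतद');

-- For comparison, a plain ASCII word yields a non-empty array.
SELECT show_trgm('cat');
```

If the first call returns an empty array, the `%` search has nothing to look up in the index, which would be consistent with the whole-index scan error.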

Note that this restriction was removed in postgres 9.1 which
is currently in beta. However, GIN indices must be re-created
with REINDEX after upgrading from 9.0 to leverage that
improvement.
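
A minimal sketch of that step, assuming the `words_idx` index from the table above. The catalog query is only one possible way to find the GIN indexes in a database:

```sql
-- List the GIN indexes in the current database.
SELECT n.nspname AS schema_name, c.relname AS index_name
FROM pg_class c
JOIN pg_am a ON a.oid = c.relam
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'i' AND a.amname = 'gin';

-- Rebuild the index from the table shown earlier in the thread.
REINDEX INDEX words_idx;
```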

best regards,
Florian Pflug

#3Robert Haas
robertmhaas@gmail.com
In reply to: Florian Pflug (#2)
Re: pg_trgm: unicode string not working

On Sun, Jun 12, 2011 at 8:40 AM, Florian Pflug <fgp@phlo.org> wrote:

Note that this restriction was removed in postgres 9.1 which
is currently in beta. However, GIN indices must be re-created
with REINDEX after upgrading from 9.0 to leverage that
improvement.

Does pg_upgrade know about this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#3)
Re: pg_trgm: unicode string not working

Robert Haas wrote:

On Sun, Jun 12, 2011 at 8:40 AM, Florian Pflug <fgp@phlo.org> wrote:

Note that this restriction was removed in postgres 9.1 which
is currently in beta. However, GIN indices must be re-created
with REINDEX after upgrading from 9.0 to leverage that
improvement.

Does pg_upgrade know about this?

No, it does not. Under what circumstances should I issue a suggestion
to reindex, and what should the text be?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#5Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#4)
Re: pg_trgm: unicode string not working

On Mon, Jun 13, 2011 at 7:47 PM, Bruce Momjian <bruce@momjian.us> wrote:

Robert Haas wrote:

On Sun, Jun 12, 2011 at 8:40 AM, Florian Pflug <fgp@phlo.org> wrote:

Note that this restriction was removed in postgres 9.1 which
is currently in beta. However, GIN indices must be re-created
with REINDEX after upgrading from 9.0 to leverage that
improvement.

Does pg_upgrade know about this?

No, it does not.  Under what circumstances should I issue a suggestion
to reindex, and what should the text be?

It sounds like GIN indexes need to be reindexed after upgrading from <
9.1 to >= 9.1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#5)
Re: pg_trgm: unicode string not working

Robert Haas wrote:

On Mon, Jun 13, 2011 at 7:47 PM, Bruce Momjian <bruce@momjian.us> wrote:

Robert Haas wrote:

On Sun, Jun 12, 2011 at 8:40 AM, Florian Pflug <fgp@phlo.org> wrote:

Note that this restriction was removed in postgres 9.1 which
is currently in beta. However, GIN indices must be re-created
with REINDEX after upgrading from 9.0 to leverage that
improvement.

Does pg_upgrade know about this?

No, it does not. Under what circumstances should I issue a suggestion
to reindex, and what should the text be?

It sounds like GIN indexes need to be reindexed after upgrading from <
9.1 to >= 9.1.

I already have some GIN tests I used for 8.3 to 8.4 so that is easy, but
is the reindex required or just suggested for features?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#5)
Re: pg_trgm: unicode string not working

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Jun 13, 2011 at 7:47 PM, Bruce Momjian <bruce@momjian.us> wrote:

No, it does not. Under what circumstances should I issue a suggestion
to reindex, and what should the text be?

It sounds like GIN indexes need to be reindexed after upgrading from <
9.1 to >= 9.1.

Only if you care whether they work for corner cases such as empty
arrays ... corner cases which didn't work before 9.1, so very likely
you don't care.

I'm not sure that pg_upgrade is a good vehicle for dispensing such
advice, anyway. At least in the Red Hat packaging, end users will never
read what it prints, unless maybe it fails outright and they're trying
to debug why.

regards, tom lane

#8Florian Pflug
fgp@phlo.org
In reply to: Tom Lane (#7)
Re: pg_trgm: unicode string not working

On Jun 14, 2011, at 07:15, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Jun 13, 2011 at 7:47 PM, Bruce Momjian <bruce@momjian.us> wrote:

No, it does not. Under what circumstances should I issue a suggestion
to reindex, and what should the text be?

It sounds like GIN indexes need to be reindexed after upgrading from <
9.1 to >= 9.1.

Only if you care whether they work for corner cases such as empty
arrays ... corner cases which didn't work before 9.1, so very likely
you don't care.

We also already say "To fix this, do REINDEX INDEX ... " in the errhint
of "old GIN indexes do not support whole-index scans nor searches for nulls".

best regards,
Florian Pflug

#9Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#7)
Re: pg_trgm: unicode string not working

On Tue, Jun 14, 2011 at 1:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm not sure that pg_upgrade is a good vehicle for dispensing such
advice, anyway.  At least in the Red Hat packaging, end users will never
read what it prints, unless maybe it fails outright and they're trying
to debug why.

In my experience to date, that happens 100% of the time. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company