Comparing strings with non-ASCII characters

Started by Artur Ratajover 23 years ago5 messagesgeneral
Jump to latest
#1Artur Rataj
arataj@iitis.gliwice.pl

Hi,

Is it possible in Postgres to use non-ASCII characters and have enabled
index optimizations for the ~ operator? Would it work correctly with
locale set to C for expressions of the form ~ '^string.*'? Or is there a
way to substitue expressions like this with two comparisons? I have tried
to do it like that >= 'string' and <= 'string_' where _ would be a
character sorted after any other used in the searched table, but I did
not found such a character in the locale pl_PL. Anyway, why the index
optimizations are disabled for the ~ operator and not for the < or >
operators in the locale? Is not there a similar problem with all of the
three operators in a non-ASCII locale? Could not the lack of index
optimizations be fixed in Postgres for locales like pl_PL with a simple
reordering of characters for a collator, at least to get the < and >
operators working right with index optimizations in such locales?

Best regards,
Artur Rataj

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Artur Rataj (#1)
Re: Comparing strings with non-ASCII characters

Artur Rataj <arataj@iitis.gliwice.pl> writes:

Anyway, why the index optimizations are disabled for the ~ operator
and not for the < or > operators in the locale?

The < and > operators are compatible with the index ordering, by
definition. The trouble with ~ is that it is not doing ordering,
but pattern matching. You can exploit an index to narrow down
the candidates for a left-anchored pattern only when the index
ordering is strict lexicographic ... and in all too many non-C
locales, it ain't. Digraphs, multipass sorting rules, and things
like that are killers.

We have tried and failed (repeatedly) to find a way to use non-C
indexes for LIKE and ~. If you'd like to try again, you're more
than welcome, but I suggest you read the PG list archives to learn
the reasons why previous attempts didn't work.

The long-run answer to this will probably be to allow individual columns
to be declared as being of particular locales; then you could assign C
locale to a column you need to do pattern matching on. This seems to
require building our own locale library :-( ... so it's not going to
happen quickly.

BTW, you might think about using full-text search (contrib/tsearch
or contrib/fulltextindex) instead of pattern matching.

regards, tom lane

#3Artur Rataj
arataj@iitis.gliwice.pl
In reply to: Artur Rataj (#1)
Re: Comparing strings with non-ASCII characters

I have set locale to C, made initdb and started postgres, and the index is
not used again.

explain select idescr.string from isbibc_fti_descr idescr where
idescr.string ~ '^string.*';
NOTICE: QUERY PLAN:

Seq Scan on isbibc_fti_descr idescr (cost=0.00..63910.04 rows=17033
width=44)

On Wed, 13 Nov 2002, Artur Rataj wrote:

Hi,

Is it possible in Postgres to use non-ASCII characters and have enabled
index optimizations for the ~ operator? Would it work correctly with
locale set to C for expressions of the form ~ '^string.*'? Or is there a
way to substitue expressions like this with two comparisons? I have tried
to do it like that >= 'string' and <= 'string_' where _ would be a
character sorted after any other used in the searched table, but I did
not found such a character in the locale pl_PL. Anyway, why the index
optimizations are disabled for the ~ operator and not for the < or >
operators in the locale? Is not there a similar problem with all of the
three operators in a non-ASCII locale? Could not the lack of index
optimizations be fixed in Postgres for locales like pl_PL with a simple
reordering of characters for a collator, at least to get the < and >
operators working right with index optimizations in such locales?

Best regards,
Artur Rataj

#4Artur Rataj
arataj@iitis.gliwice.pl
In reply to: Tom Lane (#2)
Re: Comparing strings with non-ASCII characters

On Wed, 13 Nov 2002, Tom Lane wrote:

BTW, you might think about using full-text search (contrib/tsearch
or contrib/fulltextindex) instead of pattern matching.

Thank you. I needed the ~ operator to match entries in a table being a
full text index itself. Now I use C locale and rows are sorted outside
Postgres, the index now is used with the expressions of the form
~ '^string.*', but does such expressions work correctly in C locale with
non-ASCII characters?

Regards,
Artur Rataj

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Artur Rataj (#4)
Re: Comparing strings with non-ASCII characters

Artur Rataj <arataj@iitis.gliwice.pl> writes:

Thank you. I needed the ~ operator to match entries in a table being a
full text index itself. Now I use C locale and rows are sorted outside
Postgres, the index now is used with the expressions of the form
~ '^string.*', but does such expressions work correctly in C locale with
non-ASCII characters?

Don't see why they wouldn't --- in C locale, bytes are bytes. But you
should experiment before trusting that opinion ...

regards, tom lane