another locale problem

Started by Daniel Kalchevover 26 years ago16 messages
#1Daniel Kalchev
daniel@digsys.bg

While playing with --enable-locale and the like expressions I came across this
phenomenon ('somestring' is in cyrillic - last 64 codes from 0xa0 to 0xff)

SELECT key FROM t WHERE key ~* 'somestring'

returns the correct answers, properly matching upper/lower case characters -
this with --enable-locale, --with-mb=WIN and commenting out the USE_LOCALE
restrictions in gram.y. Explain shows Sequential scan, as expected...

However

SELECT key FROM t WHERE key ~* '^somestring'

returns no tuples and explain says it will use the index on key. Why is this?

If 'somestring' is ASCII characters, explain always gives sequential scan and
in both cases returns the proper results.

I am willing to do some testing if anyone has ideas how to patch the regex
code (I hate it :-).

Regards,
Daniel

#2Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Daniel Kalchev (#1)
Re: [HACKERS] another locale problem

While playing with --enable-locale and the like expressions I came across this
phenomenon ('somestring' is in cyrillic - last 64 codes from 0xa0 to 0xff)

SELECT key FROM t WHERE key ~* 'somestring'

returns the correct answers, properly matching upper/lower case characters -
this with --enable-locale, --with-mb=WIN and commenting out the USE_LOCALE
restrictions in gram.y. Explain shows Sequential scan, as expected...

However

SELECT key FROM t WHERE key ~* '^somestring'

returns no tuples and explain says it will use the index on key. Why is this?

That's strange. It should be seq scan in this case?

If 'somestring' is ASCII characters, explain always gives sequential scan and
in both cases returns the proper results.

I am willing to do some testing if anyone has ideas how to patch the regex
code (I hate it :-).

Can you test following case:

SELECT key FROM t WHERE key ~* '^Xsomestring'

where X is one of an ASCII character.
---
Tatsuo Ishii

#3Daniel Kalchev
daniel@digsys.bg
In reply to: Tatsuo Ishii (#2)
Re: [HACKERS] another locale problem

Tatsuo Ishii said:

SELECT key FROM t WHERE key ~* '^somestring'

returns no tuples and explain says it will use the index on key. Why is th

is?

That's strange. It should be seq scan in this case?

I forgot to mention, that if 'somestring' is all uppercase, everything works
(the key field in the table is all uppercase). It still says index scan will
be used.

To summarize the problem. If key contains (equivalent cyrillic letters) 'ABC',
'ABCD', 'DAB' and 'ABX' and the query is:

SELECT key FROM t WHERE key ~* '^AB';

index scan will be used and the correct tuples ('ABC', 'ABCD' and 'ABX') will
be returned. If the query is

SELECT key FROM t WHERE key ~* '^ab';

index scan will be used and no tuples will be returned. With the query

SELECT key FROM t WHERE key ~* 'ab';

sequential scan will be used and the correct tuples will be returned (all of
the above).

Can you test following case:

SELECT key FROM t WHERE key ~* '^Xsomestring'

where X is one of an ASCII character.

Explain says it will use sequential scan and if I insert proper key in the
table it will be returned.

Daniel

#4Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Daniel Kalchev (#3)
Re: [HACKERS] another locale problem

SELECT key FROM t WHERE key ~* '^somestring'

returns no tuples and explain says it will use the index on key. Why is th

is?

That's strange. It should be seq scan in this case?

I forgot to mention, that if 'somestring' is all uppercase, everything works
(the key field in the table is all uppercase). It still says index scan will
be used.

Hmm... I am totally confused by this. Sould be a bug.

To summarize the problem. If key contains (equivalent cyrillic letters) 'ABC',
'ABCD', 'DAB' and 'ABX' and the query is:

SELECT key FROM t WHERE key ~* '^AB';

index scan will be used and the correct tuples ('ABC', 'ABCD' and 'ABX') will
be returned. If the query is

SELECT key FROM t WHERE key ~* '^ab';

index scan will be used and no tuples will be returned. With the query

Should be bug.

SELECT key FROM t WHERE key ~* 'ab';

sequential scan will be used and the correct tuples will be returned (all of
the above).

Seems correct result.

Can you test following case:

SELECT key FROM t WHERE key ~* '^Xsomestring'

where X is one of an ASCII character.

Explain says it will use sequential scan and if I insert proper key in the
table it will be returned.

Expected result.

From line 5388 of parser/gram.y:

(strcmp(opname,"~*") == 0 && isalpha(n->val.val.str[pos])))

I suspect isalpha() does not return true if n->val.val.str[pos] is a
non ascii char. Probably that's why

SELECT key FROM t WHERE key ~* '^somestring'

does not work.

If its argument is a cyrillic char and cyrillic locale enabled then
isalpha() should return true. Can you check this?

Or:
(strcmp(opname,"~*")
== 0 && isalpha((unsigned char)n->val.val.str[pos])))

works for you?
--
Tatsuo Ishii

#5Daniel Kalchev
daniel@digsys.bg
In reply to: Tatsuo Ishii (#4)
Re: [HACKERS] another locale problem

Tatsuo Ishii said:

[...]

From line 5388 of parser/gram.y:

(strcmp(opname,"~*") == 0 && is

alpha(n->val.val.str[pos])))

I suspect isalpha() does not return true if n->val.val.str[pos] is a
non ascii char. Probably that's why

[...]

I have tried the following program under BSD/OS 4.0.1:

--- cut here ---
#include <string.h>
#include <locale.h>
#include <rune.h>
#include <errno.h>

main()
{
int i;

     printf ("%s\n", setlocale(LC_CTYPE, "CP1251"));
     printf ("--- strcoll ---\n");
     printf ("%d\n", strcoll("���", "���"));
     printf ("%d\n", strcoll("���", "���"));
     printf ("%d\n", strcoll("���", "���"));
     printf ("--- isalpha ---\n");
     i = '�';
     printf ("%c, %d\n", i, i);
     printf ("%d\n", isalpha('�'));
     printf ("%d\n", isalpha('a'));
}
--- cut here ---

Where the cyrillic letter is the 'L' lowercase in cyrillic, the other are
first characters from the alphabet....

The strcoll work always. This is because BSD/OS does NOT support the
LC_COLLATE and strcoll is effectively strcmp (confirmed by the libc sources).

There are two cases:

1. cc -o test test.c

Returns 0 for isalpha() on cyrillic characers.

2. cc -funsigned-char -o test test.c

Returns 1 for isalpha() on cyrillic characters!

If I substitute the '�' character constant with it's code (235), isalpha()
returns 1.

BSD/OS 4.0.1 uses gcc 2.7.2.1. Apparently somewhere in the code 'char' is used
(I did use both --enable-locale and --with-mb=WIN). I am still searching for
it, but the regex code is not very readable... :-(

Daniel

#6Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Daniel Kalchev (#5)
Re: [HACKERS] another locale problem

Ok. Now I almost certain that:

(strcmp(opname,"~*")
== 0 && isalpha((unsigned char)n->val.val.str[pos])))

should work for you. Can you confirm this?
--
Tatsuo Ishii

#7Angelos Karageorgiou
angelos@awesome.incredible.com
In reply to: Tatsuo Ishii (#6)
Re: [HACKERS] another locale problem

On Fri, 11 Jun 1999, Tatsuo Ishii wrote:

Ok. Now I almost certain that:

(strcmp(opname,"~*")
== 0 && isalpha((unsigned char)n->val.val.str[pos])))

should work for you. Can you confirm this?
--
Tatsuo Ishii

It did for me , BSD/OS is veeeery finicky with the way it treats chars. It
upgrades them to ints and from then on isalpha isupper do not work.
I have tried with --unsigned-char gcc switch the result was the same, only
with a cast to (unsigned char) do all the char related functions work
as advertised.

I am sorry to say I have not had any input from other internationals using
BSDI with a non-english character set.

--
Incredible Networks LTD Angelos Karageorgiou
20 Karea st, +30.1.92.12.312 (voice)
116 36 Athens, Greece. +30.1.92.12.314 (fax)
http://www.incredible.com angelos@incredible.com (e-mail)

#8Daniel Kalchev
daniel@digsys.bg
In reply to: Tatsuo Ishii (#6)
Re: [HACKERS] another locale problem

This worked, however it made all selects of ~* '^sometext' sequential.

Why can't we use index searches in this case? I believe it is :-)

But this

SELECT key FROM t WHERE key ~* '^ sometext';

(note the space after the carret) still does not work! It says index scan will
be used and finds nothng. If 'sometext' is all uppercase the proper result is
returned. The problem seems to be in the regex handling anyway. I tried
compiling regcomp.c with -funsigned-char, but that was not enough. Further
hacking...

Daniel

Show quoted text

Tatsuo Ishii said:

Ok. Now I almost certain that:

(strcmp(opname,"~*")
== 0 && isalpha((unsigned char)n->val.val.str[pos])))

should work for you. Can you confirm this?
--
Tatsuo Ishii

#9Daniel Kalchev
daniel@digsys.bg
In reply to: Angelos Karageorgiou (#7)
Re: [HACKERS] another locale problem

This is not because of BSD/OS, but because of gcc. I found some 'char'
references in the regex routines and am working on that currently.

I can confirm that other software on BSD/OS works correctly with locales, such
as vi, less etc. Apparently we need unsigned char in order to do anything
sensible with char values over 127.

Daniel

Show quoted text

Angelos Karageorgiou said:

(strcmp(opname,"~*")
== 0 && isalpha((unsigned char)n->val.val.str[pos])))

should work for you. Can you confirm this?
--
Tatsuo Ishii

It did for me , BSD/OS is veeeery finicky with the way it treats chars. It
upgrades them to ints and from then on isalpha isupper do not work.
I have tried with --unsigned-char gcc switch the result was the same, only
with a cast to (unsigned char) do all the char related functions work
as advertised.

I am sorry to say I have not had any input from other internationals using
BSDI with a non-english character set.

--
Incredible Networks LTD Angelos Karageorgiou
20 Karea st, +30.1.92.12.312 (voice)
116 36 Athens, Greece. +30.1.92.12.314 (fax)
http://www.incredible.com angelos@incredible.com (e-mail)

#10Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Daniel Kalchev (#9)
Re: [HACKERS] another locale problem

This worked, however it made all selects of ~* '^sometext' sequential.

That's correct behavior.

Why can't we use index searches in this case? I believe it is :-)

No. It's due to the nature of the Btree index.

But this

SELECT key FROM t WHERE key ~* '^ sometext';

(note the space after the carret) still does not work! It says index scan will
be used and finds nothng. If 'sometext' is all uppercase the proper result is
returned. The problem seems to be in the regex handling anyway. I tried
compiling regcomp.c with -funsigned-char, but that was not enough. Further
hacking...

Again, that should not be index scan.

BTW, if you want to play with regex, you might find retest.c be
useful. You can build the test tool by:

make retest

enjoy:-)
--
Tatsuo Ishii

#11Daniel Kalchev
daniel@digsys.bg
In reply to: Tatsuo Ishii (#10)
Re: [HACKERS] another locale problem

Tatsuo Ishii said:

This worked, however it made all selects of ~* '^sometext' sequential.

That's correct behavior.

Why can't we use index searches in this case? I believe it is :-)

No. It's due to the nature of the Btree index.

But why it did use index scan when the case of the field and the expression
matched and did find the correct results?

But this

SELECT key FROM t WHERE key ~* '^ sometext';

Again, that should not be index scan.

SELECT key FROM t WHERE key ~* '^ ';

(space only) uses index and works correctly.... weird!

BTW, if you want to play with regex, you might find retest.c be
useful. You can build the test tool by:

make retest

enjoy:-)

In a wild guess I changed all 'char' to 'unsigned char' in regcomp.c, with no
positive or negative results. Maybe there are other places in regex where this
should be done, such as regexec.c? Is there an regex guru over here ... :-)

My long-time wondering about regex in Postgres has always been - isn't there
something better than this old regex code that we can use?

Daniel

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Kalchev (#11)
Re: [HACKERS] another locale problem

Daniel Kalchev <daniel@digsys.bg> writes:

To summarize the problem. If key contains (equivalent cyrillic
letters) 'ABC', 'ABCD', 'DAB' and 'ABX' and the query is:

SELECT key FROM t WHERE key ~* '^AB';

index scan will be used and the correct tuples ('ABC', 'ABCD' and
'ABX') will be returned. If the query is

SELECT key FROM t WHERE key ~* '^ab';

index scan will be used and no tuples will be returned.

Hm. Is it possible that isalpha() is doing the wrong thing on your
machine? makeIndexable() currently assumes that isalpha() returns true
for any character that is subject to case conversion, but I wonder
whether that's a good enough test.

The other possibility is that regexp's internal handling of
case-insensitive matching is not right.

regards, tom lane

#13Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Daniel Kalchev (#3)
Re: [HACKERS] another locale problem

Tatsuo Ishii said:

SELECT key FROM t WHERE key ~* '^somestring'

returns no tuples and explain says it will use the index on key. Why is th

is?

That's strange. It should be seq scan in this case?

I forgot to mention, that if 'somestring' is all uppercase, everything works
(the key field in the table is all uppercase). It still says index scan will
be used.

To summarize the problem. If key contains (equivalent cyrillic letters) 'ABC',
'ABCD', 'DAB' and 'ABX' and the query is:

SELECT key FROM t WHERE key ~* '^AB';

Oops, forgot ~* is case-insensitive. The conditions we adde for this
require the query to be between > A and < a, which is not very
restrictive.

No real better way to do this.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#14Daniel Kalchev
daniel@digsys.bg
In reply to: Tom Lane (#12)
Re: [HACKERS] another locale problem

Tom Lane said:

Daniel Kalchev <daniel@digsys.bg> writes:

To summarize the problem. If key contains (equivalent cyrillic
letters) 'ABC', 'ABCD', 'DAB' and 'ABX' and the query is:

SELECT key FROM t WHERE key ~* '^AB';

index scan will be used and the correct tuples ('ABC', 'ABCD' and
'ABX') will be returned. If the query is

SELECT key FROM t WHERE key ~* '^ab';

index scan will be used and no tuples will be returned.

Hm. Is it possible that isalpha() is doing the wrong thing on your
machine? makeIndexable() currently assumes that isalpha() returns true
for any character that is subject to case conversion, but I wonder
whether that's a good enough test.

In fact, after giving it some though... the expression in gram.y

(strcmp(opname,"~*")
== 0 && isalpha(n->val.val.str[pos])))

is wrong. The statement in my view decides that a regular expression is not
indexable if it contains special characters or if it contains non-alpha
characters. Therefore, the statement should be written as:

(strcmp(opname,"~*")
== 0 && !isalpha((unsigned char)n->val.val.str[pos])))

(two fixes :) This makes indexes work for '^abc' (lowercase ASCII). But does
not find anything, which means regex does not work. It does not work for both
ASCII and non-ASCII text/patterns. :-(

The other possibility is that regexp's internal handling of
case-insensitive matching is not right.

I believe it to be terribly wrong, and some releases ago it worked with 8-bit
characters by just compiling it with -funsigned-char. Now this breaks things...

Daniel

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Kalchev (#14)
Re: [HACKERS] another locale problem

Daniel Kalchev <daniel@digsys.bg> writes:

In fact, after giving it some though... the expression in gram.y

(strcmp(opname,"~*")
== 0 && isalpha(n->val.val.str[pos])))

is wrong. The statement in my view decides that a regular expression is not
indexable if it contains special characters or if it contains non-alpha
characters. Therefore, the statement should be written as:

(strcmp(opname,"~*")
== 0 && !isalpha((unsigned char)n->val.val.str[pos])))

No, it's not wrong, at least not in that way! You've missed the point
entirely. ~* is the *case insensitive* regexp match operator.
Therefore if I have a pattern like '^abc' it can match anything
beginning with either 'a' or 'A'. If the index restriction were to
include the letter 'a' then it would exclude valid matches starting with
'A'. The simplest solution, which is what's in makeIndexable(), is
to exclude case-foldable characters from the index restriction pattern.
In this particular case you end up getting no index restriction at all,
but that is indeed what's supposed to happen.

I am not sure that isalpha() is an adequate test for case-foldable
characters in non-ASCII locales, but inverting it is definitely wrong ;-)

regards, tom lane

#16Daniel Kalchev
daniel@digsys.bg
In reply to: Tom Lane (#15)
Re: [HACKERS] another locale problem

Tom,

So you say that this check prevents the use of indexes, when we use the ~*
operator and have alpha characters in the pattern, because apparently the
index cannot do case insensitive matching.

I was under the (apparently wrong impression) that it was possible to use
indexes for case insensitive matching. Dreaming... :-)

For your information, isalpha() is the correct match for case foldable
characters, at least in the cp1251 (windows-1251) locale. I believe a more
correct test could be to access the locale's MAPLOWER and MAPUPPER tables.

It is not the case in Bulgarian, but there might be languages where an letter
does not exist in both upper and lower cases and therefore requires more
complex handling. Perhaps such situation exists in the multibyte locales.

Please excuse my confusion. :-)

Daniel

Tom Lane said:

Daniel Kalchev <daniel@digsys.bg> writes:

In fact, after giving it some though... the expression in gram.y

(strcmp(opname,"~*")
== 0 && isalpha(n->val.val.str[pos])))

is wrong. The statement in my view decides that a regular expression is no

t

Show quoted text

indexable if it contains special characters or if it contains non-alpha
characters. Therefore, the statement should be written as:

(strcmp(opname,"~*")
== 0 && !isalpha((unsigned char)n->val.val.str[pos])))

No, it's not wrong, at least not in that way! You've missed the point
entirely. ~* is the *case insensitive* regexp match operator.
Therefore if I have a pattern like '^abc' it can match anything
beginning with either 'a' or 'A'. If the index restriction were to
include the letter 'a' then it would exclude valid matches starting with
'A'. The simplest solution, which is what's in makeIndexable(), is
to exclude case-foldable characters from the index restriction pattern.
In this particular case you end up getting no index restriction at all,
but that is indeed what's supposed to happen.

I am not sure that isalpha() is an adequate test for case-foldable
characters in non-ASCII locales, but inverting it is definitely wrong ;-)

regards, tom lane