Fixing the Turkish problem

Started by Tom Lane over 21 years ago, 6 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

We're sort of halfway there on coping with the Turkish-locale i-vs-I
problem. I'd like to finish the job for 7.5.

What we presently have is that identifier and keyword downcasing is done
without trusting tolower():

/*
 * SQL99 specifies Unicode-aware case normalization, which we don't yet
 * have the infrastructure for. Instead we use tolower() to provide a
 * locale-aware translation. However, there are some locales where this
 * is not right either (eg, Turkish may do strange things with 'i' and
 * 'I'). Our current compromise is to use tolower() for characters with
 * the high bit set, and use an ASCII-only downcasing for 7-bit
 * characters.
 */
for (i = 0; i < len; i++)
{
    unsigned char ch = (unsigned char) ident[i];

    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    else if (ch >= 0x80 && isupper(ch))
        ch = tolower(ch);
    result[i] = (char) ch;
}
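To try the loop above in isolation, it can be wrapped in a small function. This is only a sketch: the name sketch_downcase_ident and the malloc-based result buffer are assumptions made for illustration, not anything taken from the PostgreSQL sources.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (the name is an assumption, not a PostgreSQL API):
 * returns a malloc'd, NUL-terminated, downcased copy of ident[0..len-1],
 * or NULL if allocation fails. The caller frees the result.
 */
char *
sketch_downcase_ident(const char *ident, size_t len)
{
    char   *result;
    size_t  i;

    assert(ident != NULL);

    result = malloc(len + 1);
    if (result == NULL)
        return NULL;

    for (i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';                    /* 7-bit: locale is never consulted */
        else if (ch >= 0x80 && isupper(ch))
            ch = (unsigned char) tolower(ch);   /* high-bit: locale decides */
        result[i] = (char) ch;
    }
    result[len] = '\0';
    return result;
}
```

Because the first branch handles 'A' through 'Z' arithmetically, an ASCII 'I' always maps to 'i' regardless of what the current locale's tolower() would return.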

AFAICS the remaining problem is that there are a bunch of places that
use strcasecmp() or strncasecmp() to match inputs against locally known
keywords (such as datestyle or timezone names). We need to make a
variant version of strcasecmp that uses this same style of case-folding.

What I'm thinking of doing is inventing "pg_strcasecmp" and
"pg_strncasecmp" that act like the above and replacing all calls of the
standard library functions with these.
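A minimal sketch of what such comparison routines might look like, assuming they fold each byte with the same rule as the loop above. The sketch_ names are placeholders chosen here for illustration; this is not the code that was eventually committed.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Fold one byte: ASCII-only for 7-bit characters, tolower() for high-bit ones. */
static unsigned char
sketch_fold(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        return (unsigned char) (ch + ('a' - 'A'));
    if (ch >= 0x80 && isupper(ch))
        return (unsigned char) tolower(ch);
    return ch;
}

/* Case-insensitive comparison of two NUL-terminated strings. */
int
sketch_strcasecmp(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);

    for (;;)
    {
        unsigned char c1 = sketch_fold((unsigned char) *s1++);
        unsigned char c2 = sketch_fold((unsigned char) *s2++);

        if (c1 != c2)
            return (int) c1 - (int) c2;
        if (c1 == '\0')
            return 0;
    }
}

/* Same, but compares at most n bytes. */
int
sketch_strncasecmp(const char *s1, const char *s2, size_t n)
{
    assert(s1 != NULL && s2 != NULL);

    while (n-- > 0)
    {
        unsigned char c1 = sketch_fold((unsigned char) *s1++);
        unsigned char c2 = sketch_fold((unsigned char) *s2++);

        if (c1 != c2)
            return (int) c1 - (int) c2;
        if (c1 == '\0')
            break;
    }
    return 0;
}
```

With this folding rule, a keyword match such as sketch_strcasecmp("ISO", "iso") returns 0 no matter which locale is active, because 'I' never goes through tolower().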

The routines need to be available in client code (eg, psql) as well as
the backend, so I'm thinking of putting them into libpgport (src/port/).
Another possibility would be to associate them with the multibyte
character code, which is already imported into client code in places.

Any thoughts, objections?

regards, tom lane

#2 Devrim GUNDUZ
devrim@gunduz.org
In reply to: Tom Lane (#1)
Re: Fixing the Turkish problem

Hi,

On Thu, 6 May 2004, Tom Lane wrote:

> We're sort of halfway there on coping with the Turkish-locale i-vs-I
> problem. I'd like to finish the job for 7.5.

Cool!

<snip>

> AFAICS the remaining problem is that there are a bunch of places that
> use strcasecmp() or strncasecmp() to match inputs against locally known
> keywords (such as datestyle or timezone names). We need to make a
> variant version of strcasecmp that uses this same style of case-folding.
>
> What I'm thinking of doing is inventing "pg_strcasecmp" and
> "pg_strncasecmp" that act like the above and replacing all calls of the
> standard library functions with these.

If you can post all the patches you'd like to apply, I'd be happy to test
them. (Sorry for the very late response, btw.)

Regards,

--
Devrim GUNDUZ
devrim~gunduz.org devrim.gunduz~linux.org.tr
http://www.TDMSoft.com
http://www.gunduz.org

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Devrim GUNDUZ (#2)
Re: Fixing the Turkish problem

Devrim GUNDUZ <devrim@gunduz.org> writes:

> On Thu, 6 May 2004, Tom Lane wrote:
>> What I'm thinking of doing is inventing "pg_strcasecmp" and
>> "pg_strncasecmp" that act like the above and replacing all calls of the
>> standard library functions with these.
>
> If you can post all the patches you'd like to apply, I'd be happy to test
> them. (Sorry for the very late response, btw.)

The patches are in; please give CVS tip a shot and see what you think.
It passed regression tests in a Turkish locale for me.

regards, tom lane

#4 Devrim GUNDUZ
devrim@gunduz.org
In reply to: Tom Lane (#3)
Re: Fixing the Turkish problem

Hi,

On Sun, 23 May 2004, Tom Lane wrote:

>>> "pg_strncasecmp" that act like the above and replacing all calls of the
>>> standard library functions with these.
>>
>> If you can post all the patches you'd like to apply, I'd be happy to test
>> them. (Sorry for the very late response, btw.)
>
> The patches are in; please give CVS tip a shot and see what you think.
> It passed regression tests in a Turkish locale for me.

Yes, it solves the initdb bug #1133. Thanks.

However, we still fail to sort small I (i dotless) and i. "i dotless"
comes before i in the Turkish alphabet, but ORDER BY sorts i before "i
dotless".

I would post a sample, but I'm not sure that anyone on the list could view
it :)

Regards,
--
Devrim GUNDUZ

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Devrim GUNDUZ (#4)
Re: Fixing the Turkish problem

Devrim GUNDUZ <devrim@gunduz.org> writes:

> However, we still fail to sort small I (i dotless) and i. "i dotless"
> comes before i in the Turkish alphabet, but ORDER BY sorts i before "i
> dotless".

For that, you have to complain to your locale's designer. We just do
what strcoll tells us to.
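As an illustration of what deferring to strcoll means, here is a minimal sketch of sorting an array of strings with qsort() and a strcoll()-based comparator. The sketch_ names are made up for this example and are not PostgreSQL functions. Note that a C program starts in the "C" locale, so the caller would need something like setlocale(LC_COLLATE, "") beforehand for the environment's locale rules to apply.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: the ordering is entirely whatever strcoll() reports. */
static int
sketch_coll_cmp(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/* Sort n strings in place according to the current LC_COLLATE locale. */
void
sketch_sort_strings(const char **items, size_t n)
{
    assert(items != NULL);
    qsort(items, n, sizeof(items[0]), sketch_coll_cmp);
}
```

Nothing in the comparator knows about Turkish or any other alphabet. If the locale definition orders dotted i before dotless i, this sort will too, which is why the fix has to come from the locale data.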

regards, tom lane

#6 Devrim GUNDUZ
devrim@gunduz.org
In reply to: Tom Lane (#5)
Re: Fixing the Turkish problem

Hi,

On Sun, 23 May 2004, Tom Lane wrote:

>> However, we still fail to sort small I (i dotless) and i. "i dotless"
>> comes before i in the Turkish alphabet, but ORDER BY sorts i before "i
>> dotless".
>
> For that, you have to complain to your locale's designer. We just do
> what strcoll tells us to.

Ok, complained :-) We've reported a bug before, and now the latest snapshot
of glibc contains the correct order for Turkish characters. initdb and
sorting on 7.5devel work perfectly.

I'm glad that the Turkish locale problems were solved before 7.5 was out.
Thanks Tom.

Regards,
--
Devrim GUNDUZ