locale support

Started by Tatsuo Ishii, almost 25 years ago, 11 messages
#1Tatsuo Ishii
t-ishii@sra.co.jp

There is a serious problem with PostgreSQL's locale support on
certain platforms with certain locale combinations: ordering,
indexes, etc. are simply broken because strcoll() does not work. One
example combination is RedHat 6.2J (Japanese localized version) + the
ja_JP.eucJP locale. Here is a test program that exposes the problem.

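#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(void)
{
    /* stand-ins for two different Japanese strings */
    static char *s1 = "a Japanese string";
    static char *s2 = "another Japanese string";

    /* take the locale from the environment (LC_ALL, LC_COLLATE, LANG) */
    setlocale(LC_ALL, "");

    /* different strings should give nonzero results of opposite sign */
    printf("%d\n", strcoll(s1, s2));
    printf("%d\n", strcoll(s2, s1));
    return 0;
}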

This program prints 0 twice, which means strcoll() regards those two
different Japanese strings as the same!

I know this is not PostgreSQL's fault but the broken locale data on
certain platforms. The problem makes it impossible to use PostgreSQL
RPMs in Japan.

I'm looking for solutions/workarounds for this problem. Maybe we
should disable locale support at runtime if strcoll() does not work?
Comments?
--
Tatsuo Ishii
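
The runtime check floated above could look roughly like the following. This is only a sketch, not PostgreSQL code: the name safe_strcoll() is made up for illustration, and it assumes that a strcoll() result of 0 for strings that differ byte-wise is a symptom of broken locale data. Some locales may legitimately treat distinct strings as equal, so this is a tie-break rather than a real fix.

```c
#include <string.h>

/*
 * Hypothetical helper (not part of PostgreSQL): compare with strcoll(),
 * but distrust a result of 0 when the strings differ byte-wise. In that
 * case fall back to strcmp() so distinct strings never compare as equal.
 */
int
safe_strcoll(const char *a, const char *b)
{
    int result = strcoll(a, b);

    if (result == 0 && strcmp(a, b) != 0)
        return strcmp(a, b);    /* tie-break by byte order */
    return result;
}
```

With a tie-break like this, sorting and index lookups would at least stay self-consistent under a broken locale, although the resulting order would be byte order rather than the locale's intended collation.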

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#1)
Re: locale support

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

I know this is not PostgreSQL's fault but the broken locale data on
certain platforms. The problem makes it impossible to use PostgreSQL
RPMs in Japan.

I'm looking for solutions/workarounds for this problem.

Build a set of RPMs without locale support?

regards, tom lane

#3Noname
ncm@zembu.com
In reply to: Tom Lane (#2)
Re: locale support

On Mon, Feb 12, 2001 at 09:59:37PM -0500, Tom Lane wrote:

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

I know this is not PostgreSQL's fault but the broken locale data on
certain platforms. The problem makes it impossible to use PostgreSQL
RPMs in Japan.

I'm looking for solutions/workarounds for this problem.

Build a set of RPMs without locale support?

Run it with LC_ALL="C".

Nathan Myers
ncm@zembu.com

#4Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#2)
Re: locale support

I know this is not PostgreSQL's fault but the broken locale data on
certain platforms. The problem makes it impossible to use PostgreSQL
RPMs in Japan.

I'm looking for solutions/workarounds for this problem.

Build a set of RPMs without locale support?

Run it with LC_ALL="C".

Both of them seem not ideal solutions for RPM. It would be nice if we
could distribute single binary and start up file in RPM.
--
Tatsuo Ishii

#5Hannu Krosing
hannu@tm.ee
In reply to: Tatsuo Ishii (#1)
Re: locale support

Nathan Myers wrote:

On Mon, Feb 12, 2001 at 09:59:37PM -0500, Tom Lane wrote:

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

I know this is not PostgreSQL's fault but the broken locale data on
certain platforms. The problem makes it impossible to use PostgreSQL
RPMs in Japan.

I'm looking for solutions/workarounds for this problem.

Build a set of RPMs without locale support?

Run it with LC_ALL="C".

It would help if there were a sample working LC_ALL=xxx line in
/etc/rc.d/init.d/postgresql.

As it stands now, it is a real pita to get LC_xx settings down to the
real postmaster through all the layers (and guessing whether they took
effect after each restart ;)

---------
Hannu
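
To take the guessing out of whether a locale setting reached a process, a small helper can ask the C library which collation locale is actually in effect. The sketch below uses only standard setlocale() calls; the function name effective_collation() is an assumption made for illustration and nothing like it is claimed to exist in PostgreSQL.

```c
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: switch to the requested locale and report which
 * collation locale ended up in effect. Pass "" to take the locale from
 * the environment (LC_ALL, LC_COLLATE, LANG), as a locale-aware server
 * would at startup. Returns NULL if the C library rejects the request.
 */
const char *
effective_collation(const char *requested)
{
    if (setlocale(LC_ALL, requested) == NULL)
        return NULL;

    /* A NULL second argument queries the current setting without changing it */
    return setlocale(LC_COLLATE, NULL);
}
```

Calling effective_collation("") from a tiny program started through the same init script and printing the result would show what the environment really delivered at that point in the startup chain.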

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#4)
Re: locale support

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

I know this is not PostgreSQL's fault but the broken locale data on
certain platforms. The problem makes it impossible to use PostgreSQL
RPMs in Japan.

I'm looking for solutions/workarounds for this problem.

Build a set of RPMs without locale support?

Run it with LC_ALL="C".

Both of them seem not ideal solutions for RPM. It would be nice if we
could distribute single binary and start up file in RPM.

If you can find a non-intrusive way to do that, sure ... but I don't
think that we should expend any great amount of effort, nor uglify the
code, in order to cater to a demonstrably broken library on one
particular platform.

The LC_ALL answer seems the best to me.

regards, tom lane

#7Lamar Owen
lamar.owen@wgcr.org
In reply to: Tatsuo Ishii (#1)
Re: locale support

Tom Lane wrote:

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

I know this is not PostgreSQL's fault but the broken locale data on
certain platforms. The problem makes it impossible to use PostgreSQL
RPMs in Japan.
I'm looking for solutions/workarounds for this problem.

Build a set of RPMs without locale support?

Run it with LC_ALL="C".

Both of them seem not ideal solutions for RPM. It would be nice if we
could distribute single binary and start up file in RPM.

If you can find a non-intrusive way to do that, sure ... but I don't
think that we should expend any great amount of effort, nor uglify the
code, in order to cater to a demonstrably broken library on one
particular platform.

Tatsuo, what is LC_ALL (or the other locale envvars) set to when you run
the program? The man page for setlocale() on my machine documents that
the main() starts in C or POSIX locale mode by default. The call to
setlocale(LC_ALL, "") reads the envvars and sets the locale
accordingly. Maybe RedHat's 6.2J isn't setting up the locale properly
to begin with? See what /etc/sysconfig/i18n contains -- if it is empty
or doesn't exist, then locale is simply not set up. But you specifically
mention the particular locale....

Ok, what combinations _do_ work? We _know_ C or POSIX works -- but
which ones don't work, on RH >6.1? While I want to make sure that a
broken locale data set isn't used, I also want to make sure that a good
locale set isn't thrown out, either. Forcing to LC_COLLATE=C is
overkill, IMHO. And building without locale support doesn't work,
either, because, at least on RH 6.1, strncmp() is buggered to use the
locale's collation.

The real solution is for the vendors to fix their broken locales.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
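
The setlocale() behaviour described above, and a narrower alternative to running everything with LC_ALL="C", can be sketched as follows. This is an illustration under assumptions, not something proposed in the thread: the function name is hypothetical, and it simply takes every category from the environment and then pins only collation to "C".

```c
#include <locale.h>

/*
 * Hypothetical startup routine: a C program begins in the "C" locale,
 * so first adopt the environment's settings for every category, then
 * override only LC_COLLATE so that strcoll() compares by byte value.
 * Returns the collation locale now in effect, or NULL on failure.
 */
const char *
use_env_locale_with_c_collation(void)
{
    /* If this fails, the process simply stays in its previous locale */
    (void) setlocale(LC_ALL, "");

    return setlocale(LC_COLLATE, "C");
}
```

Other categories such as LC_CTYPE keep whatever the environment requested, so only sort order is affected. Whether that is acceptable depends on how much the locale's own collation matters to the installation.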

#8Peter Eisentraut
peter_e@gmx.net
In reply to: Lamar Owen (#7)
Re: locale support

Lamar Owen writes:

And building without locale support doesn't work, either, because, at
least on RH 6.1, strncmp() is buggered to use the locale's collation.

I don't think so. On RH 6.1, strncmp() is the same it's ever been:

int
strncmp (s1, s2, n)
     const char *s1;
     const char *s2;
     size_t n;
{
  unsigned reg_char c1 = '\0';
  unsigned reg_char c2 = '\0';

  if (n >= 4)
    {
      size_t n4 = n >> 2;
      do
        {
          c1 = (unsigned char) *s1++;
          c2 = (unsigned char) *s2++;
          if (c1 == '\0' || c1 != c2)
            return c1 - c2;
          c1 = (unsigned char) *s1++;
          c2 = (unsigned char) *s2++;
          if (c1 == '\0' || c1 != c2)
            return c1 - c2;
          c1 = (unsigned char) *s1++;
          c2 = (unsigned char) *s2++;
          if (c1 == '\0' || c1 != c2)
            return c1 - c2;
          c1 = (unsigned char) *s1++;
          c2 = (unsigned char) *s2++;
          if (c1 == '\0' || c1 != c2)
            return c1 - c2;
        } while (--n4 > 0);
      n &= 3;
    }

  while (n > 0)
    {
      c1 = (unsigned char) *s1++;
      c2 = (unsigned char) *s2++;
      if (c1 == '\0' || c1 != c2)
        return c1 - c2;
      n--;
    }

  return c1 - c2;
}

--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
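
The listing above compares raw unsigned bytes and never consults the locale. One way to check that on a given machine is a small test like the sketch below; the function name is made up for illustration, and a passing result only describes the C library it is run against, not the RH 6.1 installation discussed here.

```c
#include <locale.h>
#include <string.h>

/*
 * Hypothetical check: returns 1 if strncmp() orders the two strings the
 * same way (same sign) under the "C" locale and under the locale named
 * by the environment, 0 otherwise.
 */
int
strncmp_ignores_locale(const char *a, const char *b, size_t n)
{
    int in_c, in_env;

    setlocale(LC_ALL, "C");
    in_c = strncmp(a, b, n);

    /* This call may fail; strncmp() should be unaffected either way */
    setlocale(LC_ALL, "");
    in_env = strncmp(a, b, n);

    /* Compare only the signs, since the magnitudes are unspecified */
    return ((in_c > 0) - (in_c < 0)) == ((in_env > 0) - (in_env < 0));
}
```

Running this with strings that a locale would collate differently from byte order, for example "B" and "a", is the interesting case: strcoll() may change its answer with the locale, while strncmp() should not.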

#9Lamar Owen
lamar.owen@wgcr.org
In reply to: Peter Eisentraut (#8)
Re: locale support

Peter Eisentraut wrote:

Lamar Owen writes:

And building without locale support doesn't work, either, because, at
least on RH 6.1, strncmp() is buggered to use the locale's collation.

I don't think so. On RH 6.1, strncmp() is the same it's ever been:

[snip]

Is that the code after any glibc RPM patches are applied? 'Pristine
source, perhaps -- but patch like crazy!' Reference the classic
'Reflections on Trusting Trust' by Ken Thompson (which you have probably
read already, but, for those on-list who may not have read this classic
work on security, you can find the paper at
http://www.acm.org/classics/sep95/). Although reading the glibc spec
file indicates that patching isn't done in the 'conventional' manner
here. (Lovely).

I base my assertion on running test queries on a RedHat 6.1 box over a
year ago, using the non-locale 6.5.3 RPMset I distributed at that point
(I distributed non-locale RPMs because of its speed being greater in
indexing, etc). The user who was having difficulties also tried the
non-locale RPMset -- and no change, until removing /etc/sysconfig/i18n.
I've referenced the thread before in the archives; see the message
http://www.postgresql.org/mhonarc/pgsql-hackers/1999-12/msg00678.html
for the middle of the thread.

But, of course, that was 6.5.3. If 7.x behaves differently, I wouldn't
know, as I've not built a 'non-locale' RPMset of 7.x. But, I can if
needed. Or try the test queries on your own RH 7 box, with a non-locale
build.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

#10Peter Eisentraut
peter_e@gmx.net
In reply to: Lamar Owen (#9)
Re: locale support

Lamar Owen writes:

I don't think so. On RH 6.1, strncmp() is the same it's ever been:

[snip]

Is that the code after any glibc RPM patches are applied?

Yes.

I base my assertion on running test queries on a RedHat 6.1 box over a
year ago, using the non-locale 6.5.3 RPMset I distributed at that point
(I distributed non-locale RPMs because of its speed being greater in
indexing, etc). The user who was having difficulties also tried the
non-locale RPMset -- and no change, until removing /etc/sysconfig/i18n.

I recall that thread, but the conclusion that was reached (that strncmp()
is at fault in some way) was never proved sufficiently.

--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/

#11Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Lamar Owen (#7)
Re: locale support

Tatsuo, what is LC_ALL (or the other locale envvars) set to when you run
the program? The man page for setlocale() on my machine documents that
the main() starts in C or POSIX locale mode by default. The call to
setlocale(LC_ALL, "") reads the envvars and sets the locale
accordingly. Maybe RedHat's 6.2J isn't setting up the locale properly
to begin with? See what /etc/sysconfig/i18n contains -- if it is empty
or doesn't exist, then locale is simply not set up. But you specifically
mention the particular locale....

It's "ja_JP.eucJP". That locale definitely exists, so I guess its
contents are broken...

Ok, what combinations _do_ work? We _know_ C or POSIX works -- but
which ones don't work, on RH >6.1? While I want to make sure that a
broken locale data set isn't used, I also want to make sure that a good
locale set isn't thrown out, either. Forcing to LC_COLLATE=C is
overkill, IMHO. And building without locale support doesn't work,

I guess most single-byte locales work. However, I seriously doubt that
locales for multibyte languages would work.

either, because, at least on RH 6.1, strncmp() is buggered to use the
locale's collation.

Really? I see PostgreSQL installations without the locale support work
just fine on RH 6.1J.

The real solution is for the vendors to fix their broken locales.

Of course.
--
Tatsuo Ishii