OK, that's one LOCALE bug report too many...
... and I am not going to allow 7.1 to go out without a fix for this
class of problems. I'm fed up ;-)
As near as I can tell from the setlocale() man page, the only locale
categories that are really hazardous for us are LC_COLLATE and LC_CTYPE;
the other categories like LC_MONETARY affect only I/O routines, not
sort ordering, and so cannot result in corrupt indices.
I propose, therefore, that in an --enable-locale installation, initdb
should save its values for LC_COLLATE and LC_CTYPE in pg_control, and
backend startup should restore these settings from pg_control. Other
locale categories will continue to be acquired from the postmaster
environment. This will eliminate the class of bugs associated with
index corruption from not always starting the postmaster with the same
locale settings, while not forcing people to do an initdb to change
harmless settings.
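A minimal C sketch of the save-and-restore idea, for illustration only. The `SavedLocale` struct is a hypothetical stand-in for whatever fields would be added to pg_control, and the function names and buffer size are assumptions, not existing backend code.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

#define LOCALE_NAME_MAX 64      /* illustrative size, not taken from pg_control */

/* Hypothetical stand-in for the two fields that would live in pg_control. */
typedef struct SavedLocale
{
    char lc_collate[LOCALE_NAME_MAX];
    char lc_ctype[LOCALE_NAME_MAX];
} SavedLocale;

/* Copy the current effective setting of one category; 0 on success. */
static int
capture_category(int category, char *dst, size_t dstlen)
{
    const char *cur = setlocale(category, NULL);    /* query only, no change */

    if (cur == NULL || strlen(cur) >= dstlen)
        return -1;
    strcpy(dst, cur);
    return 0;
}

/* What initdb would do: record both hazardous categories. */
int
save_locale(SavedLocale *out)
{
    assert(out != NULL);
    if (capture_category(LC_COLLATE, out->lc_collate, sizeof out->lc_collate) != 0)
        return -1;
    return capture_category(LC_CTYPE, out->lc_ctype, sizeof out->lc_ctype);
}

/* What backend startup would do: force the recorded values back in. */
int
restore_locale(const SavedLocale *in)
{
    assert(in != NULL);
    if (setlocale(LC_COLLATE, in->lc_collate) == NULL)
        return -1;              /* locale not available on this host */
    if (setlocale(LC_CTYPE, in->lc_ctype) == NULL)
        return -1;
    return 0;
}
```

Other categories are deliberately left alone here, matching the proposal that they keep coming from the postmaster environment.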
Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly
on recent RedHat releases, I propose that initdb change "en_US" to "C"
if it finds that setting. (Are there any platforms where there are
non-bogus differences between the two?)
Finally, until we have a really bulletproof solution for LIKE indexing
optimization, I will disable that optimization if --enable-locale is
compiled *and* LC_COLLATE is not C. Better to get "LIKE is slow" bug
reports than "LIKE gives wrong answers" bug reports.
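The check described above might look roughly like the following C sketch. The function names are hypothetical, and treating "POSIX" as a synonym for "C" is an assumption about what setlocale() may report on some systems.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* 1 if the name denotes the plain byte-order locale. */
int
is_c_locale_name(const char *name)
{
    assert(name != NULL);
    return strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0;
}

/* 1 if the LIKE index optimization may be used under the current LC_COLLATE. */
int
like_optimization_allowed(void)
{
    const char *collate = setlocale(LC_COLLATE, NULL);  /* query only */

    return collate != NULL && is_c_locale_name(collate);
}
```

Any other collation name makes the function return 0, which corresponds to falling back to the slow but safe path.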
Comments? Anyone think that initdb should lock down more categories
than just these two?
regards, tom lane
Tom Lane writes:
I propose, therefore, that in an --enable-locale installation, initdb
should save its values for LC_COLLATE and LC_CTYPE in pg_control, and
backend startup should restore these settings from pg_control.
Note that when these are unset there might still be a "catch-all" locale
value coming from the LANG env. var. (or LC_ALL on some systems).
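For reference, a small C sketch of the POSIX precedence among these variables when a program calls setlocale(category, ""). The helper names are hypothetical, and falling back to "C" when nothing is set is an assumption, since POSIX leaves that default to the implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Treat an unset or empty variable as "not specified". */
static int
is_set(const char *value)
{
    return value != NULL && value[0] != '\0';
}

/*
 * Pure helper: pick the locale name that would apply to one category,
 * following the precedence LC_ALL > LC_<category> > LANG.
 */
const char *
pick_locale_name(const char *lc_all, const char *lc_category, const char *lang)
{
    if (is_set(lc_all))
        return lc_all;
    if (is_set(lc_category))
        return lc_category;
    if (is_set(lang))
        return lang;
    return "C";                 /* assumed default when nothing is set */
}

/* Convenience wrapper for LC_COLLATE using the real environment. */
const char *
collate_name_from_env(void)
{
    const char *name = pick_locale_name(getenv("LC_ALL"),
                                        getenv("LC_COLLATE"),
                                        getenv("LANG"));

    assert(name != NULL);       /* pick_locale_name never returns NULL */
    return name;
}
```

So an unset LC_COLLATE with LANG=en_US still yields en_US collation, which is the catch-all case noted above.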
Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly
on recent RedHat releases, I propose that initdb change "en_US" to "C"
if it finds that setting. (Are there any platforms where there are
non-bogus differences between the two?)
There *should* be differences and it is definitely not okay to mix them
up.
Finally, until we have a really bulletproof solution for LIKE indexing
optimization, I will disable that optimization if --enable-locale is
compiled *and* LC_COLLATE is not C. Better to get "LIKE is slow" bug
reports than "LIKE gives wrong answers" bug reports.
(C or POSIX)
I have a question about that optimization: If you have X LIKE 'foo%',
wouldn't it be enough to use X >= 'foo' (which certainly works for any
locale I've ever heard of)? Why do you need the X <= 'foo???' at all?
Comments? Anyone think that initdb should lock down more categories
than just these two?
Not sure whether LC_CTYPE is necessary.
--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
Peter Eisentraut <peter_e@gmx.net> writes:
Tom Lane writes:
I propose, therefore, that in an --enable-locale installation, initdb
should save its values for LC_COLLATE and LC_CTYPE in pg_control, and
backend startup should restore these settings from pg_control.
Note that when these are unset there might still be a "catch-all" locale
value coming from the LANG env. var. (or LC_ALL on some systems).
Actually, what I intend to do while writing pg_control is read the
current effective values via "setlocale(category, NULL)" --- then it
shouldn't matter where they came from, no?
This brings up a question I had just come across while doing further
research: backend/main/main.c does
#ifdef USE_LOCALE
    setlocale(LC_CTYPE, "");    /* take locale information from an environment */
    setlocale(LC_COLLATE, "");
    setlocale(LC_MONETARY, "");
#endif
which seems a little odd --- why not setlocale(LC_ALL, "") ? Karel
Zak said in a thread around 8/15/00 that this is deliberate, but
I don't quite see why.
Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly
on recent RedHat releases, I propose that initdb change "en_US" to "C"
if it finds that setting. (Are there any platforms where there are
non-bogus differences between the two?)
There *should* be differences and it is definitely not okay to mix them
up.
I have now received positive proof that en_US sort order on RedHat is
broken. For example, it asserts
'/root/' < '/root0'
but
'/root/t' > '/root0'
I defy you to find anyone in the US who will say that that is a
reasonable definition of string collation.
Of course, if you prefer the notion of disabling LIKE optimization
on a default RedHat installation, we can go ahead and accept en_US.
But I say it's broken and we shouldn't use it.
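The property at stake can be probed with strcoll(). The following is a sketch with a hypothetical function name: it checks whether appending characters to a string moves it to the other side of a fixed comparison value under the active LC_COLLATE. In byte order this cannot happen as long as the first string is not a prefix of the comparison value.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/*
 * Returns 1 if `a` and its extension `a_ext` fall on the same side of `b`
 * under the active LC_COLLATE, 0 otherwise.
 * Preconditions: a_ext starts with a, and a is not a prefix of b.
 */
int
extension_keeps_side(const char *a, const char *a_ext, const char *b)
{
    int before;
    int after;

    assert(strncmp(a, a_ext, strlen(a)) == 0);  /* a_ext extends a */
    assert(strncmp(a, b, strlen(a)) != 0);      /* a is not a prefix of b */

    before = strcoll(a, b);
    after = strcoll(a_ext, b);
    return (before < 0) == (after < 0);
}
```

Called as extension_keeps_side("/root/", "/root/t", "/root0"), this returns 1 in the C locale. Going by the report above, the RedHat en_US locale would be expected to give 0.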
Finally, until we have a really bulletproof solution for LIKE indexing
optimization, I will disable that optimization if --enable-locale is
compiled *and* LC_COLLATE is not C. Better to get "LIKE is slow" bug
reports than "LIKE gives wrong answers" bug reports.
(C or POSIX)
Do you think there are cases where setlocale(,NULL) will give back
"POSIX" rather than "C"? We can certainly test for either.
I have a question about that optimization: If you have X LIKE 'foo%',
wouldn't it be enough to use X >= 'foo' (which certainly works for any
locale I've ever heard of)? Why do you need the X <= 'foo???' at all?
Because you need a two-sided index constraint, not a one-sided one.
Otherwise you're probably better off doing a sequential scan ---
scanning 50% of the table (on average) via an index will be slower
than sequential.
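To make the two-sided constraint concrete, here is a C sketch that derives an exclusive upper bound for a prefix in plain byte order (C locale only). Incrementing the last byte is a simple stand-in technique chosen for illustration. It is not claimed to be how the backend builds its 'foo???' bound, and the function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/*
 * Build the exclusive upper bound for a prefix scan in byte order:
 * copy the prefix, drop trailing 0xFF bytes, then add one to the last
 * remaining byte.  Returns 0 on success, -1 if no bound exists (empty
 * prefix or all bytes 0xFF) or the buffer is too small.
 */
int
prefix_upper_bound(const char *prefix, char *upper, size_t upperlen)
{
    size_t len;

    assert(prefix != NULL && upper != NULL);
    len = strlen(prefix);
    if (len + 1 > upperlen)
        return -1;
    memcpy(upper, prefix, len + 1);

    /* A 0xFF byte cannot be incremented, so strip it and carry left. */
    while (len > 0 && (unsigned char) upper[len - 1] == 0xFF)
        upper[--len] = '\0';
    if (len == 0)
        return -1;
    upper[len - 1] = (char) ((unsigned char) upper[len - 1] + 1);
    return 0;
}

/* Two-sided test equivalent to X LIKE 'prefix%' under byte ordering. */
int
in_prefix_range(const char *x, const char *prefix, const char *upper)
{
    return strcmp(x, prefix) >= 0 && strcmp(x, upper) < 0;
}
```

For the prefix 'foo' the bound is 'fop', so the index scan covers only X >= 'foo' AND X < 'fop' instead of everything from 'foo' to the end of the index.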
Comments? Anyone think that initdb should lock down more categories
than just these two?
Not sure whether LC_CTYPE is necessary.
I'm not either, but I'm afraid to let it float...
regards, tom lane
Tom Lane writes:
Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly
on recent RedHat releases, I propose that initdb change "en_US" to "C"
if it finds that setting. (Are there any platforms where there are
non-bogus differences between the two?)
There *should* be differences and it is definitely not okay to mix them
up.
I have now received positive proof that en_US sort order on RedHat is
broken. For example, it asserts
'/root/' < '/root0'
but
'/root/t' > '/root0'
I defy you to find anyone in the US who will say that that is a
reasonable definition of string collation.
That's certainly very odd, but Unixware does this too, so it's probably
some sort of standard. And a few other European/Latin locales I tried
also do this.
But here's another example of why C and en_US are different.
peter ~$ cat foo
Delta
écrire
Beta
alpha
gamma
peter ~$ LC_COLLATE=C sort foo
Beta
Delta
alpha
gamma
écrire
peter ~$ LC_COLLATE=en_US sort foo
alpha
Beta
Delta
écrire
gamma
The C locale sorts strictly by character code. But in the en_US locale
the accented letter is put into a "natural" position, and the upper and
lower case letters are grouped together. Intuitively, the en_US order is
the one in which you'd look things up in a dictionary.
This also explains (to me at least) the example you have above: When you
look up words in a dictionary you ignore "funny characters". My American
Heritage Dictionary explains:
: Entries are listed in alphabetical order without taking into account
: spaces or hyphens.
So at least this concept isn't that far out.
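The same contrast can be reproduced in C by sorting with strcmp() versus strcoll(). This is a sketch with hypothetical names. The strcoll() result depends entirely on which locales are installed and which one is active, so no particular output is claimed for that path.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: plain byte order, which is what the C locale gives. */
static int
cmp_bytes(const void *a, const void *b)
{
    return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/* qsort comparator: whatever LC_COLLATE is currently in effect. */
static int
cmp_collate(const void *a, const void *b)
{
    return strcoll(*(const char *const *) a, *(const char *const *) b);
}

/* Sort an array of strings; use_locale selects strcoll over strcmp. */
void
sort_words(const char **words, size_t n, int use_locale)
{
    assert(words != NULL);
    qsort(words, n, sizeof words[0], use_locale ? cmp_collate : cmp_bytes);
}
```

With use_locale set to 0, the ASCII words from the example come out as Beta, Delta, alpha, gamma, because every upper case letter has a lower character code than any lower case letter.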
Do you think there are cases where setlocale(,NULL) will give back
"POSIX" rather than "C"? We can certainly test for either.
I know there are (old) systems that reject LANG=C as invalid locale, but I
don't know what setlocale returns there.
--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
Peter Eisentraut <peter_e@gmx.net> writes:
I have now received positive proof that en_US sort order on RedHat is
broken. For example, it asserts
'/root/' < '/root0'
but
'/root/t' > '/root0'
I defy you to find anyone in the US who will say that that is a
reasonable definition of string collation.
That's certainly very odd, but Unixware does this too, so it's probably
some sort of standard. And a few other European/Latin locales I tried
also do this.
I don't have very many platforms to try, but HPUX does not think that
en_US sorts that way. It may well be standard in some European locales,
but there's a reason why C locale acts the way it does: that behavior is
the accepted one on this side of the pond. Sufficiently well accepted
that it was quite a few years before American programmers noticed there
was any reason to behave differently ;-)
This also explains (to me at least) the example you have above: When you
look up words in a dictionary you ignore "funny characters". My American
Heritage Dictionary explains:
: Entries are listed in alphabetical order without taking into account
: spaces or hyphens.
That's workable for an English dictionary, where symbols other than
letters are (a) rare and (b) usually irrelevant to the meaning. Do
you think anyone would tolerate treating "/" as a noise character in a
listing of Unix filenames, to take one counterexample? Unfortunately,
en_US does so.
This'd be less of a problem if we had support for per-column charset
and locale specifications. There'd be no objection to sorting a column
that contains only (or mostly) words like that. But I've got strong
doubts that the average user of a default RedHat installation expects
*all* data to get sorted that way, or that he wants us to honor a
default that he didn't ask for to the extent of disabling LIKE
optimization to make it work.
I suppose we could do it that way and add a FAQ entry:
Q. Why are my LIKE queries so slow?
A. Change your locale to C, then dump, initdb, reload.
But somehow I don't think that'll go over well...
regards, tom lane
Tom Lane wrote:
that contains only (or mostly) words like that. But I've got strong
doubts that the average user of a default RedHat installation expects
*all* data to get sorted that way, or that he wants us to honor a
default that he didn't ask for to the extent of disabling LIKE
optimization to make it work.
The change in collation for RedHat >6.0 is deliberate -- and conforms to
ISO standards. There was noise in an unmentionable list at an
unmentionable time about why it was this way -- and the result was a
seesaw -- it was almost turned back to 'conventional' collation, but was
then put back into ISO-conforming shape.
Ask Trond (teg@redhat.com) about it.
I suppose we could do it that way and add a FAQ entry:
Q. Why are my LIKE queries so slow?
A. Change your locale to C, then dump, initdb, reload.
But somehow I don't think that'll go over well...
Methinks you are very right. Very right.
I am not at all happy about the 'broken' RedHat locale -- the quick and
dirty solution is to remove or rename '/etc/sysconfig/i18n' -- but that
doesn't cure the root issue.
Oh, and to make matters that much worse, on a RedHat system it doesn't
matter if you build with or without --enable-locale -- locale support is
in the libc used, and locale support gets used regardless of what you
select on the configure line :-(. Been there; distributed that in the
6.5.x 'nl' RPM series.
But it sounds to me like you're on the right track, Tom.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Lamar Owen <lamar.owen@wgcr.org> writes:
Oh, and to make matters that much worse, on a RedHat system it doesn't
matter if you build with or without --enable-locale -- locale support is
in the libc used, and locale support gets used regardless of what you
select on the configure line :-(.
I don't follow. Of course locale support is in libc; where else would
it be? But without --enable-locale, we will never call setlocale().
Surely even RedHat is not so broken that they default to non-C locale
in a program that has not called setlocale()? That directly contravenes
the letter of the ISO C standard, IIRC.
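The ISO C rule referred to here is that a program starts as if setlocale(LC_ALL, "C") had been executed. A small sketch for checking that on a given platform follows. The function name is hypothetical, and it only reports what setlocale() returns for a query.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* 1 if the given category is currently reported as the "C" locale. */
int
category_is_c(int category)
{
    const char *name = setlocale(category, NULL);   /* query, do not change */

    assert(name != NULL);
    return strcmp(name, "C") == 0;
}
```

Calling category_is_c(LC_COLLATE) first thing in main(), before any other setlocale() call, should return 1 on a conforming libc regardless of LANG or LC_ALL in the environment.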
regards, tom lane
Lamar Owen <lamar.owen@wgcr.org> writes:
I am not at all happy about the 'broken' RedHat locale -- the quick and
dirty solution is to remove or rename '/etc/sysconfig/i18n' -- but that
doesn't cure the root issue.
Actually, that suggestion points out that just nailing down LC_COLLATE
at initdb time isn't sufficient, at least not on systems where libc's
locale behavior depends on user-alterable external files. Even with
my proposed initdb change in place, a user could still corrupt his
indices by removing or replacing /etc/sysconfig/i18n. Ugh. Not sure
I see a way around this, though, short of dumping libc and bringing
along our own locale support.
Of course, we might end up doing that anyway to support column-specific
locales. I suspect setlocale() is far too slow on many implementations
to be executed again for every string comparison :-(
regards, tom lane
Possible compromise: let initdb accept en_US, but have it spit out a
warning message:
NOTICE: initializing database with en_US collation order.
If you're not certain that's what you want, then it's probably not what
you want. We recommend you set LC_COLLATE to "C" and re-initdb.
For more information see <appropriate place in admin guide>
Thoughts?
regards, tom lane
Tom Lane wrote:
Lamar Owen <lamar.owen@wgcr.org> writes:
Oh, and to make matters that much worse, on a RedHat system it doesn't
matter if you build with or without --enable-locale -- locale support is
in the libc used, and locale support gets used regardless of what you
select on the configure line :-(.
But without --enable-locale, we will never call setlocale().
Surely even RedHat is not so broken that they default to non-C locale
in a program that has not called setlocale()? That directly contravenes
the letter of the ISO C standard, IIRC.
I just know this -- regression tests failed the same way with the 'nl'
non-locale RPM's as they did (and do) with the regular locale-enabled
RPM's. Collation was the same, regardless of the --enable-locale
setting. I got lots of 'bug' reports about the RPM's failing
regression, giving an unexpected sort order (see the archives -- the
best model thread's start post is:
http://www.postgresql.org/mhonarc/pgsql-hackers/1999-12/msg00587.html).
I was pretty ignorant back then of some of these issues :-).
Apparently RedHat is _that_ broken in that respect (among others).
Thankfully some of RedHat's more egregious faults have been fixed in
7.....
But then again what Unix isn't broken in some respect :-).
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Lamar Owen <lamar.owen@wgcr.org> writes:
Collation was the same, regardless of the --enable-locale
setting. I got lots of 'bug' reports about the RPM's failing
regression, giving an unexpected sort order (see the archives -- the
best model thread's start post is:
http://www.postgresql.org/mhonarc/pgsql-hackers/1999-12/msg00587.html).
Hmm. I reviewed that thread and found this comment from you:
: > Any differences in the environment variables maybe?
:
: In a nutshell, yes. /etc/sysconfig/i18n on the fresh install sets LANG,
: LC_ALL, and LINGUAS all to be "en_US". The upgraded machine at home doesn't
: have an /etc/sysconfig/i18n -- nor does the RH 6.0 box.
That makes it sound like /etc/sysconfig/i18n is not what I'd assumed
(namely, a data file read at runtime by libc) but only a bit of shell
script that sets exported environment variables during bootup. I don't
have that file here, so could you enlighten me as to exactly what it
is/does?
If it is just setting some default environment variables for the system,
then it isn't anything we can't deal with by forcing setlocale() at
postmaster start. That'd make me feel a lot better ;-)
regards, tom lane
Peter Eisentraut <peter_e@gmx.net> writes:
Tom Lane writes:
Possible compromise: let initdb accept en_US, but have it spit out a
warning message:
I certainly don't like treating en_US specially, when in fact all locales
are affected by this.
Well, my thought was that another locale, say en_FR, would be far more
likely to be something that the system's user had explicitly chosen to
use at some point, and thus there's less reason to suppose that he
doesn't know what he's getting into. However, I have no objection to
printing such a complaint whenever the locale is one that will defeat
LIKE optimization --- how does that sound?
regards, tom lane
Tom Lane writes:
Possible compromise: let initdb accept en_US, but have it spit out a
warning message:
NOTICE: initializing database with en_US collation order.
If you're not certain that's what you want, then it's probably not what
you want. We recommend you set LC_COLLATE to "C" and re-initdb.
For more information see <appropriate place in admin guide>
I certainly don't like treating en_US specially, when in fact all locales
are affected by this. You could print a general notice that the database
system will be initialized with a (non-C, non-POSIX) locale and that this
may/will affect the performance in certain cases. Maybe a
--disable-locale switch to initdb as well?
But IMHO we're not in the business of nitpicking or telling people how to
write, install, or use their operating systems when the issue is not a
show-stopper type, but really an aesthetics/convenience issue.
--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
Tom Lane wrote:
Lamar Owen <lamar.owen@wgcr.org> writes:
Collation was the same, regardless of the --enable-locale
setting. I got lots of 'bug' reports about the RPM's failing
Hmm. I reviewed that thread and found this comment from you:
: In a nutshell, yes. /etc/sysconfig/i18n on the fresh install sets LANG,
: LC_ALL, and LINGUAS all to be "en_US". The upgraded machine at home doesn't
: have an /etc/sysconfig/i18n -- nor does the RH 6.0 box.
That makes it sound like /etc/sysconfig/i18n is not what I'd assumed
(namely, a data file read at runtime by libc) but only a bit of shell
script that sets exported environment variables during bootup. I don't
have that file here, so could you enlighten me as to exactly what it
is/does?
Oh, yes, sorry -- /etc/sysconfig/i18n is read during sysinit,
immediately before starting swap (IOW, it's only read the once). On my
RH 6.2 box, it is the following line:
----- /etc/sysconfig/i18n -------
LANG="en_US"
------------- EOF ---------------
It's the same on a fresh RedHat 7.0 install.
If it is just setting some default environment variables for the system,
then it isn't anything we can't deal with by forcing setlocale() at
postmaster start. That'd make me feel a lot better ;-)
Then you need to feel a lot better :-).....
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
At 07:32 PM 11/24/00 -0500, Tom Lane wrote:
Possible compromise: let initdb accept en_US, but have it spit out a
warning message:
NOTICE: initializing database with en_US collation order.
If you're not certain that's what you want, then it's probably not what
you want. We recommend you set LC_COLLATE to "C" and re-initdb.
For more information see <appropriate place in admin guide>
Thoughts?
Are you SURE you want to use en_US collation? [no]
(ask the question, default to no?)
Yes, a question in initdb is ugly, this whole thing is ugly.
- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.
Don Baccus <dhogaza@pacifier.com> writes:
Are you SURE you want to use en_US collation? [no]
(ask the question, default to no?)
Yes, a question in initdb is ugly, this whole thing is ugly.
A question in initdb won't fly for RPM installations, since the RPMs
try to do initdb themselves (or am I wrong about that?)
regards, tom lane
Tom Lane wrote:
Don Baccus <dhogaza@pacifier.com> writes:
Are you SURE you want to use en_US collation? [no]
(ask the question, default to no?)
Yes, a question in initdb is ugly, this whole thing is ugly.
A question in initdb won't fly for RPM installations, since the RPMs
try to do initdb themselves (or am I wrong about that?)
The RPMset initdb's the first time the initscript is run to start
postmaster, not at installation time.
A command-line argument to initdb would suffice to override -- maybe a
'--initlocale' parameter?? Now, what sort of default for
--initlocale.....
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Lamar Owen <lamar.owen@wgcr.org> writes:
A command-line argument to initdb would suffice to override -- maybe a
'--initlocale' parameter??
Hardly need one, when setting LANG or LC_ALL will do just as well.
Now, what sort of default for --initlocale.....
I think your complaints about RedHat's default are right back in your
lap ;-). Do you want to ignore their default, or not?
regards, tom lane
Tom Lane wrote:
Lamar Owen <lamar.owen@wgcr.org> writes:
I think your complaints about RedHat's default are right back in your
lap ;-). Do you want to ignore their default, or not?
Yes, I want to ignore their default. This problem is more than just
cosmetic, thanks to the bugs that sparked this thread.
I can do things in the initscript if necessary. That only helps the
RPM's, though, not those from-source RedHat installations.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Lamar Owen writes:
Yes, I want to ignore their default.
If you want to do that then the infinitely better solution is to compile
without locale support in the first place. (Make the locale-enabled
server a separate package.) Alternatively, set the locale of the postgres
user to POSIX.
I can do things in the initscript if necessary. That only helps the
RPM's, though, not those from-source RedHat installations.
The subject of this whole discussion was IIRC the "default Red Hat
installation". Those who compile from source can always make more
informed decisions about what features to enable.
--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/