Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()

Started by Tom Lane about 22 years ago · 9 messages · hackers

tgl@sss.pgh.pa.us

about 22 years ago

I got tired of reading complaints about how upper/lower don't work with
Unicode, so I went and prototyped a solution. The attached code uses
the C99-standard functions mbstowcs and wcstombs to convert to and from
a "wchar_t[]" representation that can be fed to the also-C99 functions
towupper, towlower, etc.
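
A minimal sketch of the round trip described above; it is not the attached patch. The helper name mb_toupper is hypothetical, and the buffers are sized conservatively: one wide character per input byte, and MB_CUR_MAX bytes per wide character on output.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper, for illustration only: return a malloc'ed,
 * upper-cased copy of the multibyte string src, interpreted according
 * to the current LC_CTYPE locale.  Returns NULL on an invalid multibyte
 * sequence or on allocation failure.  The caller frees the result.
 */
char *
mb_toupper(const char *src)
{
    size_t      len;
    size_t      nwide;
    size_t      outsize;
    size_t      i;
    wchar_t    *wbuf;
    char       *out;

    assert(src != NULL);

    /* Each wide character consumes at least one byte, so len + 1 suffices. */
    len = strlen(src);
    wbuf = malloc((len + 1) * sizeof(wchar_t));
    if (wbuf == NULL)
        return NULL;

    /* Multibyte -> wide; fails if src is not valid in this locale. */
    nwide = mbstowcs(wbuf, src, len + 1);
    if (nwide == (size_t) -1)
    {
        free(wbuf);
        return NULL;
    }

    /* Case-map one wide character at a time. */
    for (i = 0; i < nwide; i++)
        wbuf[i] = (wchar_t) towupper((wint_t) wbuf[i]);

    /* Wide -> multibyte; no character needs more than MB_CUR_MAX bytes. */
    outsize = nwide * MB_CUR_MAX + 1;
    out = malloc(outsize);
    if (out == NULL)
    {
        free(wbuf);
        return NULL;
    }
    if (wcstombs(out, wbuf, outsize) == (size_t) -1)
    {
        free(out);
        free(wbuf);
        return NULL;
    }

    free(wbuf);
    return out;
}
```

A caller would normally run setlocale(LC_CTYPE, "") first; without that the program stays in the "C" locale, where only ASCII letters can be relied on to be case-mapped. The NULL return on a failed mbstowcs is where an "invalid multibyte character" error would be reported.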

This code will only work if the database is running under an LC_CTYPE
setting that implies the same encoding specified by server_encoding.
However, I don't see that as a fatal objection, because in point of fact
the existing upper/lower code assumes the same thing. When they don't
match, this code may deliver an "invalid multibyte character" error
rather than silently producing a wrong answer, but is that really a step
backward?

Note this patch is *not* meant for application to CVS yet. It's not
autoconfiscated. But if you have a platform that has mbstowcs and
friends, please try it and let me know about any portability gotchas
you see.

Also, as a character-set-impaired American, I'm probably not the best
qualified person to judge whether the patch actually does what's wanted.
It seemed to do the right sorts of conversions in my limited testing,
but does it do what *you* want it to do?

regards, tom lane

PS: the patch works against either 7.4 or CVS tip.

Jean-Michel POURE

jm@poure.com

about 22 years ago

In reply to: Tom Lane (#1)

Re: Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()

On Thursday 13 May 2004 04:42, Tom Lane wrote:

I got tired of reading complaints about how upper/lower don't work with
Unicode, so I went and prototyped a solution. The attached code uses
the C99-standard functions mbstowcs and wcstombs to convert to and from
a "wchar_t[]" representation that can be fed to the also-C99 functions
towupper, towlower, etc.

This is really good news, thanks.
Jean-Michel Pouré

Markus Bertheau

twanger@bluetwanger.de

about 22 years ago

In reply to: Tom Lane (#1)

Re: Rough draft for Unicode-aware

On Thu, 13.05.2004, at 04:42, Tom Lane wrote:

But if you have a platform that has mbstowcs and
friends, please try it and let me know about any portability gotchas
you see.

I can't test it, because with a clean 7.4.2 with the patch applied I get:
[bert@yarrow postgresql-7.4.2]$ LANG=C make install
make -C doc install
make[1]: Entering directory `/home/bert/src/postgresql-7.4.2/doc'
mkdir /home/bertheau/pg742
mkdir /home/bertheau/pg742/doc
mkdir /home/bertheau/pg742/doc/postgresql
mkdir /home/bertheau/pg742/doc/postgresql/html
make[1]: *** [installdirs] Error 1
make[1]: Leaving directory `/home/bert/src/postgresql-7.4.2/doc'
make: *** [install] Error 2
[bert@yarrow postgresql-7.4.2]$

make and make check worked ok.

--
Markus Bertheau <twanger@bluetwanger.de>

Marko Karppinen

marko@karppinen.fi

about 22 years ago

In reply to: Tom Lane (#1)

Re: Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()

Tom Lane wrote:

This code will only work if the database is running under an LC_CTYPE
setting that implies the same encoding specified by server_encoding.
However, I don't see that as a fatal objection, because in point of fact
the existing upper/lower code assumes the same thing.

I think this interaction between the locale and server_encoding is
confusing. Is there any use case for running an incompatible mix?
If not, would it not make sense to fetch initdb's default database
encoding with nl_langinfo(CODESET) instead of using SQL_ASCII?

initdb could even emit a warning if the --encoding option was
used without also specifying --no-locale.

Using nl_langinfo(CODESET) was discussed and quietly dismissed a
year ago (although the topic was the client encoding back then).
But I think that the idea is worth revisiting because it would
allow UPPER() and LOWER() to work correctly with international
alphabets -- out of the box and without configuration -- on a
wide variety of modern systems.
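
For reference, a minimal sketch of the nl_langinfo(CODESET) lookup being proposed, assuming a POSIX system. The helper name codeset_for_locale is hypothetical, and nothing here is taken from initdb.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: switch LC_CTYPE to the
 * named locale and return the codeset string the C library reports for
 * it.  Returns NULL if the locale cannot be selected.
 *
 * Note that this changes the process-wide locale, and that the returned
 * string may be overwritten by later setlocale() or nl_langinfo() calls.
 */
const char *
codeset_for_locale(const char *locale_name)
{
    assert(locale_name != NULL);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return NULL;

    return nl_langinfo(CODESET);
}
```

Passing "" selects the locale from the environment, which is the case of interest for a default encoding. The spelling of the returned string varies between platforms, so it cannot be used as an encoding name without some translation step.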

Peter Eisentraut

peter_e@gmx.net

about 22 years ago

In reply to: Marko Karppinen (#4)

Re: Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()

Marko Karppinen wrote:

I think this interaction between the locale and server_encoding is
confusing. Is there any use case for running an incompatible mix?
If not, would it not make sense to fetch initdb's default database
encoding with nl_langinfo(CODESET) instead of using SQL_ASCII?

This would be fine and dandy if we had any sort of idea about what sort
of strings nl_langinfo(CODESET) returns and how to map them to our
encoding names.

Marko Karppinen

marko@karppinen.fi

about 22 years ago

In reply to: Peter Eisentraut (#5)

Re: Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()

Marko Karppinen wrote:

I think this interaction between the locale and server_encoding is
confusing. Is there any use case for running an incompatible mix?
If not, would it not make sense to fetch initdb's default database
encoding with nl_langinfo(CODESET) instead of using SQL_ASCII?

Peter Eisentraut wrote:

This would be fine and dandy if we had any sort of idea about what sort
of strings nl_langinfo(CODESET) returns and how to map them to our
encoding names.

Karel Zak posted an answer to this last year, here on pgsql-hackers:
http://archives.postgresql.org/pgsql-hackers/2003-05/msg00744.php
It's not complete, but it's sort of an idea.

The code is under LGPL, but copyright doesn't reach down to the
actual information about the encoding strings used by various
operating systems, so it's possible to reappropriate. I'd imagine
that it covers many, if not most, of the likely cases.
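
A rough sketch of what such a mapping could look like. This does not reproduce the table from the linked message: the entries are a handful of illustrative assumptions about codeset spellings, the names codeset_table and codeset_to_encoding are hypothetical, and the encoding names are spelled as in current PostgreSQL releases, which may differ from older ones.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>

/* One codeset spelling and the encoding name it is assumed to imply. */
struct codeset_map
{
    const char *codeset;
    const char *pg_encoding;
};

/* Illustrative entries only; a real table would need many more. */
static const struct codeset_map codeset_table[] = {
    {"UTF-8", "UTF8"},
    {"ISO-8859-1", "LATIN1"},
    {"ISO8859-1", "LATIN1"},
    {"EUC-JP", "EUC_JP"},
    {"eucJP", "EUC_JP"},
    {"ANSI_X3.4-1968", "SQL_ASCII"},
    {"US-ASCII", "SQL_ASCII"},
};

/*
 * Hypothetical lookup: return the encoding name for a codeset string,
 * compared case-insensitively, or NULL if the codeset is not in the
 * table.  A NULL result is the signal to fall back or to warn.
 */
const char *
codeset_to_encoding(const char *codeset)
{
    size_t      i;

    assert(codeset != NULL);

    for (i = 0; i < sizeof(codeset_table) / sizeof(codeset_table[0]); i++)
    {
        if (strcasecmp(codeset, codeset_table[i].codeset) == 0)
            return codeset_table[i].pg_encoding;
    }
    return NULL;
}
```

Returning NULL for an unknown codeset keeps the heuristic cautious: the caller can fall back to the existing default rather than guess.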

The current situation of upper/lower/collating/etc just being
broken by default on many non-C locales is bad enough to warrant
bailing out during initdb when this situation is detected
(with a reasonably cautious heuristic).

It used to be that you got what you deserved if you were stupid
enough to define a non-C, non-ASCII-based locale. You had only
yourself to blame for everything breaking. These days, however,
millions of systems get shipped and installed with UTF-8 locales
on by default, so it's not possible to portray this as a user error.

Requiring every one of these people to configure initdb's encoding
manually would be harsh, however, so I think that a heuristic
that'd work with most modern systems would strike an appropriate
balance of correctness and path-of-least-surprise.

Tatsuo Ishii

ishii@postgresql.org

about 22 years ago

In reply to: Marko Karppinen (#4)

Re: Rough draft for Unicode-aware

initdb could even emit a warning if the --encoding option was
used without also specifying --no-locale.

Please don't do that. Most Asian charsets do not work with a
locale-enabled PostgreSQL installation, i.e. it returns WRONG SELECT
results. I've told this to Japanese users hundreds of times when they
ask me why SELECT returns wrong results. If that kind of warning is
installed, I think more Japanese users will be confused.
--
Tatsuo Ishii

Marko Karppinen

marko@karppinen.fi

about 22 years ago

In reply to: Tatsuo Ishii (#7)

Re: Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()

Tatsuo Ishii wrote:

initdb could even emit a warning if the --encoding option was
used without also specifying --no-locale.

Please don't do that. Most Asian charsets do not work with a
locale-enabled PostgreSQL installation, i.e. it returns WRONG SELECT
results. I've told this to Japanese users hundreds of times when they
ask me why SELECT returns wrong results. If that kind of warning is
installed, I think more Japanese users will be confused.

You've advocated a default of --no-locale yourself for this reason.
If using a Japanese --encoding setting without --no-locale emitted
a warning suggesting the use of --no-locale, I'd imagine you wouldn't
have had to give human support to most of those hundreds of people?

Wouldn't that be a halfway point to your goal?

Tom Lane

tgl@sss.pgh.pa.us

about 22 years ago

In reply to: Marko Karppinen (#4)

Re: Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()

Marko Karppinen <marko@karppinen.fi> writes:

I think this interaction between the locale and server_encoding is
confusing. Is there any use case for running an incompatible mix?

In hindsight we should probably not have invented per-database encoding
selection, since it's so fragile to use in combination with cluster-wide
locale settings. However I believe that a lot of people in the Far East
are using multiple database encodings successfully, since they don't
much care about upper()/lower() etc ...

The long-term answer is to write our own locale support so we can
eliminate the cluster-wide-locale restriction. In the meantime I don't
want to remove flexibility that is useful to some people.

regards, tom lane