UPPER()/LOWER() and UTF-8

Started by Alexey Mahotkin over 22 years ago, 9 messages, hackers

alexm@w-m.ru

over 22 years ago

Hello,

I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with UNICODE
database encoding), and all is almost well, except that UPPER() and
LOWER() seem to ignore locale.

I searched the sources a couple of times, but could not find where
UPPER()/LOWER() are implemented. Could you please point me in
the right direction?

I'll try to understand and fix that. (But maybe patches for that
exist? Or maybe FreeBSD 4.8-RELEASE utf-8 locales are broken in that
respect?)

Thanks a lot,

--alexm

Tom Lane

tgl@sss.pgh.pa.us

over 22 years ago

In reply to: Alexey Mahotkin (#1)

Re: UPPER()/LOWER() and UTF-8

Alexey Mahotkin <alexm@w-m.ru> writes:

I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with UNICODE
database encoding), and all is almost well, except that UPPER() and
LOWER() seem to ignore locale.

upper/lower aren't going to work desirably in any multi-byte character
set encoding. I think Peter E. is looking into what it would take to
fix this for 7.5, but at present you are going to need to use a
single-byte encoding within the server. (Nothing to stop you from using
UTF-8 on the client side though.)

regards, tom lane

Karel Zak

zakkr@zf.jcu.cz

over 22 years ago

In reply to: Tom Lane (#2)

Re: UPPER()/LOWER() and UTF-8

On Tue, Nov 04, 2003 at 04:52:33PM -0500, Tom Lane wrote:

Alexey Mahotkin <alexm@w-m.ru> writes:

I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with UNICODE
database encoding), and all is almost well, except that UPPER() and
LOWER() seem to ignore locale.

upper/lower aren't going to work desirably in any multi-byte character
set encoding. I think Peter E. is looking into what it would take to

It's a PostgreSQL problem, not a UTF problem: the standard PostgreSQL
text functions know nothing about the encoding of their arguments, so
they cannot switch to a different method (for example, UTF-aware
lower/upper) for working with strings.

Maybe we could extend the internal "text" datatype a little and, like
VARSIZE(), have a VARENCODING(). Maybe Peter already has a better idea.

fix this for 7.5, but at present you are going to need to use a
single-byte encoding within the server. (Nothing to stop you from using
UTF-8 on the client side though.)

You can use multibyte on the server side too, but you must use, for
example, the convert() function on the upper/lower arguments.

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

Alexey Mahotkin

alexm@w-m.ru

over 22 years ago

In reply to: Tom Lane (#2)

Re: UPPER()/LOWER() and UTF-8

"TL" == Tom Lane <tgl@sss.pgh.pa.us> writes:

TL> Alexey Mahotkin <alexm@w-m.ru> writes:

I'm running Postgresql 7.3.4 with ru_RU.UTF-8 locale (with
UNICODE database encoding), and all is almost well, except that
UPPER() and LOWER() seem to ignore locale.

TL> upper/lower aren't going to work desirably in any multi-byte
TL> character set encoding.

Can you please point me at their implementation? I do not understand
why that's impossible.

TL> I think Peter E. is looking into what
TL> it would take to fix this for 7.5, but at present you are
TL> going to need to use a single-byte encoding within the server.
TL> (Nothing to stop you from using UTF-8 on the client side
TL> though.)

Thanks,

--alexm

Tom Lane

tgl@sss.pgh.pa.us

over 22 years ago

In reply to: Alexey Mahotkin (#4)

Re: UPPER()/LOWER() and UTF-8

Alexey Mahotkin <alexm@w-m.ru> writes:

TL> upper/lower aren't going to work desirably in any multi-byte
TL> character set encoding.

Can you please point me at their implementation? I do not understand
why that's impossible.

Because they use <ctype.h>'s toupper() and tolower() functions, which
only work on single-byte characters.
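
A minimal sketch of why a byte-at-a-time loop leaves UTF-8 text untouched. It is written in Python rather than the backend's C, and `bytewise_upper` is a hypothetical stand-in that models toupper() as it behaves in the C locale; it is not the code in the PostgreSQL sources.

```python
def bytewise_upper(data: bytes) -> bytes:
    """Uppercase one byte at a time, the way a toupper() loop does in the C locale."""
    out = bytearray()
    for b in data:
        # Only ASCII a-z has an uppercase form here; every other byte,
        # including each byte of a multi-byte UTF-8 sequence, is copied as-is.
        out.append(b - 0x20 if 0x61 <= b <= 0x7A else b)
    return bytes(out)


text = "привет, world"
encoded = text.encode("utf-8")  # each Cyrillic letter becomes two bytes

bytewise = bytewise_upper(encoded).decode("utf-8")
charwise = text.upper()  # maps whole code points, not bytes

print(bytewise)  # привет, WORLD
print(charwise)  # ПРИВЕТ, WORLD
```

The ASCII part is uppercased either way; only the character-aware version touches the Cyrillic letters.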

There has been some discussion of using <wctype.h> where available, but
this has a number of issues, notably figuring out the correct mapping
from the server string encoding (eg UTF-8) to unpacked wide characters.
At minimum we'd need to know which charset the locale setting is
expecting, and there doesn't seem to be a portable way to find that out.
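
The mapping problem can be sketched as follows, again in Python rather than C. The helper `upper_in_encoding` is hypothetical; its decode step stands in for unpacking the server string into wide characters, and the point is only that this step needs the right charset name supplied from somewhere.

```python
def upper_in_encoding(data: bytes, encoding: str) -> bytes:
    """Decode to code points, map case per character, then re-encode."""
    chars = data.decode(encoding)  # the "unpack to wide characters" step
    return chars.upper().encode(encoding)


koi8 = "привет".encode("koi8_r")

# With the correct charset, the round trip yields the uppercase string.
right = upper_in_encoding(koi8, "koi8_r").decode("koi8_r")

# With a wrong guess about the charset, the bytes are not even valid UTF-8.
try:
    upper_in_encoding(koi8, "utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

print(right)               # ПРИВЕТ
print(wrong_guess_failed)  # True
```

Here the caller passes the encoding explicitly, which is exactly the piece of information that is hard to obtain portably from the locale setting.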

IIRC, Peter thinks we must abandon use of libc's locale functionality
altogether and write our own locale layer before we can really have all
the locale-specific functionality we want.

regards, tom lane

Alexey Mahotkin

alexm@w-m.ru

over 22 years ago

In reply to: Tom Lane (#5)

Re: UPPER()/LOWER() and UTF-8

"TL" == Tom Lane <tgl@sss.pgh.pa.us> writes:

TL> upper/lower aren't going to work desirably in any multi-byte
TL> character set encoding.

Can you please point me at their implementation? I do not
understand why that's impossible.

TL> Because they use <ctype.h>'s toupper() and tolower()
TL> functions, which only work on single-byte characters.

Aha, that's in src/backend/utils/adt/formatting.c, right?

Yes, I see, it goes byte by byte and uses toupper(). I believe we
could look at the locale, and if it is UTF-8, then use (or copy)
e.g. g_utf8_strup/strdown, right?

http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-strup

I believe that patch could be written in a matter of hours.

TL> There has been some discussion of using <wctype.h> where
TL> available, but this has a number of issues, notably figuring
TL> out the correct mapping from the server string encoding (eg
TL> UTF-8) to unpacked wide characters. At minimum we'd need to
TL> know which charset the locale setting is expecting, and there
TL> doesn't seem to be a portable way to find that out.

TL> IIRC, Peter thinks we must abandon use of libc's locale
TL> functionality altogether and write our own locale layer before
TL> we can really have all the locale-specific functionality we
TL> want.

I believe that native Unicode strings (together with human language
handling) should be introduced as an (almost) separate data type (which
has nothing to do with locale), but maybe that's blue-sky.

--alexm

Hannu Krosing

hannu@tm.ee

over 22 years ago

In reply to: Alexey Mahotkin (#6)

Re: UPPER()/LOWER() and UTF-8

Alexey Mahotkin wrote on Wed, 05.11.2003 at 17:11:

Aha, that's in src/backend/utils/adt/formatting.c, right?

Yes, I see, it goes byte by byte and uses toupper(). I believe we
could look at the locale, and if it is UTF-8, then use (or copy)
e.g. g_utf8_strup/strdown, right?

http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-strup

I believe that patch could be written in a matter of hours.

TL> There has been some discussion of using <wctype.h> where
TL> available, but this has a number of issues, notably figuring
TL> out the correct mapping from the server string encoding (eg
TL> UTF-8) to unpacked wide characters. At minimum we'd need to
TL> know which charset the locale setting is expecting, and there
TL> doesn't seem to be a portable way to find that out.

TL> IIRC, Peter thinks we must abandon use of libc's locale
TL> functionality altogether and write our own locale layer before
TL> we can really have all the locale-specific functionality we
TL> want.

I believe that native Unicode strings (together with human language
handling) should be introduced as an (almost) separate data type (which
has nothing to do with locale), but maybe that's blue-sky.

They should have nothing to do with the _system_ locale, but you can do
neither UPPER()/LOWER() nor ORDER BY unless you know the locale. It is
just that the locale should either be a property of the column or be
given in the SQL statement.
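
A small sketch of why case mapping needs a locale even for pure Unicode strings. The Turkish dotted/dotless i is a well-known case chosen for illustration and does not come from this thread; `upper_for_locale` is a hypothetical helper that takes the locale as an explicit argument, the way a per-column or per-statement locale would be supplied.

```python
def upper_for_locale(text: str, locale: str) -> str:
    """Uppercase text using an explicitly supplied locale name."""
    language = locale.split("_")[0]
    if language == "tr":
        # In Turkish, lowercase "i" uppercases to dotted capital I (U+0130),
        # not to ASCII "I"; handle it before the default mapping.
        text = text.replace("i", "\u0130")
    return text.upper()


print(upper_for_locale("istanbul", "en_US"))  # ISTANBUL
print(upper_for_locale("istanbul", "tr_TR"))  # İSTANBUL
```

The same input string gives two different correct answers, so the function cannot pick one without being told the locale.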

I guess one could write UCHAR, UVARCHAR, UTEXT types based on ICU.

-------------
Hannu

Michael Brusser

michael@synchronicity.com

over 22 years ago

In reply to: Karel Zak (#3)

database is shutting down

We have a customer who reports a weird problem.
Too often the App. Server fails to connect to the database.
Sometimes the scheduled vacuum fails as well.
The error message is always the same:
FATAL: The database system is shutting down
But from what I see no one is trying to shut down the database at this time.

I am still waiting for the database-log to see if I can find a clue there,
but I wonder if someone knows what can make the database respond this way.

This is Pg 7.3.2, on HP 11.0, using the Unix Domain Socket.
Thank you,
Mike

Tom Lane

tgl@sss.pgh.pa.us

over 22 years ago

In reply to: Michael Brusser (#8)

Re: database is shutting down

Michael Brusser <michael@synchronicity.com> writes:

The error message is always the same:
FATAL: The database system is shutting down
But from what I see no one is trying to shut down the database at this time.

*Something* has sent the postmaster a shutdown signal --- either
SIGINT or SIGTERM. Look around and find out what.

regards, tom lane