lower and upper not UTF-8 safe

Started by Julian Satchell over 22 years ago | 4 messages | hackers
#1 Julian Satchell
j.satchell@eris.qinetiq.com

The implementations of lower and upper in
src/backend/utils/adt/oracle_compat.c use the single-byte macros from
ctype.h to alter individual bytes in the text string.

If the text is UTF-8 encoded this is totally wrong, and will result in
an invalid string that is no longer UTF-8.
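
To illustrate the failure mode, here is a minimal sketch. It does not use the real ctype.h macros: latin1_tolower() is a hypothetical stand-in for what tolower() would do under an ISO-8859-1 locale, so that the example does not depend on which locales are installed. The validity check is structural only (lead byte plus the right number of continuation bytes).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for tolower() under an ISO-8859-1 locale:
 * A-Z and 0xC0-0xDE (except 0xD7, the multiplication sign) gain 0x20. */
unsigned char latin1_tolower(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        return (unsigned char) (c + 0x20);
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
        return (unsigned char) (c + 0x20);
    return c;
}

/* Lowercase a buffer one byte at a time, as a ctype.h-based loop would. */
void bytewise_lower(unsigned char *s, size_t len)
{
    assert(s != NULL);
    for (size_t i = 0; i < len; i++)
        s[i] = latin1_tolower(s[i]);
}

/* Structural UTF-8 check: each lead byte must be followed by the right
 * number of continuation bytes (10xxxxxx). Overlong forms and surrogates
 * are not rejected here. */
bool utf8_is_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        size_t need;

        if (s[i] < 0x80)
            need = 0;
        else if ((s[i] & 0xE0) == 0xC0)
            need = 1;
        else if ((s[i] & 0xF0) == 0xE0)
            need = 2;
        else if ((s[i] & 0xF8) == 0xF0)
            need = 3;
        else
            return false;           /* stray continuation or bad lead byte */

        if (need > len - i - 1)
            return false;           /* sequence is truncated */
        for (size_t k = 1; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;
        i += need + 1;
    }
    return true;
}
```

For example, the UTF-8 encoding of U+00C9 is the two bytes 0xC3 0x89. The byte-wise loop turns 0xC3 into 0xE3, which is the lead byte of a three-byte sequence, so the result no longer passes utf8_is_well_formed().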

The code is basically unchanged in both 7.3.4 and CVS tip.

I can see two options: remove access to these functions if the database
is running UNICODE, or rewrite/extend them so the correct thing happens.
The easiest way to do this is probably to convert the UTF-8 to a fixed
width encoding (say UCS-4), perform the lower operation to get a new
set of character indices, then convert back to UTF-8. The byte length of
the output might even be different from the input, although I don't know
of an example where this happens.
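
The decode, map, re-encode approach described above can be sketched roughly as follows. All function names are hypothetical, the input is assumed to be well-formed UTF-8, and toy_lower() is a tiny illustrative table (ASCII, the Latin-1 capitals, and KELVIN SIGN), not a real Unicode case mapping. KELVIN SIGN (U+212A) is included because it lowercases to ASCII 'k', which is one case where the byte length does change (three bytes to one).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence (assumed well-formed) into *cp.
 * Returns the number of bytes consumed. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t) (s[0] & 0x1F) << 6) | (uint32_t) (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t) (s[0] & 0x0F) << 12) |
              ((uint32_t) (s[1] & 0x3F) << 6) |
              (uint32_t) (s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t) (s[0] & 0x07) << 18) |
          ((uint32_t) (s[1] & 0x3F) << 12) |
          ((uint32_t) (s[2] & 0x3F) << 6) |
          (uint32_t) (s[3] & 0x3F);
    return 4;
}

/* Encode one code point as UTF-8. Returns the number of bytes written. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}

/* Toy lowercase mapping on code points; a real one needs Unicode tables. */
uint32_t toy_lower(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + 0x20;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7)
        return cp + 0x20;
    if (cp == 0x212A)               /* KELVIN SIGN */
        return 'k';
    return cp;
}

/* Lowercase UTF-8 text by working on whole code points.
 * With this toy table no mapping grows, so out needs len bytes;
 * a real mapping could need more. Returns the output length. */
size_t utf8_lower(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    while (i < len) {
        uint32_t cp;

        i += utf8_decode(in + i, &cp);
        o += utf8_encode(toy_lower(cp), out + o);
    }
    return o;
}
```

Because each character is mapped as a unit, the output is always re-encoded as valid UTF-8, and the caller has to use the returned length rather than assume it equals the input length.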

At the very least, the documentation for lower and upper in the manual
should warn the user not to use them in a UNICODE database.

--
Julian Satchell <j.satchell@eris.qinetiq.com>
QinetiQ

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Julian Satchell (#1)
Re: lower and upper not UTF-8 safe

Julian Satchell <j.satchell@eris.qinetiq.com> writes:

> The implementations of lower and upper in
> src/backend/utils/adt/oracle_compat.c use the single-byte macros from
> ctype.h to alter individual bytes in the text string.
>
> If the text is UTF-8 encoded this is totally wrong, and will result in
> an invalid string that is no longer UTF-8.

Only if you use a locale that is assuming a character set that is not
UTF8 but does have characters with the high bit set. I'm not sure that
we can do anything to defend against locale/charset mismatch.
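
As a rough illustration of this point, the sketch below runs a byte-wise toupper() loop under the "C" locale, where only a-z have uppercase forms. bytewise_upper() is a hypothetical helper, not the code in oracle_compat.c; it also counts how many bytes with the high bit set were altered.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Uppercase a buffer byte by byte with toupper(), as a ctype.h-based
 * loop does. Returns how many bytes >= 0x80 were changed. */
size_t bytewise_upper(unsigned char *s, size_t len)
{
    size_t high_changed = 0;

    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char u = (unsigned char) toupper(s[i]);

        if (s[i] >= 0x80 && u != s[i])
            high_changed++;
        s[i] = u;
    }
    return high_changed;
}
```

After setlocale(LC_CTYPE, "C"), calling this on the UTF-8 bytes of "café" uppercases the ASCII letters and leaves the two bytes of the accented character alone. The string stays valid UTF-8, although the non-ASCII character is not case-converted. Corruption needs a locale that treats some bytes above 0x7F as letters.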

regards, tom lane

#3 Karel Zak
zakkr@zf.jcu.cz
In reply to: Tom Lane (#2)
Re: lower and upper not UTF-8 safe

On Mon, Aug 04, 2003 at 05:03:02PM -0400, Tom Lane wrote:

> Julian Satchell <j.satchell@eris.qinetiq.com> writes:
>
> > The implementations of lower and upper in
> > src/backend/utils/adt/oracle_compat.c use the single-byte macros from
> > ctype.h to alter individual bytes in the text string.
> >
> > If the text is UTF-8 encoded this is totally wrong, and will result in
> > an invalid string that is no longer UTF-8.
>
> Only if you use a locale that is assuming a character set that is not
> UTF8 but does have characters with the high bit set. I'm not sure that
> we can do anything to defend against locale/charset mismatch.

We can try to detect the locale's usual charset, compare it with the actual
charset used in the DB, and send a NOTICE to the FE if they mismatch. The
problem is the portability of the charset detection code, because there are
differences between OSes. It's best if libc supports the
nl_langinfo(CODESET) call. The complete charset detection code can be found
in libcharset or glib (I use a simplification of that code and it's
300 lines :-).
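
A minimal sketch of the simple case, assuming a POSIX libc that provides nl_langinfo(CODESET). The function names are hypothetical, and the name comparison only ignores case, '-' and '_'; it does not handle real aliases such as LATIN1 versus ISO-8859-1, which is part of why complete detection code is much longer.

```c
#include <assert.h>
#include <ctype.h>
#include <langinfo.h>   /* POSIX, not ISO C */
#include <locale.h>
#include <stdbool.h>
#include <stddef.h>

/* Compare two charset names, ignoring case and the characters '-' and
 * '_', so that "UTF-8" matches "utf8". */
bool charset_names_match(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    for (;;) {
        while (*a == '-' || *a == '_')
            a++;
        while (*b == '-' || *b == '_')
            b++;
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return false;
        if (*a == '\0')
            return true;            /* both strings ended together */
        a++;
        b++;
    }
}

/* Codeset name of the LC_CTYPE locale taken from the environment. */
const char *locale_codeset(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}

/* True if the given database encoding name (hypothetical input) matches
 * the codeset of the current locale. */
bool db_encoding_matches_locale(const char *db_encoding)
{
    return charset_names_match(db_encoding, locale_codeset());
}
```

A caller could issue the NOTICE when db_encoding_matches_locale() returns false. On systems without langinfo.h this does not compile, which is the portability problem in question.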

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Karel Zak (#3)
Re: lower and upper not UTF-8 safe

Karel Zak <zakkr@zf.jcu.cz> writes:

> On Mon, Aug 04, 2003 at 05:03:02PM -0400, Tom Lane wrote:
>
> > Only if you use a locale that is assuming a character set that is not
> > UTF8 but does have characters with the high bit set. I'm not sure that
> > we can do anything to defend against locale/charset mismatch.
>
> We can try to detect the locale's usual charset, compare it with the actual
> charset used in the DB, and send a NOTICE to the FE if they mismatch. The
> problem is the portability of the charset detection code, because there are
> differences between OSes.

Yeah. If we had a portable, reliable way of testing for incompatibility,
I'd be in favor of just forbidding creation of databases that have
encoding choices incompatible with the server's LC_COLLATE/LC_CTYPE
settings. (If we ever allow those settings to be more dynamic than they
are, then the test would have to be made somewhere else, but for now it'd
be sufficient to put it in CREATE DATABASE.)

But I don't see a portable way to find out what charset a locale
supports. nl_langinfo() isn't in the C standard at all.

regards, tom lane