Unicode problems on IRC
Hey guys,
The 'Unicode characters above 0x10000' issue keeps rearing its ugly head
in the IRC channel. I propose that it be fixed, even backported...
This is John Hansen's most recent patch to fix it:
http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php
And from what I can tell it was committed, then reverted because it
wasn't a "bug". It was going to go in for 8.1.
We on the channel are starting to think that it is in fact a bug. There
are people with legitimately utf-8 encoded XML documents that they
cannot store in PostgreSQL. Apparently in the distant past, Unicode was
limited to 0x10000, but then was extended.
Perhaps we can reopen this case...
Chris
Christopher Kings-Lynne wrote:
Hey guys,
The 'Unicode characters above 0x10000' issue keeps rearing its ugly head
in the IRC channel. I propose that it be fixed, even backported...
This is John Hansen's most recent patch to fix it:
http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php
And from what I can tell it was committed, then reverted because it
wasn't a "bug". It was going to go in for 8.1.
We on the channel are starting to think that it is in fact a bug. There
are people with legitimately utf-8 encoded XML documents that they
cannot store in PostgreSQL. Apparently in the distant past, Unicode was
limited to 0x10000, but then was extended.
Perhaps we can reopen this case...
Uh, I thought we fixed this another way, by not using Unicode-aware
functions for upper/lower/initcap when the locale is "C" or "POSIX".
That is backpatched to 8.0.X. Does that not fix the problem reported?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On 2005-04-09, Bruce Momjian <pgman@candle.pha.pa.us> wrote:
Uh, I thought we fixed this another way, by not using Unicode-aware
functions for upper/lower/initcap when the locale is "C" or "POSIX".
That is backpatched to 8.0.X. Does that not fix the problem reported?
Unicode values over 0xFFFF are simply not accepted on input, so no, it
doesn't fix the problem. What do upper/lower/initcap have to do with it?
textin() unconditionally calls pg_verifymbstr, which in turn explicitly
checks for such values (if the encoding is UTF8) and throws ERROR if it
finds them.
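For illustration, a simplified sketch (not the actual pg_verifymbstr code) of
the kind of check being described: any UTF-8 lead byte of 0xF0 or above starts
a sequence for a code point of U+10000 or higher, so rejecting those lead
bytes rejects exactly the characters in question.

/* Simplified sketch, NOT the pg_verifymbstr source: a lead byte of
 * 0xF0 or above starts a sequence encoding a code point of U+10000 or
 * higher (or is not a legal lead byte at all). */
#include <stdbool.h>

static bool
utf8_within_bmp(const unsigned char *s, int len)
{
    int     i;

    for (i = 0; i < len; i++)
    {
        if (s[i] >= 0xF0)
            return false;       /* reported as an ERROR by the server */
    }
    return true;
}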
--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Bruce Momjian
Sent: Sunday, April 10, 2005 8:18 AM
To: Christopher Kings-Lynne
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Unicode problems on IRC
Christopher Kings-Lynne wrote:
Hey guys,
The 'Unicode characters above 0x10000' issue keeps rearing its ugly
head in the IRC channel. I propose that it be fixed, even backported...
This is John Hansen's most recent patch to fix it:
http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php
And from what I can tell it was committed, then reverted because it
wasn't a "bug". It was going to go in for 8.1.
We on the channel are starting to think that it is in fact a bug.
There are people with legitimately utf-8 encoded XML documents
that they cannot store in PostgreSQL. Apparently in the distant past,
Unicode was limited to 0x10000, but then was extended.
Perhaps we can reopen this case...
Uh, I thought we fixed this another way, by not using
Unicode-aware functions for upper/lower/initcap when the
locale is "C" or "POSIX".
That is backpatched to 8.0.X. Does that not fix the problem reported?
No, as Andrew said, what this patch does is allow values > 0xffff and
at the same time validate the input to make sure it's valid utf8.
... John
"John Hansen" <john@geeknet.com.au> writes:
That is backpatched to 8.0.X. Does that not fix the problem reported?
No, as Andrew said, what this patch does is allow values > 0xffff and
at the same time validate the input to make sure it's valid utf8.
The impression I get is that most of the 'Unicode characters above
0x10000' reports we've seen did not come from people who actually needed
more-than-16-bit Unicode codepoints, but from people who had screwed up
their encoding settings and were trying to tell the backend that Latin1
was Unicode or some such. So I'm a bit worried that extending the
backend support to full 32-bit Unicode will do more to mask encoding
mistakes than it will do to create needed functionality.
Not that I'm against adding the functionality. I'm just doubtful that
the reports we've seen really indicate that we need it, or that adding
it will cut down on the incidence of complaints :-(
regards, tom lane
Tom Lane wrote:
"John Hansen" <john@geeknet.com.au> writes:
That is backpatched to 8.0.X. Does that not fix the problem reported?
No, as Andrew said, what this patch does is allow values > 0xffff and
at the same time validate the input to make sure it's valid utf8.
The impression I get is that most of the 'Unicode characters above
0x10000' reports we've seen did not come from people who actually needed
more-than-16-bit Unicode codepoints, but from people who had screwed up
their encoding settings and were trying to tell the backend that Latin1
was Unicode or some such. So I'm a bit worried that extending the
backend support to full 32-bit Unicode will do more to mask encoding
mistakes than it will do to create needed functionality.
Yes, that was my impression too.
The upper/lower/initcap issue was that some operating systems were
testing unicode values even if the locale was set to C. That is fixed in
8.0.2, but I now see this is a different problem.
Not that I'm against adding the functionality. I'm just doubtful that
the reports we've seen really indicate that we need it, or that adding
it will cut down on the incidence of complaints :-(
Yea, that was my question too. I figured Japanese or Chinese users would
be using these longer values, and if they are fine, why are others having
problems. It would be great to find a test case that actually shows the
failure.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On 2005-04-10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The impression I get is that most of the 'Unicode characters above
0x10000' reports we've seen did not come from people who actually needed
more-than-16-bit Unicode codepoints, but from people who had screwed up
their encoding settings and were trying to tell the backend that Latin1
was Unicode or some such. So I'm a bit worried that extending the
backend support to full 32-bit Unicode will do more to mask encoding
mistakes than it will do to create needed functionality.
I think you will find that this impression is actually false. Or that at
the very least, _correct_ verification of UTF-8 sequences will still
catch essentially all cases of non-utf-8 input mislabelled as utf-8
while allowing the full range of Unicode codepoints. (The current check
will report the "characters above 0x10000" error even on input which is
blatantly not utf-8 at all.)
One of UTF-8's nicest properties is that other encodings are almost never
also valid utf-8. I did some tests on this myself some years ago, feeding
hundreds of thousands of short non-utf-8 strings (taken from Usenet
subject lines in non-english-speaking hierarchies) into a utf-8 decoder.
The false accept rate was on the order of 0.01%, and going back and
re-checking my old data, _none_ of the incorrectly detected sequences
would have been interpreted as characters over 0xFFFF.
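For the curious, a rough sketch of that kind of harness (a reconstruction, not
the original code). The check shown here is purely structural, verifying only
that each lead byte is followed by the right number of continuation bytes, yet
even this loose test rejects almost all text in other encodings that has been
mislabelled as utf-8.

/* Rough sketch of the experiment described above (not the original
 * harness).  Reads candidate strings, one per line, and counts how
 * many pass a purely structural UTF-8 check. */
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

static bool
utf8_structurally_valid(const unsigned char *s, size_t len)
{
    size_t  i = 0;

    while (i < len)
    {
        unsigned char   c = s[i];
        size_t          follow, j;

        if (c < 0x80)
            follow = 0;         /* plain ASCII */
        else if ((c & 0xE0) == 0xC0)
            follow = 1;         /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0)
            follow = 2;         /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0)
            follow = 3;         /* 11110xxx */
        else
            return false;       /* stray continuation or invalid lead byte */

        if (i + 1 + follow > len)
            return false;       /* sequence runs off the end of the string */

        for (j = 1; j <= follow; j++)
            if ((s[i + j] & 0xC0) != 0x80)
                return false;   /* expected a continuation byte */

        i += 1 + follow;
    }
    return true;
}

int
main(void)
{
    char    line[4096];
    long    total = 0, accepted = 0;

    while (fgets(line, sizeof(line), stdin) != NULL)
    {
        size_t  len = strcspn(line, "\n");

        total++;
        if (utf8_structurally_valid((const unsigned char *) line, len))
            accepted++;
    }
    printf("%ld of %ld candidate strings accepted as UTF-8\n",
           accepted, total);
    return 0;
}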
--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
Andrew - Supernews <andrew+nonews@supernews.com> writes:
On 2005-04-10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The impression I get is that most of the 'Unicode characters above
0x10000' reports we've seen did not come from people who actually needed
more-than-16-bit Unicode codepoints, but from people who had screwed up
their encoding settings and were trying to tell the backend that Latin1
was Unicode or some such.
I think you will find that this impression is actually false. Or that at
the very least, _correct_ verification of UTF-8 sequences will still
catch essentially all cases of non-utf-8 input mislabelled as utf-8
while allowing the full range of Unicode codepoints.
Yeah? Cool. Does John's proposed patch do it "correctly"?
http://candle.pha.pa.us/mhonarc/patches2/msg00076.html
regards, tom lane
On 2005-04-10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew - Supernews <andrew+nonews@supernews.com> writes:
I think you will find that this impression is actually false. Or that at
the very least, _correct_ verification of UTF-8 sequences will still
catch essentially all cases of non-utf-8 input mislabelled as utf-8
while allowing the full range of Unicode codepoints.
Yeah? Cool. Does John's proposed patch do it "correctly"?
It looks correct to me. The only thing I think that code will let through
incorrectly is encoded surrogates; those could be fixed by adding one line:
switch (*source) {
/* no fall-through in this inner switch */
case 0xE0: if (a < 0xA0) return false; break;
+ case 0xED: if (a > 0x9F) return false; break;
case 0xF0: if (a < 0x90) return false; break;
case 0xF4: if (a > 0x8F) return false; break;
(Accepting encoded surrogates in utf-8 was always forbidden by most
specifications that used utf-8, though the Unicode specs originally were
not absolute about it (but forbade generating them). Current Unicode
specifications define those sequences as malformed. Surrogates are the
code points from 0xD800 - 0xDFFF, which are used in UTF-16 to encode
characters 0x10000 - 0x10FFFF as two 16-bit values; UTF-8 requires that
such characters are encoded directly rather than via surrogate pairs.)
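Spelled out, the constraints on the first continuation byte look roughly like
this (a sketch of what the switch enforces, not the patch itself):

/* Sketch only, not the patch.  Assuming the lead byte has already been
 * verified to be legal, the first continuation byte `a' must fall in a
 * narrower range than the generic 0x80-0xBF for these lead bytes:
 *   0xE0: a >= 0xA0  rejects overlong encodings of U+0000 - U+07FF
 *   0xED: a <= 0x9F  rejects encoded surrogates U+D800 - U+DFFF
 *   0xF0: a >= 0x90  rejects overlong encodings of U+0000 - U+FFFF
 *   0xF4: a <= 0x8F  rejects code points above U+10FFFF */
#include <stdbool.h>

static bool
utf8_first_continuation_ok(unsigned char lead, unsigned char a)
{
    if ((a & 0xC0) != 0x80)
        return false;           /* not a continuation byte at all */

    switch (lead)
    {
        case 0xE0:
            return a >= 0xA0;
        case 0xED:
            return a <= 0x9F;
        case 0xF0:
            return a >= 0x90;
        case 0xF4:
            return a <= 0x8F;
        default:
            return true;        /* any 0x80-0xBF continuation is fine */
    }
}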
--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
On 2005-04-10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew - Supernews <andrew+nonews@supernews.com> writes:
I think you will find that this impression is actually false. Or that at
the very least, _correct_ verification of UTF-8 sequences will still
catch essentially all cases of non-utf-8 input mislabelled as utf-8
while allowing the full range of Unicode codepoints.
Yeah? Cool. Does John's proposed patch do it "correctly"?
It looks correct to me. The only thing I think that code will let through
incorrectly is encoded surrogates; those could be fixed by adding one line:
switch (*source) {
/* no fall-through in this inner switch */
case 0xE0: if (a < 0xA0) return false; break;
+ case 0xED: if (a > 0x9F) return false; break;
case 0xF0: if (a < 0x90) return false; break;
case 0xF4: if (a > 0x8F) return false; break;
That's right, dunno how I missed that one, but it looks correct to me, and
is in line with the code in ConvertUTF.c from unicode.org, on which I
based the patch, extended to support 6-byte utf8 characters.
(Accepting encoded surrogates in utf-8 was always forbidden by most
specifications that used utf-8, though the Unicode specs originally were
not absolute about it (but forbade generating them). Current Unicode
specifications define those sequences as malformed. Surrogates are the
code points from 0xD800 - 0xDFFF, which are used in UTF-16 to encode
characters 0x10000 - 0x10FFFF as two 16-bit values; UTF-8 requires that
such characters are encoded directly rather than via surrogate pairs.)
--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
... John
Tom Lane wrote:
Yeah? Cool. Does John's proposed patch do it "correctly"?
Some comments on that patch:
Doesn't pg_utf2wchar_with_len need changes for the longer sequences?
UtfToLocal also appears to need changes.
If we support sequences >4 bytes (>U+10FFFF), then UtfToLocal/LocalToUtf
and the associated translation tables need a redesign as they currently
assume the sequence fits in an unsigned int. (IIRC, Unicode doesn't use
U+10FFFF, but UTF-8 can encode it?)
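If I read the scheme right (an assumption on my part, not something stated
above), the map key is the raw UTF-8 byte sequence packed into a single
32-bit value, so the limit falls out directly:

/* Assumption for illustration: the conversion-map key is the raw UTF-8
 * byte sequence packed into one 32-bit value, e.g. E3 81 82 ->
 * 0x00E38182.  A 4-byte sequence (anything up to U+10FFFF) exactly
 * fills the 32 bits; a 5- or 6-byte sequence cannot be packed this
 * way, which is the redesign problem mentioned above. */
#include <stdint.h>

static uint32_t
pack_utf8_key(const unsigned char *s, int len)
{
    uint32_t    key = 0;
    int         i;

    /* only works for len <= 4; longer sequences overflow the key */
    for (i = 0; i < len; i++)
        key = (key << 8) | s[i];
    return key;
}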
-O
On 2005-04-10, "John Hansen" <john@geeknet.com.au> wrote:
That's right, dunno how I missed that one, but it looks correct to me, and
is in line with the code in ConvertUTF.c from unicode.org, on which I
based the patch, extended to support 6-byte utf8 characters.
Frankly, you should probably de-extend it back down to 4 bytes. That's
enough to encode the Unicode range of 0x000000 - 0x10FFFF, and enough
other stuff would break if anyone allocated a character outside that
range that I don't think it is worth worrying about. (Even the ISO
people have agreed to conform to that limitation.) Even if insanity
struck simultaneously at both standards bodies, 4 bytes is enough to
go to 0x1FFFFF so there is still substantial slack. (A number of other
specifications based on utf-8 have removed the 5 and 6 byte sequences
too, so there is substantial precedent for this.)
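For reference, the arithmetic behind those limits (standard UTF-8 bit
counting, nothing specific to the patch):

/* The lead byte of an n-byte sequence carries 7 payload bits for
 * n == 1 and (7 - n) bits otherwise; each continuation byte carries 6.
 *   1 byte  -> U+00007F      4 bytes -> U+1FFFFF (Unicode stops at U+10FFFF)
 *   2 bytes -> U+0007FF      5 bytes -> U+3FFFFFF  (disallowed by RFC 3629)
 *   3 bytes -> U+00FFFF      6 bytes -> U+7FFFFFFF (disallowed by RFC 3629) */
#include <stdint.h>

static uint32_t
utf8_max_codepoint(int nbytes)          /* nbytes in 1..6 */
{
    int     payload_bits = (nbytes == 1) ? 7 : (7 - nbytes) + 6 * (nbytes - 1);

    return (uint32_t) ((1ULL << payload_bits) - 1);
}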
--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services