UNICODE characters above 0x10000
I've started work on a patch for this problem.
Doing regression tests at present.
I'll get back when done.
Regards,
John
Attached, as promised, is a small patch removing the limitation and
adding correct UTF-8 validation.
Regards,
John
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of John Hansen
Sent: Friday, August 06, 2004 2:20 PM
To: 'Hackers'
Subject: [HACKERS] UNICODE characters above 0x10000
I've started work on a patch for this problem.
Doing regression tests at present.
I'll get back when done.
Regards,
John
Attachments:
wchar.c.patch (application/octet-stream; +51/-2)
"John Hansen" <john@geeknet.com.au> writes:
Attached, as promised, is a small patch removing the limitation and
adding correct UTF-8 validation.
Surely this is badly broken --- it will happily access data outside the
bounds of the given string. Also, doesn't pg_mblen already know the
length rules for UTF8? Why are you duplicating that knowledge?
regards, tom lane
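The out-of-bounds complaint is about reading continuation bytes past the end of the input. A minimal sketch of the safe pattern, assuming a pg_utf_mblen()-style length function (the names here are hypothetical, not the actual wchar.c code):

#include <stddef.h>
#include <stdbool.h>

/* Hypothetical sketch, not the wchar.c code: walk a byte string one
 * multibyte character at a time without reading past its end.
 * mb_len() stands in for a pg_utf_mblen()-style length function. */
static bool
walk_mb_string(const unsigned char *s, size_t len,
               int (*mb_len) (const unsigned char *))
{
    size_t      i = 0;

    while (i < len)
    {
        int         seq = mb_len(s + i);

        /* Bounds check first: all continuation bytes of the current
         * character must still lie inside the buffer. */
        if (seq <= 0 || (size_t) seq > len - i)
            return false;
        i += seq;
    }
    return true;
}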
My apologies for not reading the code properly.
Attached is a patch using pg_utf_mblen() instead of an indexed table.
It now also does bounds checks.
Regards,
John Hansen
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Saturday, August 07, 2004 4:37 AM
To: John Hansen
Cc: Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000
"John Hansen" <john@geeknet.com.au> writes:
Attached, as promised, is a small patch removing the limitation and
adding correct UTF-8 validation.
Surely this is badly broken --- it will happily access data outside the
bounds of the given string. Also, doesn't pg_mblen already know the
length rules for UTF8? Why are you duplicating that knowledge?
regards, tom lane
Attachments:
wchar.c.patch (application/octet-stream; +34/-2)
"John Hansen" <john@geeknet.com.au> writes:
My apologies for not reading the code properly.
Attached is a patch using pg_utf_mblen() instead of an indexed table.
It now also does bounds checks.
I think you missed my point. If we don't need this limitation, the
correct patch is simply to delete the whole check (ie, delete lines
827-836 of wchar.c, and for that matter we'd then not need the encoding
local variable). What's really at stake here is whether anything else
breaks if we do that. What else, if anything, assumes that UTF
characters are not more than 2 bytes?
Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says. But I do think you are fixing the code at the wrong level.
regards, tom lane
Ahh, but that's not the case. You cannot just delete the check, since
not all combinations of bytes are valid UTF-8. The bytes 0xFE and 0xFF
never appear in a UTF-8 byte sequence, for instance.
UTF-8 is more than two bytes, btw; up to 6 bytes are used to represent
a UTF-8 character.
The 5- and 6-byte sequences are currently not in use, though.
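As a hedged illustration of that rule (hypothetical code, not the attached patch): a validator has to reject 0xFE and 0xFF outright and verify that every continuation byte has the form 10xxxxxx.

#include <stdbool.h>

/* Hypothetical sketch, not the attached patch: check one UTF-8
 * sequence whose length is already known.  The bytes 0xFE and 0xFF
 * can never appear in UTF-8, and every byte after the first must be
 * a continuation byte of the form 10xxxxxx. */
static bool
utf8_seq_ok(const unsigned char *s, int seq_len)
{
    int         i;

    if (s[0] == 0xFE || s[0] == 0xFF)
        return false;
    for (i = 1; i < seq_len; i++)
    {
        if ((s[i] & 0xC0) != 0x80)      /* not 10xxxxxx */
            return false;
    }
    return true;
}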
I didn't actually notice the difference in UTF-8 width between my
original patch and my last, so attached is an updated patch.
Regards,
John Hansen
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Saturday, August 07, 2004 3:07 PM
To: John Hansen
Cc: Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000
"John Hansen" <john@geeknet.com.au> writes:
My apologies for not reading the code properly.
Attached is a patch using pg_utf_mblen() instead of an indexed table.
It now also does bounds checks.
I think you missed my point. If we don't need this limitation, the
correct patch is simply to delete the whole check (ie, delete lines
827-836 of wchar.c, and for that matter we'd then not need the encoding
local variable). What's really at stake here is whether anything else
breaks if we do that. What else, if anything, assumes that UTF
characters are not more than 2 bytes?
Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says. But I do think you are fixing the code at the wrong level.
regards, tom lane
Attachments:
wchar.c.patch (application/octet-stream; +38/-4)
On Sat, 2004-08-07 at 06:06, Tom Lane wrote:
Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says. But I do think you are fixing the code at the wrong level.
UTF-8 characters can be up to 6 bytes long:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
glibc provides various routines (mb...) for handling Unicode. How many
of our supported platforms don't have these? If there are still some
that don't, wouldn't it be better to use the standard routines where
they do exist?
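A minimal, hedged sketch of the standard C multibyte API alluded to here, assuming a platform that has a UTF-8 locale installed and a 32-bit wchar_t (the locale name and test string are illustrative):

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int
main(void)
{
    const char *s = "\xF0\x90\x80\x80";  /* U+10000 in UTF-8, 4 bytes */
    mbstate_t   st;
    wchar_t     wc;
    size_t      n;

    /* Locale name varies by OS; this assumes one with UTF-8 support. */
    if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
        return 1;
    memset(&st, 0, sizeof st);
    n = mbrtowc(&wc, s, strlen(s), &st);
    if (n == (size_t) -1 || n == (size_t) -2)
        return 1;               /* invalid or incomplete sequence */
    printf("consumed %zu bytes, code point U+%04lX\n",
           n, (unsigned long) wc);
    return 0;
}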
--
Oliver Elphick olly@lfix.co.uk
Isle of Wight http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA 92C8 39E7 280E 3631 3F0E 1EC0 5664 7A2F A543 10EA
========================================
"Be still before the LORD and wait patiently for him;
do not fret when men succeed in their ways, when they
carry out their wicked schemes."
Psalms 37:7
Oliver Elphick <olly@lfix.co.uk> writes:
glibc provides various routines (mb...) for handling Unicode. How many
of our supported platforms don't have these?
Every one that doesn't use glibc. Don't bother proposing a glibc-only
solution (and that's from someone who works for a glibc-only company;
you don't even want to think about the push-back you'll get from other
quarters).
regards, tom lane
On Sat, 7 Aug 2004, Tom Lane wrote:
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says. But I do think you are fixing the code at the wrong level.
I can give some general info about utf-8. This is how it is encoded:
character encoding
------------------- ---------
00000000 - 0000007F: 0xxxxxxx
00000080 - 000007FF: 110xxxxx 10xxxxxx
00000800 - 0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
00010000 - 001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000 - 03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000 - 7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
If the first byte starts with a 1, then the number of leading ones
gives the length of the utf-8 sequence. The rest of the bytes in the
sequence always start with 10 (this makes it possible to look anywhere
in the string and quickly find the start of a character).
This also means that the first byte can never start with 7 or 8 ones;
that is illegal and should be tested for and rejected. So the longest
utf-8 sequence is 6 bytes (and the longest character needs 4 bytes (or
31 bits)).
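That rule translates almost directly into code. A hedged sketch (illustration only, not pg_utf_mblen itself) that counts the leading one bits of the first byte:

/* Hypothetical sketch, not pg_utf_mblen itself: derive the sequence
 * length by counting the leading one bits of the first byte, per the
 * table above.  0 ones is a 1-byte ASCII character; 2..6 ones give a
 * sequence of that many bytes; 1, 7, or 8 ones are illegal (a bare
 * continuation byte, or 0xFE/0xFF). */
static int
utf8_len_from_lead(unsigned char b)
{
    int         ones = 0;
    unsigned char mask = 0x80;

    while (mask != 0 && (b & mask))
    {
        ones++;
        mask >>= 1;
    }
    if (ones == 0)
        return 1;               /* 0xxxxxxx */
    if (ones == 1 || ones > 6)
        return -1;              /* illegal lead byte */
    return ones;                /* 110xxxxx ... 1111110x */
}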
--
/Dennis Björklund
Possibly, since I got it wrong once more...
About to give up, but attached is an updated patch.
Regards,
John Hansen
-----Original Message-----
From: Oliver Elphick [mailto:olly@lfix.co.uk]
Sent: Saturday, August 07, 2004 3:56 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000
On Sat, 2004-08-07 at 06:06, Tom Lane wrote:
Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there
are no UTF8 codes longer than 3 bytes whereas your code goes to 4.
I'm not an expert on this stuff, so I don't know what the UTF8 spec
actually says. But I do think you are fixing the code at the wrong
level.
UTF-8 characters can be up to 6 bytes long:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
glibc provides various routines (mb...) for handling Unicode. How many
of our supported platforms don't have these? If there are still some
that don't, wouldn't it be better to use the standard routines where
they do exist?
--
Oliver Elphick olly@lfix.co.uk
Isle of Wight http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA 92C8 39E7 280E 3631 3F0E 1EC0 5664 7A2F A543 10EA
========================================
"Be still before the LORD and wait patiently for him;
do not fret when men succeed in their ways, when they
carry out their wicked schemes."
Psalms 37:7
Attachments:
wchar.c.patch (application/octet-stream; +38/-4)
Dennis Bjorklund <db@zigo.dhs.org> writes:
... This also means that the start byte can never start with 7 or 8
ones, that is illegal and should be tested for and rejected. So the
longest utf-8 sequence is 6 bytes (and the longest character needs 4
bytes (or 31 bits)).
Tatsuo would know more about this than me, but it looks from here like
our coding was originally designed to support only 16-bit-wide internal
characters (ie, 16-bit pg_wchar datatype width). I believe that the
regex library limitation here is gone, and that as far as that library
is concerned we could assume a 32-bit internal character width. The
question at hand is whether we can support 32-bit characters or not ---
and if not, what's the next bug to fix?
regards, tom lane
On Sat, 7 Aug 2004, Tom Lane wrote:
question at hand is whether we can support 32-bit characters or not ---
and if not, what's the next bug to fix?
True, and that's hard to just give an answer to. One could do some
simple testing, make sure regexps work, and then treat anything else
that might not work as bugs to be fixed later on when found.
The alternative is to inspect all code paths that involve strings, not fun
at all :-)
My previous mail talked about utf-8 translation. Not all characters
that can be formed using utf-8 are assigned by the Unicode org.
However, the part that interprets the Unicode strings is in the OS, so
different OSes can give different results. So I think pg should just
accept even 6-byte utf-8 sequences, even if some characters are not
currently assigned.
--
/Dennis Björklund
This should do it.
Regards,
John Hansen
-----Original Message-----
From: Dennis Bjorklund [mailto:db@zigo.dhs.org]
Sent: Saturday, August 07, 2004 5:02 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000
On Sat, 7 Aug 2004, Tom Lane wrote:
question at hand is whether we can support 32-bit characters or not --- and if not, what's the next bug to fix?
True, and that's hard to just give an answer to. One could do some simple testing, make sure regexps work, and then treat anything else that might not work as bugs to be fixed later on when found.
The alternative is to inspect all code paths that involve strings, not fun at all :-)
My previous mail talked about utf-8 translation. Not all characters that can be formed using utf-8 are assigned by the Unicode org. However, the part that interprets the Unicode strings is in the OS, so different OSes can give different results. So I think pg should just accept even 6-byte utf-8 sequences, even if some characters are not currently assigned.
--
/Dennis Björklund
Attachments:
wchar.c.patch (application/octet-stream; +44/-4)
Dennis Bjorklund <db@zigo.dhs.org> writes:
... This also means that the start byte can never start with 7 or 8
ones, that is illegal and should be tested for and rejected. So the
longest utf-8 sequence is 6 bytes (and the longest character needs 4
bytes (or 31 bits)).
Tatsuo would know more about this than me, but it looks from here like
our coding was originally designed to support only 16-bit-wide internal
characters (ie, 16-bit pg_wchar datatype width). I believe that the
regex library limitation here is gone, and that as far as that library
is concerned we could assume a 32-bit internal character width. The
question at hand is whether we can support 32-bit characters or not ---
and if not, what's the next bug to fix?
pg_wchar is already a 32-bit datatype. However, I doubt there's
actually a need for 32-bit-wide character sets. Even Unicode only
uses up to 0x0010FFFF, so 24 bits should be enough...
--
Tatsuo Ishii
Yes, but the specification allows for 6-byte sequences, or 32-bit
characters.
As Dennis pointed out, just because they're not used doesn't mean we
should not allow them to be stored, since there might be someone using
the high ranges for a private character set, which could very well be
included in the specification some day.
Regards,
John Hansen
-----Original Message-----
From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
Sent: Saturday, August 07, 2004 8:09 PM
To: tgl@sss.pgh.pa.us
Cc: db@zigo.dhs.org; John Hansen; pgsql-hackers@postgresql.org;
pgsql-patches@postgresql.org
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
Dennis Bjorklund <db@zigo.dhs.org> writes:
... This also means that the start byte can never start with 7 or 8
ones, that is illegal and should be tested for and rejected. So the
longest utf-8 sequence is 6 bytes (and the longest character needs 4
bytes (or 31 bits)).
Tatsuo would know more about this than me, but it looks from here like
our coding was originally designed to support only 16-bit-wide
internal characters (ie, 16-bit pg_wchar datatype width). I believe
that the regex library limitation here is gone, and that as far as
that library is concerned we could assume a 32-bit internal character
width. The question at hand is whether we can support 32-bit
characters or not --- and if not, what's the next bug to fix?
pg_wchar is already a 32-bit datatype. However, I doubt there's
actually a need for 32-bit-wide character sets. Even Unicode only uses
up to 0x0010FFFF, so 24 bits should be enough...
--
Tatsuo Ishii
Yes, but the specification allows for 6-byte sequences, or 32-bit
characters.
UTF-8 is just an encoding specification, not a character set
specification. Unicode only has 17 256x256 planes in its
specification.
As Dennis pointed out, just because they're not used doesn't mean we
should not allow them to be stored, since there might be someone using
the high ranges for a private character set, which could very well be
included in the specification some day.
We should expand it to 64-bit, since some day the specification might
be changed then :-)
More seriously, Unicode is filled with tons of confusion and
inconsistency, IMO. Remember that Unicode advocates once said that the
merit of Unicode was that it only required 16-bit width. Now they say
they need surrogate pairs and 32-bit-wide chars...
Anyway, my point is: if the current specification of Unicode only
allows a 24-bit range, why do we need to allow usage against the
specification?
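As a hedged sketch of that restriction (hypothetical helper, not proposed code): 17 planes of 256x256 code points end at 0x10FFFF, so the range check is a single comparison.

#include <stdbool.h>

/* Hypothetical sketch: 17 planes * 65536 code points = 0x110000
 * values, so the last legal Unicode code point is 0x10FFFF.  A
 * stricter check would also reject the surrogates 0xD800..0xDFFF. */
static bool
codepoint_in_unicode_range(unsigned int cp)
{
    return cp <= 0x10FFFF;
}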
--
Tatsuo Ishii
Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says. But I do think you are fixing the code at the wrong level.
Surely there are UTF-8 codes that are at least 3 bytes. I have a
_vague_ recollection that you have to keep escaping and escaping to get
up to like 4 bytes for some Asian code points?
Chris
4 actually,
10FFFF needs four bytes:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
10FFFF = 00010000 11111111 11111111
Fill in the blanks, starting from the bottom, and you get:
11110100 10001111 10111111 10111111
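That bit-fill can be checked mechanically. A small hedged sketch (hypothetical helper) that encodes a code point in the 4-byte range and prints the bytes:

#include <stdio.h>

/* Hypothetical sketch: encode a code point in the 4-byte UTF-8 range
 * (0x10000..0x1FFFFF) as 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. */
static void
encode4(unsigned int cp, unsigned char out[4])
{
    out[0] = 0xF0 | ((cp >> 18) & 0x07);
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
}

int
main(void)
{
    unsigned char b[4];

    encode4(0x10FFFF, b);
    printf("%02x %02x %02x %02x\n", b[0], b[1], b[2], b[3]);
    /* prints: f4 8f bf bf, matching the bit-fill above */
    return 0;
}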
Regards,
John Hansen
-----Original Message-----
From: Christopher Kings-Lynne [mailto:chriskl@familyhealth.com.au]
Sent: Saturday, August 07, 2004 8:47 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000
Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there
are no UTF8 codes longer than 3 bytes whereas your code goes to 4.
I'm not an expert on this stuff, so I don't know what the UTF8 spec
actually says. But I do think you are fixing the code at the wrong
level.
Surely there are UTF-8 codes that are at least 3 bytes. I have a
_vague_ recollection that you have to keep escaping and escaping to get
up to like 4 bytes for some Asian code points?
Chris
On Sat, 7 Aug 2004, John Hansen wrote:
should not allow them to be stored, since there might be someone using
the high ranges for a private character set, which could very well be
included in the specification some day.
There are areas reserved for private character sets.
--
/Dennis Björklund
Well, maybe we'd be better off compiling a list of (in?)valid ranges
from the full Unicode database
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and
http://www.unicode.org/Public/UNIDATA/Unihan.txt)
and, with every release of pg, updating the detection logic so only
valid characters are allowed?
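A hedged sketch of what that detection could look like (the table entries are illustrative, not the real Unicode data): keep the assigned ranges sorted, binary-search them per code point, and regenerate the table from UnicodeData.txt at each release.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: a sorted table of assigned code point ranges,
 * regenerated from UnicodeData.txt for each release.  The entries
 * below are illustrative, not the real data. */
typedef struct
{
    unsigned int lo;
    unsigned int hi;
} cp_range;

static const cp_range assigned_ranges[] = {
    {0x000000, 0x00007F},       /* illustrative entry */
    {0x0000A0, 0x00D7FF},       /* illustrative entry */
    {0x00E000, 0x00FFFD},       /* illustrative entry */
    {0x010000, 0x01000B}        /* illustrative entry */
};

static bool
codepoint_is_assigned(unsigned int cp)
{
    size_t      lo = 0;
    size_t      hi = sizeof(assigned_ranges) / sizeof(assigned_ranges[0]);

    while (lo < hi)             /* binary search over the ranges */
    {
        size_t      mid = (lo + hi) / 2;

        if (cp < assigned_ranges[mid].lo)
            hi = mid;
        else if (cp > assigned_ranges[mid].hi)
            lo = mid + 1;
        else
            return true;
    }
    return false;
}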
Regards,
John Hansen
-----Original Message-----
From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
Sent: Saturday, August 07, 2004 8:46 PM
To: John Hansen
Cc: tgl@sss.pgh.pa.us; db@zigo.dhs.org; pgsql-hackers@postgresql.org;
pgsql-patches@postgresql.org
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
Yes, but the specification allows for 6-byte sequences, or 32-bit
characters.
UTF-8 is just an encoding specification, not a character set
specification. Unicode only has 17 256x256 planes in its specification.
As Dennis pointed out, just because they're not used doesn't mean we
should not allow them to be stored, since there might be someone using
the high ranges for a private character set, which could very well be
included in the specification some day.
We should expand it to 64-bit, since some day the specification might
be changed then :-)
More seriously, Unicode is filled with tons of confusion and
inconsistency, IMO. Remember that Unicode advocates once said that the
merit of Unicode was that it only required 16-bit width. Now they say
they need surrogate pairs and 32-bit-wide chars...
Anyway, my point is: if the current specification of Unicode only
allows a 24-bit range, why do we need to allow usage against the
specification?
--
Tatsuo Ishii