UNICODE characters above 0x10000

Started by John Hansen over 21 years ago · 36 messages · hackers
#1 John Hansen
john@geeknet.com.au

I've started work on a patch for this problem.

Doing regression tests at present.

I'll get back when done.

Regards,

John

#2 John Hansen
john@geeknet.com.au
In reply to: John Hansen (#1)
Re: UNICODE characters above 0x10000

Attached, as promised, small patch removing the limitation, adding
correct utf8 validation.

Regards,

John


Attachments:

wchar.c.patch (application/octet-stream), +51/-2
#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Hansen (#2)
Re: UNICODE characters above 0x10000

"John Hansen" <john@geeknet.com.au> writes:

Attached, as promised, small patch removing the limitation, adding
correct utf8 validation.

Surely this is badly broken --- it will happily access data outside the
bounds of the given string. Also, doesn't pg_mblen already know the
length rules for UTF8? Why are you duplicating that knowledge?

regards, tom lane

#4 John Hansen
john@geeknet.com.au
In reply to: Tom Lane (#3)
Re: UNICODE characters above 0x10000

My apologies for not reading the code properly.

Attached patch using pg_utf_mblen() instead of an indexed table.
It now also does bounds checks.

Regards,

John Hansen


Attachments:

wchar.c.patch (application/octet-stream), +34/-2
#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Hansen (#4)
Re: UNICODE characters above 0x10000

"John Hansen" <john@geeknet.com.au> writes:

My apologies for not reading the code properly.

Attached patch using pg_utf_mblen() instead of an indexed table.
It now also does bounds checks.

I think you missed my point. If we don't need this limitation, the
correct patch is simply to delete the whole check (ie, delete lines
827-836 of wchar.c, and for that matter we'd then not need the encoding
local variable). What's really at stake here is whether anything else
breaks if we do that. What else, if anything, assumes that UTF
characters are not more than 2 bytes?

Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says. But I do think you are fixing the code at the wrong level.

regards, tom lane

#6 John Hansen
john@geeknet.com.au
In reply to: Tom Lane (#5)
Re: UNICODE characters above 0x10000

Ahh, but that's not the case. You cannot just delete the check, since
not all combinations of bytes are valid UTF-8. The bytes FE and FF never
appear in a UTF-8 byte sequence, for instance.
UTF-8 is more than two bytes btw; up to 6 bytes are used to represent a
UTF-8 character, though the 5- and 6-byte sequences are currently not in
use.

I didn't actually notice the difference in UTF-8 width between my
original patch and my last, so attached is an updated patch.

Regards,

John Hansen


Attachments:

wchar.c.patch (application/octet-stream), +38/-4
#7 Oliver Elphick
olly@lfix.co.uk
In reply to: Tom Lane (#5)
Re: UNICODE characters above 0x10000

On Sat, 2004-08-07 at 06:06, Tom Lane wrote:

Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says. But I do think you are fixing the code at the wrong level.

UTF-8 characters can be up to 6 bytes long:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

glibc provides various routines (mb...) for handling Unicode. How many
of our supported platforms don't have these? If there are still some
that don't, wouldn't it be better to use the standard routines where
they do exist?

--
Oliver Elphick olly@lfix.co.uk
Isle of Wight http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA 92C8 39E7 280E 3631 3F0E 1EC0 5664 7A2F A543 10EA
========================================
"Be still before the LORD and wait patiently for him;
do not fret when men succeed in their ways, when they
carry out their wicked schemes."
Psalms 37:7

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Oliver Elphick (#7)
Re: UNICODE characters above 0x10000

Oliver Elphick <olly@lfix.co.uk> writes:

glibc provides various routines (mb...) for handling Unicode. How many
of our supported platforms don't have these?

Every one that doesn't use glibc. Don't bother proposing a glibc-only
solution (and that's from someone who works for a glibc-only company;
you don't even want to think about the push-back you'll get from other
quarters).

regards, tom lane

#9 Dennis Bjorklund
db@zigo.dhs.org
In reply to: Tom Lane (#5)
Re: UNICODE characters above 0x10000

On Sat, 7 Aug 2004, Tom Lane wrote:

shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says. But I do think you are fixing the code at the wrong level.

I can give some general info about utf-8. This is how it is encoded:

character encoding
------------------- ---------
00000000 - 0000007F: 0xxxxxxx
00000080 - 000007FF: 110xxxxx 10xxxxxx
00000800 - 0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
00010000 - 001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000 - 03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000 - 7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

If the first byte starts with a 1, then the number of leading ones gives
the length of the utf-8 sequence, and the rest of the bytes in the
sequence always start with 10 (this makes it possible to look anywhere
in the string and quickly find the start of a character).

This also means that the start byte can never begin with 7 or 8 ones;
that is illegal and should be tested for and rejected. So the longest
utf-8 sequence is 6 bytes (and the longest character needs 4 bytes (or
31 bits)).
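The length rule described above can be sketched in C. This is a minimal illustration following the original up-to-6-byte definition quoted in this thread; the helper names (`utf8_seq_len`, `utf8_seq_valid`) are made up for this sketch and are not PostgreSQL's actual wchar.c API:

```c
#include <assert.h>
#include <stddef.h>

/* Number of leading 1 bits in the first byte gives the sequence
 * length; a lone continuation byte (10xxxxxx) or FE/FF is invalid. */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    if ((lead & 0xFC) == 0xF8) return 5;  /* 111110xx */
    if ((lead & 0xFE) == 0xFC) return 6;  /* 1111110x */
    return -1;                            /* 10xxxxxx, FE, or FF */
}

/* Validate one sequence within a buffer of `remaining` bytes,
 * including the bounds check discussed earlier in the thread:
 * a sequence truncated at the end of the string is rejected
 * instead of being read past. */
static int utf8_seq_valid(const unsigned char *s, size_t remaining)
{
    int len = utf8_seq_len(s[0]);

    if (len < 0 || (size_t) len > remaining)
        return 0;
    for (int i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)        /* must be 10xxxxxx */
            return 0;
    return 1;
}
```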

--
/Dennis Björklund

#10 John Hansen
john@geeknet.com.au
In reply to: Dennis Bjorklund (#9)
Re: UNICODE characters above 0x10000

Possibly, since I got it wrong once more...
About to give up, but attached is an updated patch.

Regards,

John Hansen


Attachments:

wchar.c.patch (application/octet-stream), +38/-4
#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dennis Bjorklund (#9)
Re: UNICODE characters above 0x10000

Dennis Bjorklund <db@zigo.dhs.org> writes:

... This also means that the start byte can never start with 7 or 8
ones, that is illegal and should be tested for and rejected. So the
longest utf-8 sequence is 6 bytes (and the longest character needs 4
bytes (or 31 bits)).

Tatsuo would know more about this than me, but it looks from here like
our coding was originally designed to support only 16-bit-wide internal
characters (ie, 16-bit pg_wchar datatype width). I believe that the
regex library limitation here is gone, and that as far as that library
is concerned we could assume a 32-bit internal character width. The
question at hand is whether we can support 32-bit characters or not ---
and if not, what's the next bug to fix?

regards, tom lane

#12 Dennis Bjorklund
db@zigo.dhs.org
In reply to: Tom Lane (#11)
Re: UNICODE characters above 0x10000

On Sat, 7 Aug 2004, Tom Lane wrote:

question at hand is whether we can support 32-bit characters or not ---
and if not, what's the next bug to fix?

True, and that's hard to just give an answer to. One could do some simple
testing, make sure regexps work, and then treat anything else that might
not work as bugs to be fixed later on, when found.

The alternative is to inspect all code paths that involve strings, not fun
at all :-)

My previous mail talked about utf-8 translation. Not all characters that
are possible to form using utf-8 are assigned by the unicode org.
However, the part that interprets the unicode strings is in the OS, so
different OSes can give different results. So I think pg should just
accept even 6-byte utf-8 sequences, even if some characters are not
currently assigned.

--
/Dennis Björklund

#13 John Hansen
john@geeknet.com.au
In reply to: Dennis Bjorklund (#12)
Re: UNICODE characters above 0x10000

This should do it.

Regards,

John Hansen


Attachments:

wchar.c.patch (application/octet-stream), +44/-4
#14 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#11)
Re: [PATCHES] UNICODE characters above 0x10000

Dennis Bjorklund <db@zigo.dhs.org> writes:

... This also means that the start byte can never start with 7 or 8
ones, that is illegal and should be tested for and rejected. So the
longest utf-8 sequence is 6 bytes (and the longest character needs 4
bytes (or 31 bits)).

Tatsuo would know more about this than me, but it looks from here like
our coding was originally designed to support only 16-bit-wide internal
characters (ie, 16-bit pg_wchar datatype width). I believe that the
regex library limitation here is gone, and that as far as that library
is concerned we could assume a 32-bit internal character width. The
question at hand is whether we can support 32-bit characters or not ---
and if not, what's the next bug to fix?

pg_wchar is already a 32-bit datatype. However, I doubt there's
actually a need for 32-bit-wide character sets. Even Unicode only
uses up to 0x0010FFFF, so 24 bits should be enough...
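The limit mentioned here can be expressed as a small check. A sketch with a hypothetical helper name (not PostgreSQL code), assuming the Unicode range of 17 planes (U+0000 through U+10FFFF) plus the UTF-16 surrogate gap at D800-DFFF:

```c
#include <assert.h>

/* Accept only code points Unicode can actually assign:
 * at most U+10FFFF, and never the UTF-16 surrogate range. */
static int unicode_codepoint_valid(unsigned int cp)
{
    if (cp > 0x10FFFF)
        return 0;                        /* beyond plane 16 */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                        /* surrogates, not characters */
    return 1;
}
```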
--
Tatsuo Ishii

#15 John Hansen
john@geeknet.com.au
In reply to: Tatsuo Ishii (#14)
Re: [PATCHES] UNICODE characters above 0x10000

Yes, but the specification allows for 6-byte sequences, or 32-bit
characters.
As Dennis pointed out, just because they're not used doesn't mean we
should not allow them to be stored, since there might be someone using
the high ranges for a private character set, which could very well be
included in the specification some day.

Regards,

John Hansen


#16 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: John Hansen (#15)
Re: [PATCHES] UNICODE characters above 0x10000

Yes, but the specification allows for 6-byte sequences, or 32-bit
characters.

UTF-8 is just an encoding specification, not a character set
specification. Unicode only has 17 256x256 planes in its
specification.

As Dennis pointed out, just because they're not used doesn't mean we
should not allow them to be stored, since there might be someone using
the high ranges for a private character set, which could very well be
included in the specification some day.

We should expand it to 64-bit, since some day the specification might
be changed then :-)

More seriously, Unicode is filled with tons of confusion and
inconsistency IMO. Remember that once Unicode advocates said that the
merit of Unicode was that it only requires 16-bit width. Now they say
they need surrogate pairs and 32-bit-wide chars...

Anyway, my point is: if the current specification of Unicode only allows
a 24-bit range, why do we need to allow usage against the specification?
--
Tatsuo Ishii

#17 Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Tom Lane (#5)
Re: UNICODE characters above 0x10000

Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says. But I do think you are fixing the code at the wrong level.

Surely there are UTF-8 codes that are at least 3 bytes. I have a
_vague_ recollection that you have to keep escaping and escaping to get
up to like 4 bytes for some Asian code points?

Chris

#18 John Hansen
john@geeknet.com.au
In reply to: Christopher Kings-Lynne (#17)
Re: UNICODE characters above 0x10000

4 actually,
10FFFF needs four bytes:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
10FFFF = 00010000 11111111 11111111

Fill in the blanks, starting from the bottom, and you get:
11110100 10001111 10111111 10111111
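That fill-in is just shifts and masks. A sketch in C with a made-up helper name (`utf8_encode4`), not the patch's actual code, packing a code point into the four-byte pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx:

```c
#include <assert.h>

/* Encode a code point in the 4-byte UTF-8 form: 3 bits in the lead
 * byte, then three 6-bit groups in continuation bytes. */
static void utf8_encode4(unsigned int cp, unsigned char out[4])
{
    out[0] = 0xF0 | ((cp >> 18) & 0x07);  /* top 3 bits */
    out[1] = 0x80 | ((cp >> 12) & 0x3F);  /* next 6 bits */
    out[2] = 0x80 | ((cp >> 6) & 0x3F);   /* next 6 bits */
    out[3] = 0x80 | (cp & 0x3F);          /* low 6 bits */
}
```

For U+10FFFF this produces the byte sequence F4 8F BF BF, matching the fill-in above.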

Regards,

John Hansen


#19 Dennis Bjorklund
db@zigo.dhs.org
In reply to: John Hansen (#15)
Re: [PATCHES] UNICODE characters above 0x10000

On Sat, 7 Aug 2004, John Hansen wrote:

should not allow them to be stored, since there might be someone using
the high ranges for a private character set, which could very well be
included in the specification some day.

There are areas reserved for private character sets.

--
/Dennis Björklund

#20 John Hansen
john@geeknet.com.au
In reply to: Dennis Bjorklund (#19)
Re: [PATCHES] UNICODE characters above 0x10000

Well, maybe we'd be better off compiling a list of (in?)valid ranges
from the full unicode database
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and
http://www.unicode.org/Public/UNIDATA/Unihan.txt)
and, with every release of pg, updating the detection logic so only
valid characters are allowed?

Regards,

John Hansen


#21 John Hansen
john@geeknet.com.au
In reply to: John Hansen (#20)
#22 Dennis Bjorklund
db@zigo.dhs.org
In reply to: Tatsuo Ishii (#16)
#23 Dennis Bjorklund
db@zigo.dhs.org
In reply to: Dennis Bjorklund (#22)
#24 John Hansen
john@geeknet.com.au
In reply to: Dennis Bjorklund (#23)
#25 Dennis Bjorklund
db@zigo.dhs.org
In reply to: John Hansen (#24)
#26 John Hansen
john@geeknet.com.au
In reply to: Dennis Bjorklund (#25)
#27 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dennis Bjorklund (#22)
#28 Oliver Elphick
olly@lfix.co.uk
In reply to: Tom Lane (#8)
#29 John Hansen
john@geeknet.com.au
In reply to: Oliver Elphick (#28)
#30 John Hansen
john@geeknet.com.au
In reply to: John Hansen (#29)
#31 Oliver Jowett
oliver@opencloud.com
In reply to: Tom Lane (#27)
#32 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Oliver Jowett (#31)
#33 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Oliver Jowett (#31)
#34 Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Hansen (#6)
#35 Oliver Jowett
oliver@opencloud.com
In reply to: Tatsuo Ishii (#33)
#36 John Hansen
john@geeknet.com.au
In reply to: Oliver Jowett (#35)