UNICODE characters above 0x10000

Started by John Hansen over 21 years ago · 36 messages
#1John Hansen
john@geeknet.com.au

I've started work on a patch for this problem.

Doing regression tests at present.

I'll get back when done.

Regards,

John

#2John Hansen
john@geeknet.com.au
In reply to: John Hansen (#1)
1 attachment(s)
Re: UNICODE characters above 0x10000

Attached, as promised, small patch removing the limitation, adding
correct utf8 validation.

Regards,

John

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of John Hansen
Sent: Friday, August 06, 2004 2:20 PM
To: 'Hackers'
Subject: [HACKERS] UNICODE characters above 0x10000

I've started work on a patch for this problem.

Doing regression tests at present.

I'll get back when done.

Regards,

John

Attachments:

wchar.c.patch (application/octet-stream)
diff -ruN postgresql-7.5-old/src/backend/utils/mb/wchar.c postgresql-7.5/src/backend/utils/mb/wchar.c
--- postgresql-7.5-old/src/backend/utils/mb/wchar.c	2004-08-06 22:44:26.000000000 +1000
+++ postgresql-7.5/src/backend/utils/mb/wchar.c		2004-08-07 01:19:39.000000000 +1000
@@ -801,6 +801,53 @@
 
 #ifndef FRONTEND
 
+/* --------------------------------------------------------------------- */
+
+/*
+ * Index into the table below with the first byte of a UTF-8 sequence to
+ * get the number of trailing bytes that are supposed to follow it.
+ */
+
+static const char trailingBytesForUTF8[256] = {
+	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
+	2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
+};
+
+/* --------------------------------------------------------------------- */
+
+/*
+ * Utility routine to tell whether a sequence of bytes is legal UTF-8.
+ */
+
+static unsigned char isLegalUTF8(const unsigned char *source) {
+	int length = trailingBytesForUTF8[*source]+1;
+	unsigned char a;
+	const unsigned char *srcptr = source+length;
+	switch (length) {
+	default: return false;
+		/* Everything else falls through when "true"... */
+	case 4: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+	case 3: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+	case 2: if ((a = (*--srcptr)) > 0xBF) return false;
+		switch (*source) {
+		    /* no fall-through in this inner switch */
+		    case 0xE0: if (a < 0xA0) return false; break;
+		    case 0xF0: if (a < 0x90) return false; break;
+		    case 0xF4: if (a > 0x8F) return false; break;
+		    default:  if (a < 0x80) return false;
+		}
+    	case 1: if (*source >= 0x80 && *source < 0xC2) return false;
+		if (*source > 0xF4) return false;
+	}
+	return true;
+}
+
 /*
  * Verify mbstr to make sure that it has a valid character sequence.
  * mbstr is not necessarily NULL terminated; length of mbstr is
@@ -825,14 +872,16 @@
 	while (len > 0 && *mbstr)
 	{
 		/* special UTF-8 check */
-		if (encoding == PG_UTF8 && (*mbstr & 0xf8) == 0xf0)
+		if (encoding == PG_UTF8 && !isLegalUTF8(mbstr))
 		{
 			if (noError)
 				return false;
 			ereport(ERROR,
 					(errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
-			 errmsg("Unicode characters greater than or equal to 0x10000 are not supported")));
+			 errmsg("Invalid UNICODE byte sequence detected")));
 		}
+		if (encoding == PG_UTF8)
+		    return true;
 
 		l = pg_mblen(mbstr);
 
#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Hansen (#2)
Re: UNICODE characters above 0x10000

"John Hansen" <john@geeknet.com.au> writes:

Attached, as promised, small patch removing the limitation, adding
correct utf8 validation.

Surely this is badly broken --- it will happily access data outside the
bounds of the given string. Also, doesn't pg_mblen already know the
length rules for UTF8? Why are you duplicating that knowledge?

regards, tom lane

#4John Hansen
john@geeknet.com.au
In reply to: Tom Lane (#3)
1 attachment(s)
Re: UNICODE characters above 0x10000

My apologies for not reading the code properly.

Attached patch using pg_utf_mblen() instead of an indexed table.
It now also does bounds checks.

Regards,

John Hansen

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Saturday, August 07, 2004 4:37 AM
To: John Hansen
Cc: Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

"John Hansen" <john@geeknet.com.au> writes:

Attached, as promised, small patch removing the limitation, adding
correct utf8 validation.

Surely this is badly broken --- it will happily access data outside the
bounds of the given string. Also, doesn't pg_mblen already know the
length rules for UTF8? Why are you duplicating that knowledge?

regards, tom lane

Attachments:

wchar.c.patch (application/octet-stream)
diff -ruN postgresql-7.5-old/src/backend/utils/mb/wchar.c postgresql-7.5/src/backend/utils/mb/wchar.c
--- postgresql-7.5-old/src/backend/utils/mb/wchar.c     2004-08-06 22:44:26.000000000 +1000
+++ postgresql-7.5/src/backend/utils/mb/wchar.c         2004-08-07 01:19:39.000000000 +1000
@@ -801,6 +801,36 @@
 
 #ifndef FRONTEND
 
+/* --------------------------------------------------------------------- */
+
+/*
+ * Utility routine to tell whether a sequence of bytes is legal UTF-8.
+ */
+
+static unsigned char isLegalUTF8(const unsigned char *source, int srclen) {
+	int length = pg_utf_mblen(source);
+	if(length > srclen) return false;
+	unsigned char a;
+	const unsigned char *srcptr = source+length;
+	switch (length) {
+	default: return false;
+		/* Everything else falls through when "true"... */
+	case 4: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+	case 3: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+	case 2: if ((a = (*--srcptr)) > 0xBF) return false;
+		switch (*source) {
+		    /* no fall-through in this inner switch */
+		    case 0xE0: if (a < 0xA0) return false; break;
+		    case 0xF0: if (a < 0x90) return false; break;
+		    case 0xF4: if (a > 0x8F) return false; break;
+		    default:  if (a < 0x80) return false;
+		}
+    	case 1: if (*source >= 0x80 && *source < 0xC2) return false;
+		if (*source > 0xF4) return false;
+	}
+	return true;
+}
+
 /*
  * Verify mbstr to make sure that it has a valid character sequence.
  * mbstr is not necessarily NULL terminated; length of mbstr is
@@ -825,14 +855,16 @@
 	while (len > 0 && *mbstr)
 	{
 		/* special UTF-8 check */
-		if (encoding == PG_UTF8 && (*mbstr & 0xf8) == 0xf0)
+		if (encoding == PG_UTF8 && !isLegalUTF8(mbstr,len))
 		{
 			if (noError)
 				return false;
 			ereport(ERROR,
 					(errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
-			 errmsg("Unicode characters greater than or equal to 0x10000 are not supported")));
+			 errmsg("Invalid UNICODE byte sequence detected")));
 		}
+		if (encoding == PG_UTF8)
+		    return true;
 
 		l = pg_mblen(mbstr);
 
#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Hansen (#4)
Re: UNICODE characters above 0x10000

"John Hansen" <john@geeknet.com.au> writes:

My apologies for not reading the code properly.

Attached patch using pg_utf_mblen() instead of an indexed table.
It now also does bounds checks.

I think you missed my point. If we don't need this limitation, the
correct patch is simply to delete the whole check (ie, delete lines
827-836 of wchar.c, and for that matter we'd then not need the encoding
local variable). What's really at stake here is whether anything else
breaks if we do that. What else, if anything, assumes that UTF
characters are not more than 2 bytes?

Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says. But I do think you are fixing the code at the wrong level.

regards, tom lane

#6John Hansen
john@geeknet.com.au
In reply to: Tom Lane (#5)
1 attachment(s)
Re: UNICODE characters above 0x10000

Ahh, but that's not the case. You cannot just delete the check, since
not all combinations of bytes are valid UTF8. UTF bytes FE & FF never
appear in a byte sequence for instance.
UTF8 is more than two bytes btw; up to 6 bytes are used to represent a
UTF8 character.
The 5 and 6 byte characters are currently not in use tho.

I didn't actually notice the difference in UTF8 width between my
original patch and my last, so attached is an updated patch.

Regards,

John Hansen

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Saturday, August 07, 2004 3:07 PM
To: John Hansen
Cc: Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

"John Hansen" <john@geeknet.com.au> writes:

My apologies for not reading the code properly.

Attached patch using pg_utf_mblen() instead of an indexed table.
It now also does bounds checks.

I think you missed my point. If we don't need this limitation, the
correct patch is simply to delete the whole check (ie, delete lines
827-836 of wchar.c, and for that matter we'd then not need the encoding
local variable). What's really at stake here is whether anything else
breaks if we do that. What else, if anything, assumes that UTF
characters are not more than 2 bytes?

Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says. But I do think you are fixing the code at the wrong level.

regards, tom lane

Attachments:

wchar.c.patch (application/octet-stream)
diff -ruN postgresql-7.5-old/src/backend/utils/mb/wchar.c postgresql-7.5/src/backend/utils/mb/wchar.c
--- postgresql-7.5-old/src/backend/utils/mb/wchar.c     2004-08-06 22:44:26.000000000 +1000
+++ postgresql-7.5/src/backend/utils/mb/wchar.c         2004-08-07 01:19:39.000000000 +1000
@@ -404,10 +404,12 @@
 
 	if ((*s & 0x80) == 0)
 		len = 1;
-	else if ((*s & 0xe0) == 0xc0)
+	else if ((*s & 0xf0) == 0xc0)
 		len = 2;
-	else if ((*s & 0xe0) == 0xe0)
+	else if ((*s & 0xf0) == 0xe0)
 		len = 3;
+	else if ((*s & 0xf0) == 0xf0)
+		len = 4;
 	return (len);
 }
 
@@ -801,6 +803,36 @@
 
 #ifndef FRONTEND
 
+/* --------------------------------------------------------------------- */
+
+/*
+ * Utility routine to tell whether a sequence of bytes is legal UTF-8.
+ */
+
+static unsigned char isLegalUTF8(const unsigned char *source, int srclen) {
+	int length = pg_utf_mblen(source);
+	if(length > srclen) return false;
+	unsigned char a;
+	const unsigned char *srcptr = source+length;
+	switch (length) {
+	default: return false;
+		/* Everything else falls through when "true"... */
+	case 4: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+	case 3: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+	case 2: if ((a = (*--srcptr)) > 0xBF) return false;
+		switch (*source) {
+		    /* no fall-through in this inner switch */
+		    case 0xE0: if (a < 0xA0) return false; break;
+		    case 0xF0: if (a < 0x90) return false; break;
+		    case 0xF4: if (a > 0x8F) return false; break;
+		    default:  if (a < 0x80) return false;
+		}
+    	case 1: if (*source >= 0x80 && *source < 0xC2) return false;
+		if (*source > 0xF4) return false;
+	}
+	return true;
+}
+
 /*
  * Verify mbstr to make sure that it has a valid character sequence.
  * mbstr is not necessarily NULL terminated; length of mbstr is
@@ -825,14 +857,16 @@
 	while (len > 0 && *mbstr)
 	{
 		/* special UTF-8 check */
-		if (encoding == PG_UTF8 && (*mbstr & 0xf8) == 0xf0)
+		if (encoding == PG_UTF8 && !isLegalUTF8(mbstr,len))
 		{
 			if (noError)
 				return false;
 			ereport(ERROR,
 					(errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
-			 errmsg("Unicode characters greater than or equal to 0x10000 are not supported")));
+			 errmsg("Invalid UNICODE byte sequence detected")));
 		}
+		if (encoding == PG_UTF8)
+		    return true;
 
 		l = pg_mblen(mbstr);
 
#7Oliver Elphick
olly@lfix.co.uk
In reply to: Tom Lane (#5)
Re: UNICODE characters above 0x10000

On Sat, 2004-08-07 at 06:06, Tom Lane wrote:

Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says. But I do think you are fixing the code at the wrong level.

UTF-8 characters can be up to 6 bytes long:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

glibc provides various routines (mb...) for handling Unicode. How many
of our supported platforms don't have these? If there are still some
that don't, wouldn't it be better to use the standard routines where
they do exist?

--
Oliver Elphick olly@lfix.co.uk
Isle of Wight http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA 92C8 39E7 280E 3631 3F0E 1EC0 5664 7A2F A543 10EA
========================================
"Be still before the LORD and wait patiently for him;
do not fret when men succeed in their ways, when they
carry out their wicked schemes."
Psalms 37:7

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Oliver Elphick (#7)
Re: UNICODE characters above 0x10000

Oliver Elphick <olly@lfix.co.uk> writes:

glibc provides various routines (mb...) for handling Unicode. How many
of our supported platforms don't have these?

Every one that doesn't use glibc. Don't bother proposing a glibc-only
solution (and that's from someone who works for a glibc-only company;
you don't even want to think about the push-back you'll get from other
quarters).

regards, tom lane

#9Dennis Bjorklund
db@zigo.dhs.org
In reply to: Tom Lane (#5)
Re: UNICODE characters above 0x10000

On Sat, 7 Aug 2004, Tom Lane wrote:

shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says. But I do think you are fixing the code at the wrong level.

I can give some general info about utf-8. This is how it is encoded:

character encoding
------------------- ---------
00000000 - 0000007F: 0xxxxxxx
00000080 - 000007FF: 110xxxxx 10xxxxxx
00000800 - 0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
00010000 - 001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000 - 03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000 - 7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

If the first byte starts with a 1 then the number of ones gives the
length of the utf-8 sequence. The rest of the bytes in the sequence
always start with 10 (this makes it possible to look anywhere in the
string and quickly find the start of a character).

This also means that the start byte can never start with 7 or 8 ones, that
is illegal and should be tested for and rejected. So the longest utf-8
sequence is 6 bytes (and the longest character needs 4 bytes (or 31
bits)).
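
For illustration, a minimal standalone C sketch of the length rule described
above: the count of leading one bits in the first byte gives the sequence
length, and continuation bytes always match 10xxxxxx. The helper name
utf8_seq_len is hypothetical, not a PostgreSQL function.

#include <stdio.h>

static int utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;    /* 0xxxxxxx */
    if ((lead & 0xe0) == 0xc0) return 2;    /* 110xxxxx */
    if ((lead & 0xf0) == 0xe0) return 3;    /* 1110xxxx */
    if ((lead & 0xf8) == 0xf0) return 4;    /* 11110xxx */
    if ((lead & 0xfc) == 0xf8) return 5;    /* 111110xx */
    if ((lead & 0xfe) == 0xfc) return 6;    /* 1111110x */
    return -1;                              /* 0xFE and 0xFF never start a sequence */
}

int main(void)
{
    /* expected output: 1 3 4 -1 */
    printf("%d %d %d %d\n", utf8_seq_len(0x41), utf8_seq_len(0xE3),
           utf8_seq_len(0xF4), utf8_seq_len(0xFF));
    return 0;
}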

--
/Dennis Björklund

#10John Hansen
john@geeknet.com.au
In reply to: Dennis Bjorklund (#9)
1 attachment(s)
Re: UNICODE characters above 0x10000

Possibly, since I got it wrong once more...
About to give up, but attached is an updated patch.

Regards,

John Hansen

-----Original Message-----
From: Oliver Elphick [mailto:olly@lfix.co.uk]
Sent: Saturday, August 07, 2004 3:56 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

On Sat, 2004-08-07 at 06:06, Tom Lane wrote:

Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there
are no UTF8 codes longer than 3 bytes whereas your code goes to 4.
I'm not an expert on this stuff, so I don't know what the UTF8 spec
actually says. But I do think you are fixing the code at the wrong
level.

UTF-8 characters can be up to 6 bytes long:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

glibc provides various routines (mb...) for handling Unicode. How many
of our supported platforms don't have these? If there are still some
that don't, wouldn't it be better to use the standard routines where
they do exist?

--
Oliver Elphick olly@lfix.co.uk
Isle of Wight http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA 92C8 39E7 280E 3631 3F0E 1EC0 5664 7A2F A543 10EA
========================================
"Be still before the LORD and wait patiently for him;
do not fret when men succeed in their ways, when they
carry out their wicked schemes."
Psalms 37:7

Attachments:

wchar.c.patch (application/octet-stream)
diff -ruN postgresql-7.5-old/src/backend/utils/mb/wchar.c postgresql-7.5/src/backend/utils/mb/wchar.c
--- postgresql-7.5-old/src/backend/utils/mb/wchar.c     2004-08-06 22:44:26.000000000 +1000
+++ postgresql-7.5/src/backend/utils/mb/wchar.c         2004-08-07 01:19:39.000000000 +1000
@@ -404,10 +404,12 @@
 
 	if ((*s & 0x80) == 0)
 		len = 1;
-	else if ((*s & 0xe0) == 0xc0)
+	else if ((*s & 0xe0) == 0xc0)
 		len = 2;
-	else if ((*s & 0xe0) == 0xe0)
+	else if ((*s & 0xf0) == 0xe0)
 		len = 3;
+	else if ((*s & 0xf8) == 0xf0)
+		len = 4;
 	return (len);
 }
 
@@ -801,6 +803,36 @@
 
 #ifndef FRONTEND
 
+/* --------------------------------------------------------------------- */
+
+/*
+ * Utility routine to tell whether a sequence of bytes is legal UTF-8.
+ */
+
+static unsigned char isLegalUTF8(const unsigned char *source, int srclen) {
+	int length = pg_utf_mblen(source);
+	if(length > srclen) return false;
+	unsigned char a;
+	const unsigned char *srcptr = source+length;
+	switch (length) {
+	default: return false;
+		/* Everything else falls through when "true"... */
+	case 4: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+	case 3: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+	case 2: if ((a = (*--srcptr)) > 0xBF) return false;
+		switch (*source) {
+		    /* no fall-through in this inner switch */
+		    case 0xE0: if (a < 0xA0) return false; break;
+		    case 0xF0: if (a < 0x90) return false; break;
+		    case 0xF4: if (a > 0x8F) return false; break;
+		    default:  if (a < 0x80) return false;
+		}
+    	case 1: if (*source >= 0x80 && *source < 0xC2) return false;
+		if (*source > 0xF4) return false;
+	}
+	return true;
+}
+
 /*
  * Verify mbstr to make sure that it has a valid character sequence.
  * mbstr is not necessarily NULL terminated; length of mbstr is
@@ -825,14 +857,16 @@
 	while (len > 0 && *mbstr)
 	{
 		/* special UTF-8 check */
-		if (encoding == PG_UTF8 && (*mbstr & 0xf8) == 0xf0)
+		if (encoding == PG_UTF8 && !isLegalUTF8(mbstr,len))
 		{
 			if (noError)
 				return false;
 			ereport(ERROR,
 					(errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
-			 errmsg("Unicode characters greater than or equal to 0x10000 are not supported")));
+			 errmsg("Invalid UNICODE byte sequence detected")));
 		}
+		if (encoding == PG_UTF8)
+		    return true;
 
 		l = pg_mblen(mbstr);
 
#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dennis Bjorklund (#9)
Re: UNICODE characters above 0x10000

Dennis Bjorklund <db@zigo.dhs.org> writes:

... This also means that the start byte can never start with 7 or 8
ones, that is illegal and should be tested for and rejected. So the
longest utf-8 sequence is 6 bytes (and the longest character needs 4
bytes (or 31 bits)).

Tatsuo would know more about this than me, but it looks from here like
our coding was originally designed to support only 16-bit-wide internal
characters (ie, 16-bit pg_wchar datatype width). I believe that the
regex library limitation here is gone, and that as far as that library
is concerned we could assume a 32-bit internal character width. The
question at hand is whether we can support 32-bit characters or not ---
and if not, what's the next bug to fix?

regards, tom lane

#12Dennis Bjorklund
db@zigo.dhs.org
In reply to: Tom Lane (#11)
Re: UNICODE characters above 0x10000

On Sat, 7 Aug 2004, Tom Lane wrote:

question at hand is whether we can support 32-bit characters or not ---
and if not, what's the next bug to fix?

True, and that's hard to just give an answer to. One could do some simple
testing, make sure regexps work and then treat anything else that might
not work, as bugs to be fixed later on when found.

The alternative is to inspect all code paths that involve strings, not fun
at all :-)

My previous mail talked about utf-8 translation. Not all characters
possible to form using utf-8 are assigned by the unicode org. However,
the part that interprets the unicode strings is in the OS, so different
OSes can give different results. So I think pg should just accept even 6
byte utf-8 sequences even if some characters are not currently assigned.

--
/Dennis Björklund

#13John Hansen
john@geeknet.com.au
In reply to: Dennis Bjorklund (#12)
1 attachment(s)
Re: UNICODE characters above 0x10000

This should do it.

Regards,

John Hansen

-----Original Message-----
From: Dennis Bjorklund [mailto:db@zigo.dhs.org]
Sent: Saturday, August 07, 2004 5:02 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

On Sat, 7 Aug 2004, Tom Lane wrote:

question at hand is whether we can support 32-bit characters or not 
--- and if not, what's the next bug to fix?

True, and that's hard to just give an answer to. One could do some simple testing, make sure regexps work and then treat anything else that might not work, as bugs to be fixed later on when found.

The alternative is to inspect all code paths that involve strings, not fun at all :-)

My previous mail talked about utf-8 translation. Not all characters possible to form using utf-8 are assigned by the unicode org. However, the part that interprets the unicode strings is in the OS, so different OSes can give different results. So I think pg should just accept even 6 byte utf-8 sequences even if some characters are not currently assigned.

--
/Dennis Björklund

Attachments:

wchar.c.patch (application/octet-stream)
diff -ruN postgresql-7.5-old/src/backend/utils/mb/wchar.c postgresql-7.5/src/backend/utils/mb/wchar.c
--- postgresql-7.5-old/src/backend/utils/mb/wchar.c     2004-08-06 22:44:26.000000000 +1000
+++ postgresql-7.5/src/backend/utils/mb/wchar.c         2004-08-07 01:19:39.000000000 +1000
@@ -406,8 +406,14 @@
 		len = 1;
 	else if ((*s & 0xe0) == 0xc0)
 		len = 2;
-	else if ((*s & 0xe0) == 0xe0)
-		len = 3;
+        else if ((*s & 0xf0) == 0xe0)
+                len = 3;
+        else if ((*s & 0xf8) == 0xf0)
+                len = 4;
+        else if ((*s & 0xfc) == 0xf8)
+                len = 5;
+        else if ((*s & 0xfe) == 0xfc)
+                len = 6;
 	return (len);
 }
 
@@ -801,6 +801,38 @@
 
 #ifndef FRONTEND
 
+/* --------------------------------------------------------------------- */
+
+/*
+ * Utility routine to tell whether a sequence of bytes is legal UTF-8.
+ */
+  
+static unsigned char isLegalUTF8(const unsigned char *source, int srclen) {
+    int length = pg_utf_mblen(source);
+    if(length > srclen) return false;
+    unsigned char a;
+    const unsigned char *srcptr = source+length;
+    switch (length) {
+	default: return false;
+	/* Everything else falls through when "true"... */
+	case 6: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+	case 5: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+	case 4: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+	case 3: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+	case 2: if ((a = (*--srcptr)) > 0xBF) return false;
+	switch (*source) {
+	    /* no fall-through in this inner switch */
+	    case 0xE0: if (a < 0xA0) return false; break;
+	    case 0xF0: if (a < 0x90) return false; break;
+	    case 0xF4: if (a > 0x8F) return false; break;
+	    default:  if (a < 0x80) return false;
+	}
+	case 1: if (*source >= 0x80 && *source < 0xC2) return false;
+	if (*source > 0xFD) return false;
+    }
+    return true;
+}
+
 /*
  * Verify mbstr to make sure that it has a valid character sequence.
  * mbstr is not necessarily NULL terminated; length of mbstr is
@@ -825,14 +863,16 @@
 	while (len > 0 && *mbstr)
 	{
 		/* special UTF-8 check */
-		if (encoding == PG_UTF8 && (*mbstr & 0xf8) == 0xf0)
+		if (encoding == PG_UTF8 && !isLegalUTF8(mbstr,len))
 		{
 			if (noError)
 				return false;
 			ereport(ERROR,
 					(errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
-			 errmsg("Unicode characters greater than or equal to 0x10000 are not supported")));
+			 errmsg("Invalid UNICODE byte sequence detected")));
 		}
+		if (encoding == PG_UTF8)
+		    return true;
 
 		l = pg_mblen(mbstr);
 
#14Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#11)
Re: [PATCHES] UNICODE characters above 0x10000

Dennis Bjorklund <db@zigo.dhs.org> writes:

... This also means that the start byte can never start with 7 or 8
ones, that is illegal and should be tested for and rejected. So the
longest utf-8 sequence is 6 bytes (and the longest character needs 4
bytes (or 31 bits)).

Tatsuo would know more about this than me, but it looks from here like
our coding was originally designed to support only 16-bit-wide internal
characters (ie, 16-bit pg_wchar datatype width). I believe that the
regex library limitation here is gone, and that as far as that library
is concerned we could assume a 32-bit internal character width. The
question at hand is whether we can support 32-bit characters or not ---
and if not, what's the next bug to fix?

pg_wchar is already a 32-bit datatype. However, I doubt there's
actually a need for 32-bit-wide character sets. Even Unicode only
uses up to 0x0010FFFF, so 24-bit should be enough...
--
Tatsuo Ishii

#15John Hansen
john@geeknet.com.au
In reply to: Tatsuo Ishii (#14)
Re: [PATCHES] UNICODE characters above 0x10000

Yes, but the specification allows for 6-byte sequences, or 32-bit
characters.
As Dennis pointed out, just because they're not used doesn't mean we
should not allow them to be stored, since there might be someone using
the high ranges for a private character set, which could very well be
included in the specification some day.

Regards,

John Hansen

-----Original Message-----
From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
Sent: Saturday, August 07, 2004 8:09 PM
To: tgl@sss.pgh.pa.us
Cc: db@zigo.dhs.org; John Hansen; pgsql-hackers@postgresql.org;
pgsql-patches@postgresql.org
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

Dennis Bjorklund <db@zigo.dhs.org> writes:

... This also means that the start byte can never start with 7 or 8
ones, that is illegal and should be tested for and rejected. So the
longest utf-8 sequence is 6 bytes (and the longest character needs 4
bytes (or 31 bits)).

Tatsuo would know more about this than me, but it looks from here like
our coding was originally designed to support only 16-bit-wide
internal characters (ie, 16-bit pg_wchar datatype width). I believe
that the regex library limitation here is gone, and that as far as
that library is concerned we could assume a 32-bit internal character
width. The question at hand is whether we can support 32-bit
characters or not --- and if not, what's the next bug to fix?

pg_wchar is already a 32-bit datatype. However, I doubt there's
actually a need for 32-bit-wide character sets. Even Unicode only uses
up to 0x0010FFFF, so 24-bit should be enough...
--
Tatsuo Ishii

#16Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: John Hansen (#15)
Re: [PATCHES] UNICODE characters above 0x10000

Yes, but the specification allows for 6-byte sequences, or 32-bit
characters.

UTF-8 is just an encoding specification, not character set
specification. Unicode only has 17 256x256 planes in its
specification.

As Dennis pointed out, just because they're not used doesn't mean we
should not allow them to be stored, since there might be someone using
the high ranges for a private character set, which could very well be
included in the specification some day.

We should expand it to 64-bit since some day the specification might
be changed then:-)

More seriously, Unicode is filled with tons of confusion and
inconsistency IMO. Remember that once Unicode advocates said that the
merit of Unicode was that it only requires 16-bit width. Now they say they
need surrogate pairs and 32-bit width chars...

Anyway my point is if current specification of Unicode only allows
24-bit range, why we need to allow usage against the specification?
--
Tatsuo Ishii

#17Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Tom Lane (#5)
Re: UNICODE characters above 0x10000

Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says. But I do think you are fixing the code at the wrong level.

Surely there are UTF-8 codes that are at least 3 bytes. I have a
_vague_ recollection that you have to keep escaping and escaping to get
up to like 4 bytes for some asian code points?

Chris

#18John Hansen
john@geeknet.com.au
In reply to: Christopher Kings-Lynne (#17)
Re: UNICODE characters above 0x10000

4 actually,
10FFFF needs four bytes:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
10FFFF = 00010000 11111111 11111111

Fill in the blanks, starting from the bottom, you get:
11110100 10001111 10111111 10111111
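
For illustration, a small standalone C sketch that performs the same
fill-in; encode_utf8 is a hypothetical helper, not part of any patch in
this thread. Encoding 0x10FFFF yields the bytes F4 8F BF BF.

#include <stdio.h>

/*
 * Sketch only: encode one code point (up to 0x10FFFF) as UTF-8 using the
 * bit layout shown above.  Returns the number of bytes written to out[].
 */
static int encode_utf8(unsigned int cp, unsigned char *out)
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char) (0xc0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp < 0x10000)
    {
        out[0] = (unsigned char) (0xe0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char) (0x80 | (cp & 0x3f));
        return 3;
    }
    out[0] = (unsigned char) (0xf0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
    out[3] = (unsigned char) (0x80 | (cp & 0x3f));
    return 4;
}

int main(void)
{
    unsigned char buf[4];
    int n = encode_utf8(0x10FFFF, buf);
    int i;

    /* expected output: F4 8F BF BF */
    for (i = 0; i < n; i++)
        printf("%02X ", buf[i]);
    printf("\n");
    return 0;
}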

Regards,

John Hansen

-----Original Message-----
From: Christopher Kings-Lynne [mailto:chriskl@familyhealth.com.au]
Sent: Saturday, August 07, 2004 8:47 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there
are no UTF8 codes longer than 3 bytes whereas your code goes to 4.
I'm not an expert on this stuff, so I don't know what the UTF8 spec
actually says. But I do think you are fixing the code at the wrong
level.

Surely there are UTF-8 codes that are at least 3 bytes. I have a
_vague_ recollection that you have to keep escaping and escaping to get
up to like 4 bytes for some asian code points?

Chris

#19Dennis Bjorklund
db@zigo.dhs.org
In reply to: John Hansen (#15)
Re: [PATCHES] UNICODE characters above 0x10000

On Sat, 7 Aug 2004, John Hansen wrote:

should not allow them to be stored, since there might be someone using
the high ranges for a private character set, which could very well be
included in the specification some day.

There are areas reserved for private character sets.

--
/Dennis Björklund

#20John Hansen
john@geeknet.com.au
In reply to: Dennis Bjorklund (#19)
Re: [PATCHES] UNICODE characters above 0x10000

Well, maybe we'd be better off, compiling a list of (in?)valid ranges
from the full unicode database
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and
http://www.unicode.org/Public/UNIDATA/Unihan.txt)
and with every release of pg, update the detection logic so only valid
characters are allowed?

Regards,

John Hansen

-----Original Message-----
From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
Sent: Saturday, August 07, 2004 8:46 PM
To: John Hansen
Cc: tgl@sss.pgh.pa.us; db@zigo.dhs.org; pgsql-hackers@postgresql.org;
pgsql-patches@postgresql.org
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

Yes, but the specification allows for 6-byte sequences, or 32-bit
characters.

UTF-8 is just an encoding specification, not character set
specification. Unicode only has 17 256x256 planes in its specification.

As Dennis pointed out, just because they're not used doesn't mean we
should not allow them to be stored, since there might be someone using
the high ranges for a private character set, which could very well be
included in the specification some day.

We should expand it to 64-bit since some day the specification might be
changed then:-)

More seriously, Unicode is filled with tons of confusion and
inconsistency IMO. Remember that once Unicode advocates said that the
merit of Unicode was that it only requires 16-bit width. Now they say they
need surrogate pairs and 32-bit width chars...

Anyway my point is if current specification of Unicode only allows
24-bit range, why we need to allow usage against the specification?
--
Tatsuo Ishii

#21John Hansen
john@geeknet.com.au
In reply to: John Hansen (#20)
Re: [PATCHES] UNICODE characters above 0x10000

Yeah, I know...

F0000 - 10FFFF : 2 separate planes iirc

... John

-----Original Message-----
From: Dennis Bjorklund [mailto:db@zigo.dhs.org]
Sent: Saturday, August 07, 2004 9:06 PM
To: John Hansen
Cc: Tatsuo Ishii; tgl@sss.pgh.pa.us; pgsql-hackers@postgresql.org; pgsql-patches@postgresql.org
Subject: RE: [PATCHES] [HACKERS] UNICODE characters above 0x10000

On Sat, 7 Aug 2004, John Hansen wrote:

should not allow them to be stored, since there might be someone using
the high ranges for a private character set, which could very well be
included in the specification some day.

There are areas reserved for private character sets.

--
/Dennis Björklund

#22Dennis Bjorklund
db@zigo.dhs.org
In reply to: Tatsuo Ishii (#16)
Re: [PATCHES] UNICODE characters above 0x10000

On Sat, 7 Aug 2004, Tatsuo Ishii wrote:

More seriously, Unicode is filled with tons of confusion and
inconsistency IMO. Remember that once Unicode advocates said that the
merit of Unicode was that it only requires 16-bit width. Now they say they
need surrogate pairs and 32-bit width chars...

Anyway my point is if current specification of Unicode only allows
24-bit range, why we need to allow usage against the specification?

Whatever problems they have had in the past, ISO 10646 formally defines
a 31-bit character set. Are you saying that applications should
reject strings that contain characters they do not recognize?

Is there a specific reason you want to restrict it to 24 bits? In practice
it does not matter much since it's not used today, I just don't know why
you want it.

--
/Dennis Björklund

#23Dennis Bjorklund
db@zigo.dhs.org
In reply to: Dennis Bjorklund (#22)
Re: [PATCHES] UNICODE characters above 0x10000

On Sat, 7 Aug 2004, Takehiko Abe wrote:

It looked like you sent the last mail only to me and not the list. I
assume it was a mistake, so I send the reply to both.

Is there a specific reason you want to restrict it to 24 bits?

ISO 10646 is said to have removed its private use codepoints outside of
the Unicode 0 - 10FFFF range to ensure the compatibility with Unicode.

see Section C.2 and C.3 of Unicode 4.0 Appendix C "Relationship to ISO
10646": <http://www.unicode.org/versions/Unicode4.0.0/appC.pdf&gt;.

The one and only reason for allowing 31 bit is that it's defined by iso
10646. In practice there is probably no one that uses the upper part of
10646 so not supporting it will most likely not hurt anyone.

I'm happy either way so I will put my voice on letting PG use unicode (not
ISO 10646) and restrict it to 24 bits. By the time someone wants (if ever)
iso 10646 we probably have support for different charsets and can easily
handle both at the same time.

--
/Dennis Björklund

#24John Hansen
john@geeknet.com.au
In reply to: Dennis Bjorklund (#23)
Re: [PATCHES] UNICODE characters above 0x10000

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of
Dennis Bjorklund
Sent: Saturday, August 07, 2004 10:48 PM
To: Takehiko Abe
Cc: pgsql-hackers@postgresql.org
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

On Sat, 7 Aug 2004, Takehiko Abe wrote:

It looked like you sent the last mail only to me and not the
list. I assume it was a mistake, so I send the reply to both.

Is there a specific reason you want to restrict it to 24 bits?

ISO 10646 is said to have removed its private use codepoints outside
of the Unicode 0 - 10FFFF range to ensure the compatibility with Unicode.

see Section C.2 and C.3 of Unicode 4.0 Appendix C "Relationship to ISO
10646": <http://www.unicode.org/versions/Unicode4.0.0/appC.pdf>.

The one and only reason for allowing 31 bit is that it's
defined by iso 10646. In practice there is probably no one
that uses the upper part of
10646 so not supporting it will most likely not hurt anyone.

I'm happy either way so I will put my voice on letting PG use
unicode (not ISO 10646) and restrict it to 24 bits. By the
time someone wants (if ever) iso 10646 we probably have
support for different charsets and can easily handle both at
the same time.

Point taken.
Since we're supporting UTF8, and not ISO 10646.

Now, is it really 24 bits tho?
Afaict, it's really 21 (0 - 10FFFF or 0 - xxx10000 11111111 11111111)

This would require that we support 4 byte sequences
(11110100 10001111 10111111 10111111 = 10FFFF)

--
/Dennis Björklund

Regards,

John Hansen

#25Dennis Bjorklund
db@zigo.dhs.org
In reply to: John Hansen (#24)
Re: [PATCHES] UNICODE characters above 0x10000

On Sat, 7 Aug 2004, John Hansen wrote:

Now, is it really 24 bits tho?
Afaict, it's really 21 (0 - 10FFFF or 0 - xxx10000 11111111 11111111)

Yes, up to 0x10ffff should be enough.

The 24 is not really important, this is all about what utf-8 strings to
accept as input. The strings are stored as utf-8 strings and when
processed inside pg it uses wchar_t that is 32 bit (on some systems at
least). By restricting the utf-8 input to unicode we can in the future
store each character as 3 bytes if we want.
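
For illustration only, a standalone C sketch of the 4-byte decode step
implied by a 32-bit internal character; decode_utf8 is a hypothetical
helper, not the actual pg_utf2wchar_with_len, and it assumes the sequence
was already validated.

#include <stdio.h>

/* Sketch only: decode one UTF-8 sequence of 1 to 4 bytes into a 32-bit
 * code point. */
static unsigned int decode_utf8(const unsigned char *s, int len)
{
    switch (len)
    {
        case 1:
            return s[0];
        case 2:
            return ((unsigned int) (s[0] & 0x1f) << 6) | (s[1] & 0x3f);
        case 3:
            return ((unsigned int) (s[0] & 0x0f) << 12) |
                   ((unsigned int) (s[1] & 0x3f) << 6) |
                   (s[2] & 0x3f);
        case 4:
            return ((unsigned int) (s[0] & 0x07) << 18) |
                   ((unsigned int) (s[1] & 0x3f) << 12) |
                   ((unsigned int) (s[2] & 0x3f) << 6) |
                   (s[3] & 0x3f);
        default:
            return 0;
    }
}

int main(void)
{
    const unsigned char seq[] = { 0xF4, 0x8F, 0xBF, 0xBF };

    /* expected output: U+10FFFF */
    printf("U+%06X\n", decode_utf8(seq, 4));
    return 0;
}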

--
/Dennis Björklund

#26John Hansen
john@geeknet.com.au
In reply to: Dennis Bjorklund (#25)
1 attachment(s)
Re: [PATCHES] UNICODE characters above 0x10000

-----Original Message-----
From: Dennis Bjorklund [mailto:db@zigo.dhs.org]
Sent: Saturday, August 07, 2004 11:23 PM
To: John Hansen
Cc: Takehiko Abe; pgsql-hackers@postgresql.org
Subject: RE: [PATCHES] [HACKERS] UNICODE characters above 0x10000

On Sat, 7 Aug 2004, John Hansen wrote:

Now, is it really 24 bits tho?
Afaict, it's really 21 (0 - 10FFFF or 0 - xxx10000 11111111
11111111)

Yes, up to 0x10ffff should be enough.

The 24 is not really important, this is all about what utf-8
strings to accept as input. The strings are stored as utf-8
strings and when processed inside pg it uses wchar_t that is
32 bit (on some systems at least). By restricting the utf-8
input to unicode we can in the future store each character as
3 bytes if we want.

Which brings us back to something like the attached...

--
/Dennis Björklund

Regards,

John Hansen

Attachments:

wchar.c.patch (application/octet-stream)
diff -ruN postgresql-7.5-old/src/backend/utils/mb/wchar.c postgresql-7.5/src/backend/utils/mb/wchar.c
--- postgresql-7.5-old/src/backend/utils/mb/wchar.c     2004-08-06 22:44:26.000000000 +1000
+++ postgresql-7.5/src/backend/utils/mb/wchar.c         2004-08-07 01:19:39.000000000 +1000
@@ -406,8 +406,14 @@
 		len = 1;
 	else if ((*s & 0xe0) == 0xc0)
 		len = 2;
-	else if ((*s & 0xe0) == 0xe0)
-		len = 3;
+        else if ((*s & 0xf0) == 0xe0)
+                len = 3;
+        else if ((*s & 0xf8) == 0xf0)
+                len = 4;
+        else if ((*s & 0xfc) == 0xf8)
+                len = 5;
+        else if ((*s & 0xfe) == 0xfc)
+                len = 6;
 	return (len);
 }
 
@@ -801,6 +801,36 @@
 
 #ifndef FRONTEND
 
+/* --------------------------------------------------------------------- */
+
+/*
+ * Utility routine to tell whether a sequence of bytes is legal UTF-8.
+ */
+  
+static unsigned char isLegalUTF8(const unsigned char *source, int srclen) {
+    int length = pg_utf_mblen(source);
+    if(length > srclen || length > 4) return false;
+    unsigned char a;
+    const unsigned char *srcptr = source+length;
+    switch (length) {
+	default: return false;
+	/* Everything else falls through when "true"... */
+	case 4: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+	case 3: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+	case 2: if ((a = (*--srcptr)) > 0xBF) return false;
+	switch (*source) {
+	    /* no fall-through in this inner switch */
+	    case 0xE0: if (a < 0xA0) return false; break;
+	    case 0xF0: if (a < 0x90) return false; break;
+	    case 0xF4: if (a > 0x8F) return false; break;
+	    default:  if (a < 0x80) return false;
+	}
+	case 1: if (*source >= 0x80 && *source < 0xC2) return false;
+	if (*source > 0xF4) return false;
+    }
+    return true;
+}
+
 /*
  * Verify mbstr to make sure that it has a valid character sequence.
  * mbstr is not necessarily NULL terminated; length of mbstr is
@@ -825,14 +863,16 @@
 	while (len > 0 && *mbstr)
 	{
 		/* special UTF-8 check */
-		if (encoding == PG_UTF8 && (*mbstr & 0xf8) == 0xf0)
+		if (encoding == PG_UTF8 && !isLegalUTF8(mbstr,len))
 		{
 			if (noError)
 				return false;
 			ereport(ERROR,
 					(errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
-			 errmsg("Unicode characters greater than or equal to 0x10000 are not supported")));
+			 errmsg("Invalid UNICODE byte sequence detected")));
 		}
+		if (encoding == PG_UTF8)
+		    return true;
 
 		l = pg_mblen(mbstr);
 
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dennis Bjorklund (#22)
Re: [PATCHES] UNICODE characters above 0x10000

Dennis Bjorklund <db@zigo.dhs.org> writes:

On Sat, 7 Aug 2004, Tatsuo Ishii wrote:

Anyway my point is if current specification of Unicode only allows
24-bit range, why we need to allow usage against the specification?

Is there a specific reason you want to restrict it to 24 bits?

I see several places that have to allocate space on the basis of the
maximum encoded character length possible in the current encoding
(look for uses of pg_database_encoding_max_length). Probably the only
one that's really significant for performance is text_substr(), but
that's enough to be an argument against setting maxmblen higher than
we have to.

It looks to me like supporting 4-byte UTF-8 characters would be enough
to handle the existing range of Unicode codepoints, and that is probably
as much as we want to do.

If I understood what I was reading, this would take several things:
* Remove the "special UTF-8 check" in pg_verifymbstr;
* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.

Are there any other places that would have to change? Would this break
anything? The testing aspect is what's bothering me at the moment.

regards, tom lane

#28Oliver Elphick
olly@lfix.co.uk
In reply to: Tom Lane (#8)
Re: UNICODE characters above 0x10000

On Sat, 2004-08-07 at 07:10, Tom Lane wrote:

Oliver Elphick <olly@lfix.co.uk> writes:

glibc provides various routines (mb...) for handling Unicode. How many
of our supported platforms don't have these?

Every one that doesn't use glibc. Don't bother proposing a glibc-only
solution (and that's from someone who works for a glibc-only company;
you don't even want to think about the push-back you'll get from other
quarters).

No, that's not what I was proposing. My suggestion was to use these
routines if they are sufficiently widely implemented, and our own
routines where standard ones are not available.

The man page for mblen says
"CONFORMING TO
ISO/ANSI C, UNIX98"

Is glibc really the only C library to conform?

If using the mb... routines isn't feasible, IBM's ICU library
(http://oss.software.ibm.com/icu/) is available under the X licence,
which is compatible with BSD as far as I can see. Besides character
conversion, ICU can also do collation in various locales and encodings.
My point is, we shouldn't be writing a new set of routines to do half a
job if there are already libraries available to do all of it.

--
Oliver Elphick olly@lfix.co.uk
Isle of Wight http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA 92C8 39E7 280E 3631 3F0E 1EC0 5664 7A2F A543 10EA
========================================
"Be still before the LORD and wait patiently for him;
do not fret when men succeed in their ways, when they
carry out their wicked schemes."
Psalms 37:7

#29John Hansen
john@geeknet.com.au
In reply to: Oliver Elphick (#28)
Re: [PATCHES] UNICODE characters above 0x10000

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Sunday, August 08, 2004 2:43 AM
To: Dennis Bjorklund
Cc: Tatsuo Ishii; John Hansen; pgsql-hackers@postgresql.org;
pgsql-patches@postgresql.org
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

Dennis Bjorklund <db@zigo.dhs.org> writes:

On Sat, 7 Aug 2004, Tatsuo Ishii wrote:

Anyway my point is if current specification of Unicode only allows
24-bit range, why we need to allow usage against the specification?

Is there a specific reason you want to restrict it to 24 bits?

I see several places that have to allocate space on the basis
of the maximum encoded character length possible in the
current encoding (look for uses of
pg_database_encoding_max_length). Probably the only one
that's really significant for performance is text_substr(),
but that's enough to be an argument against setting maxmblen
higher than we have to.

It looks to me like supporting 4-byte UTF-8 characters would
be enough to handle the existing range of Unicode codepoints,
and that is probably as much as we want to do.

If I understood what I was reading, this would take several things:
* Remove the "special UTF-8 check" in pg_verifymbstr;

I strongly disagree; this would mean one could store any sequence of
bytes in the db, as long as the bytes are above 0x80. This would
not be valid utf8 and would leave the data in an inconsistent state.
Setting the client encoding to unicode implies that this is what we're
going to feed the database, and should guarantee that what comes out of
a select is valid utf8. We can make sure of that by doing the check
before it's inserted.

* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;

pg_utf_mblen should handle any case according to the specification.
Currently, it will return 3 even for 4, 5, and 6 byte sequences. In those
places where pg_utf_mblen is called, we should check to make sure that
the length is between 1 and 4 inclusive, and that the sequence is valid.
This is what I made the patch for.

* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.

That I have no problem with.

Are there any other places that would have to change? Would
this break anything? The testing aspect is what's bothering
me at the moment.

regards, tom lane

Just my $0.02 worth,

Kind Regards,

John Hansen

#30John Hansen
john@geeknet.com.au
In reply to: John Hansen (#29)
Re: UNICODE characters above 0x10000

-----Original Message-----
From: Oliver Elphick [mailto:olly@lfix.co.uk]
Sent: Sunday, August 08, 2004 7:43 AM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

On Sat, 2004-08-07 at 07:10, Tom Lane wrote:

Oliver Elphick <olly@lfix.co.uk> writes:

glibc provides various routines (mb...) for handling Unicode. How
many of our supported platforms don't have these?

Every one that doesn't use glibc. Don't bother proposing a glibc-only
solution (and that's from someone who works for a glibc-only company;
you don't even want to think about the push-back you'll get from other
quarters).

No, that's not what I was proposing. My suggestion was to
use these routines if they are sufficiently widely
implemented, and our own routines where standard ones are not
available.

The man page for mblen says
"CONFORMING TO
ISO/ANSI C, UNIX98"

Is glibc really the only C library to conform?

If using the mb... routines isn't feasible, IBM's ICU library
(http://oss.software.ibm.com/icu/) is available under the X
licence, which is compatible with BSD as far as I can see.
Besides character conversion, ICU can also do collation in
various locales and encodings.
My point is, we shouldn't be writing a new set of routines to
do half a job if there are already libraries available to do
all of it.

This sounds like a brilliant move, if anything.

--
Oliver Elphick
olly@lfix.co.uk
Isle of Wight
http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA 92C8 39E7 280E 3631 3F0E 1EC0 5664 7A2F
A543 10EA
========================================
"Be still before the LORD and wait patiently for him;
do not fret when men succeed in their ways, when they
carry out their wicked schemes."
Psalms 37:7

Kind Regards,

John Hansen

#31Oliver Jowett
oliver@opencloud.com
In reply to: Tom Lane (#27)
Re: [PATCHES] UNICODE characters above 0x10000

Tom Lane wrote:

If I understood what I was reading, this would take several things:
* Remove the "special UTF-8 check" in pg_verifymbstr;
* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.

Are there any other places that would have to change? Would this break
anything? The testing aspect is what's bothering me at the moment.

Does this change what client_encoding = UNICODE might produce? The JDBC
driver will need some tweaking to handle this -- Java uses UTF-16
internally and I think some supplementary character (?) scheme for
values above 0xffff as of JDK 1.5.

-O

#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Oliver Jowett (#31)
Re: [PATCHES] UNICODE characters above 0x10000

Oliver Jowett <oliver@opencloud.com> writes:

Does this change what client_encoding = UNICODE might produce? The JDBC
driver will need some tweaking to handle this -- Java uses UTF-16
internally and I think some supplementary character (?) scheme for
values above 0xffff as of JDK 1.5.

You're not likely to get out anything you didn't put in, so I'm not sure
it matters.

regards, tom lane

#33Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Oliver Jowett (#31)
Re: [PATCHES] UNICODE characters above 0x10000

Tom Lane wrote:

If I understood what I was reading, this would take several things:
* Remove the "special UTF-8 check" in pg_verifymbstr;
* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.

Are there any other places that would have to change? Would this break
anything? The testing aspect is what's bothering me at the moment.

Does this change what client_encoding = UNICODE might produce? The JDBC
driver will need some tweaking to handle this -- Java uses UTF-16
internally and I think some supplementary character (?) scheme for
values above 0xffff as of JDK 1.5.

Java doesn't handle UCS above 0xffff? I didn't know that. As long as
you put data in and out via JDBC, it shouldn't be a problem. However, if
other APIs put in such data, you will get into trouble...
--
Tatsuo Ishii

#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Hansen (#6)
Re: UNICODE characters above 0x10000

"John Hansen" <john@geeknet.com.au> writes:

Ahh, but that's not the case. You cannot just delete the check, since
not all combinations of bytes are valid UTF8. UTF bytes FE & FF never
appear in a byte sequence for instance.

Well, this is still working at the wrong level. The code that's in
pg_verifymbstr is mainly intended to enforce the *system wide*
assumption that multibyte characters must have the high bit set in
every byte. (We do not support encodings without this property in
the backend, because it breaks code that looks for ASCII characters
... such as the main parser/lexer ...) It's not really intended to
check that the multibyte character is actually legal in its encoding.

The "special UTF-8 check" was never more than a very quick-n-dirty hack
that was in the wrong place to start with. We ought to be getting rid
of it not institutionalizing it. If you want an exact encoding-specific
check on the legitimacy of a multibyte sequence, I think the right way
to do it is to add another function pointer to pg_wchar_table entries to
let each encoding have its own check routine. Perhaps this could be
defined so as to avoid a separate call to pg_mblen inside the loop, and
thereby not add any new overhead. I'm thinking about an API something
like

int validate_mbchar(const unsigned char *str, int len)

with result +N if a valid character N bytes long is present at
*str, and -N if an invalid character is present at *str and
it would be appropriate to display N bytes in the complaint.
(N must be <= len in either case.) This would reduce the main
loop of pg_verifymbstr to a call of this function and an
error-case-handling block.
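
For illustration, a rough standalone sketch of what a UTF-8 routine under
that API might look like; utf8_validate_mbchar is a hypothetical name, and
overlong forms and the U+10FFFF ceiling are not fully enforced here, so
this is not a drop-in implementation.

#include <stdio.h>

/*
 * Sketch only: +N for a valid N-byte UTF-8 character at *str, -N when the
 * sequence is invalid and N bytes are worth showing in the complaint.
 * Assumes len >= 1.
 */
static int utf8_validate_mbchar(const unsigned char *str, int len)
{
    int mblen, i;

    if (str[0] < 0x80)
        mblen = 1;
    else if ((str[0] & 0xe0) == 0xc0)
        mblen = 2;
    else if ((str[0] & 0xf0) == 0xe0)
        mblen = 3;
    else if ((str[0] & 0xf8) == 0xf0)
        mblen = 4;
    else
        return -1;          /* 0x80-0xBF or 0xF8-0xFF cannot start a character */

    if (mblen > len)
        return -len;        /* sequence truncated at end of string */

    for (i = 1; i < mblen; i++)
        if ((str[i] & 0xc0) != 0x80)
            return -(i + 1);    /* show bytes up to and including the bad one */

    return mblen;
}

int main(void)
{
    const unsigned char good[] = { 0xF4, 0x8F, 0xBF, 0xBF };   /* U+10FFFF */
    const unsigned char bad[]  = { 0xE3, 0x28, 0x41, 0x41 };   /* bad continuation */

    /* expected output: 4 -2 */
    printf("%d %d\n", utf8_validate_mbchar(good, 4), utf8_validate_mbchar(bad, 4));
    return 0;
}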

regards, tom lane

#35Oliver Jowett
oliver@opencloud.com
In reply to: Tatsuo Ishii (#33)
Re: [PATCHES] UNICODE characters above 0x10000

Tatsuo Ishii wrote:

Tom Lane wrote:

If I understood what I was reading, this would take several things:
* Remove the "special UTF-8 check" in pg_verifymbstr;
* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.

Are there any other places that would have to change? Would this break
anything? The testing aspect is what's bothering me at the moment.

Does this change what client_encoding = UNICODE might produce? The JDBC
driver will need some tweaking to handle this -- Java uses UTF-16
internally and I think some supplementary character (?) scheme for
values above 0xffff as of JDK 1.5.

Java doesn't handle UCS above 0xffff? I didn't know that. As long as
you put in/out JDBC, it shouldn't be a problem. However if other APIs
put in such a data, you will get into trouble...

Internally, Java strings are arrays of UTF-16 values. Before JDK 1.5,
all the string-manipulation library routines assumed that one code point
== one UTF-16 value, so you can't represent values above 0xffff. The 1.5
libraries understand using supplementary characters to use multiple
UTF-16 values per code point. See
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

However, the JDBC driver needs to be taught how to translate
between UTF-8 representations of code points above 0xffff and pairs of
UTF-16 values. Previously it didn't need to do anything since the server
didn't use those high values. It's a minor thing.
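
For illustration, a small standalone sketch in C (not the JDBC driver
itself) of the surrogate-pair mapping described above; to_utf16_pair is a
hypothetical helper and assumes the code point is above 0xFFFF.

#include <stdio.h>

/* Sketch only: split a code point above 0xFFFF into a UTF-16 surrogate pair. */
static void to_utf16_pair(unsigned int cp, unsigned short *hi, unsigned short *lo)
{
    cp -= 0x10000;                                    /* 20 bits remain */
    *hi = (unsigned short) (0xD800 | (cp >> 10));     /* high surrogate */
    *lo = (unsigned short) (0xDC00 | (cp & 0x3FF));   /* low surrogate */
}

int main(void)
{
    unsigned short hi, lo;

    to_utf16_pair(0x10400, &hi, &lo);       /* DESERET CAPITAL LETTER LONG I */

    /* expected output: U+10400 -> D801 DC00 */
    printf("U+10400 -> %04X %04X\n", hi, lo);
    return 0;
}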

-O

#36John Hansen
john@geeknet.com.au
In reply to: Oliver Jowett (#35)
Re: UNICODE characters above 0x10000

Well, this is still working at the wrong level. The code
that's in pg_verifymbstr is mainly intended to enforce the
*system wide* assumption that multibyte characters must have
the high bit set in every byte. (We do not support encodings
without this property in the backend, because it breaks code
that looks for ASCII characters ... such as the main
parser/lexer ...) It's not really intended to check that the
multibyte character is actually legal in its encoding.

Ok, point taken.

The "special UTF-8 check" was never more than a very
quick-n-dirty hack that was in the wrong place to start with.
We ought to be getting rid of it not institutionalizing it.
If you want an exact encoding-specific check on the
legitimacy of a multibyte sequence, I think the right way to
do it is to add another function pointer to pg_wchar_table
entries to let each encoding have its own check routine.
Perhaps this could be defined so as to avoid a separate call
to pg_mblen inside the loop, and thereby not add any new
overhead. I'm thinking about an API something like

int validate_mbchar(const unsigned char *str, int len)

with result +N if a valid character N bytes long is present
at *str, and -N if an invalid character is present at *str
and it would be appropriate to display N bytes in the complaint.
(N must be <= len in either case.) This would reduce the
main loop of pg_verifymbstr to a call of this function and an
error-case-handling block.

Sounds like a plan...

regards, tom lane

Regards,

John Hansen