BUG #12845: The GB18030 encoding doesn't support Unicode characters over 0xFFFF
The following bug has been logged on the website:
Bug reference: 12845
Logged by: Arjen Nienhuis
Email address: a.g.nienhuis@gmail.com
PostgreSQL version: 9.3.5
Operating system: Ubuntu Linux
Description:
Steps to reproduce:
In psql:
arjen=> select convert_to(chr(128512), 'GB18030');
Actual output:
ERROR: character with byte sequence 0xf0 0x9f 0x98 0x80 in encoding "UTF8"
has no equivalent in encoding "GB18030"
Expected output:
convert_to
------------
\x9439fc36
(1 row)
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On 03/09/2015 10:51 PM, a.g.nienhuis@gmail.com wrote:
> The following bug has been logged on the website:
>
> Bug reference: 12845
> Logged by: Arjen Nienhuis
> Email address: a.g.nienhuis@gmail.com
> PostgreSQL version: 9.3.5
> Operating system: Ubuntu Linux
>
> Description:
> Steps to reproduce:
> In psql:
> arjen=> select convert_to(chr(128512), 'GB18030');
>
> Actual output:
> ERROR: character with byte sequence 0xf0 0x9f 0x98 0x80 in encoding "UTF8"
> has no equivalent in encoding "GB18030"
>
> Expected output:
>  convert_to
> ------------
>  \x9439fc36
> (1 row)
Hmm, looks like our gb18030 <-> Unicode conversion table only contains
the Unicode BMP plane. Code points above 0xffff are not included.

If we added all the missing mappings as one-to-one mappings, like we've
done for the BMP, that would bloat the table horribly. There are over 1
million code points that are currently not mapped. Fortunately, the
missing mappings are in linear ranges that would be fairly simple to
handle programmatically. See e.g.
https://ssl.icu-project.org/repos/icu/data/trunk/charset/source/gb18030/gb18030.html.

Someone needs to write the code (I'm not volunteering myself).
- Heikki
On 10 Mar 2015 22:33, "Heikki Linnakangas" <hlinnaka@iki.fi> wrote:
> On 03/09/2015 10:51 PM, a.g.nienhuis@gmail.com wrote:
>> arjen=> select convert_to(chr(128512), 'GB18030');
>>
>> Actual output:
>> ERROR: character with byte sequence 0xf0 0x9f 0x98 0x80 in encoding "UTF8"
>> has no equivalent in encoding "GB18030"
>>
>> Expected output:
>>  convert_to
>> ------------
>>  \x9439fc36
>> (1 row)
>
> Hmm, looks like our gb18030 <-> Unicode conversion table only contains
> the Unicode BMP plane. Code points above 0xffff are not included.
>
> If we added all the missing mappings as one-to-one mappings, like we've
> done for the BMP, that would bloat the table horribly. There are over 1
> million code points that are currently not mapped. Fortunately, the
> missing mappings are in linear ranges that would be fairly simple to
> handle programmatically. See e.g.
> https://ssl.icu-project.org/repos/icu/data/trunk/charset/source/gb18030/gb18030.html.
>
> Someone needs to write the code (I'm not volunteering myself).
>
> - Heikki
I can write a "uint32 UTF8toGB18030(uint32)" function, but I don't know
where to put it in the code.
(Maybe at line 479 of conv.c:
https://github.com/postgres/postgres/blob/4baaf863eca5412e07a8441b3b7e7482b7a8b21a/src/backend/utils/mb/conv.c#L479
)
Else I could also extend the map file. It would double in size if it only
needs to include valid code points.
On 03/10/2015 11:21 PM, Arjen Nienhuis wrote:
> > Someone needs to write the code (I'm not volunteering myself).
>
> I can write a "uint32 UTF8toGB18030(uint32)" function, but I don't know
> where to put it in the code.

The mapping functions are in
src/backend/utils/mb/conversion_procs/utf8_and_gb18030/utf8_and_gb18030.c.
They currently just consult the mapping table. You'd need to modify them
to also check whether the code point is in one of those linear ranges, and
do the mapping for those programmatically.

> Else I could also extend the map file. It would double in size if it only
> needs to include valid code points.

The current mapping table contains about 63000 mappings, but there are
over a million valid code points that need to be mapped. If you just add
every one-to-one mapping to the table, it's going to blow up in size to
over 8 MB. I don't think we want that; handling the ranges with linear
mappings programmatically makes a lot more sense.
- Heikki
On Tue, Mar 10, 2015 at 11:33:43PM +0100, Heikki Linnakangas wrote:
> > I can write a "uint32 UTF8toGB18030(uint32)" function, but I don't know
> > where to put it in the code.
>
> The mapping functions are in
> src/backend/utils/mb/conversion_procs/utf8_and_gb18030/utf8_and_gb18030.c.
> They currently just consult the mapping table. You'd need to modify
> them to also check if the codepoint is in one of those linear
> ranges, and do the mapping for those programmatically.
>
> The current mapping table contains about 63000 mappings, but there
> are over a million valid code points that need to be mapped. If you
> just add every one-to-one mapping to the table, it's going to blow
> up in size to over 8 MB. I don't think we want that, handling the
> ranges with linear mappings programmatically makes a lot more sense.
Should this be a TODO entry?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 04/30/2015 06:13 PM, Bruce Momjian wrote:
> Should this be a TODO entry?
Yeah, I guess it should.
- Heikki
On Thu, Apr 30, 2015 at 09:48:08PM -0700, Heikki Linnakangas wrote:
> > Should this be a TODO entry?
>
> Yeah, I guess it should.
Done.