PostgreSQL fails to convert decomposed utf-8 to other encodings

Started by Craig Ringer almost 12 years ago | 5 messages | bugs

craig@2ndquadrant.com

almost 12 years ago

There's a bug in encoding conversions from utf-8 to other encodings that
results in corrupt output if decomposed utf-8 is used.

PostgreSQL doesn't normalize utf-8 to precomposed form before converting,
so decomposed UTF-8 is not handled correctly.

Take á:

regress=> -- Decomposed - 'a' then 'acute'
regress=> SELECT E'\u0061\u0301';
?column?
----------
á
(1 row)

regress=> -- Precomposed - 'a-acute'
regress=> SELECT E'\u00E1';
?column?
----------
á
(1 row)

regress=> SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');
ERROR: character with byte sequence 0xcc 0x81 in encoding "UTF8" has no
equivalent in encoding "LATIN1"

regress=> SELECT convert_to(E'\u00E1', 'iso-8859-1');
convert_to
------------
\xe1
(1 row)

This affects input from the client too:

regress=> SELECT convert_to('á', 'iso-8859-1');
ERROR: character with byte sequence 0xcc 0x81 in encoding "UTF8" has no
equivalent in encoding "LATIN1"

regress=> SELECT convert_to('á', 'iso-8859-1');
convert_to
------------
\xe1
(1 row)

... yes, that looks like the same function producing different results
on identical input. You might not be able to reproduce with copy and
paste from this mail if your client normalizes UTF-8, but you'll be able
to by printing the decomposed character to your terminal as an escape
string, then copying and pasting from there.

We should've probably been normalizing decomposed sequences to
precomposed as part of utf-8 validation wherever 'text' input occurs,
but it's too late for that now as DBs in the wild will contain
decomposed chars. Instead, conversion functions need to normalize
decomposed chars to precomposed before converting from utf-8 to another
encoding.
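
As an illustration only (this is not PostgreSQL code), the idea of normalizing before conversion can be sketched in Python with the standard unicodedata module. The helper name to_latin1 is hypothetical, and Python's "iso-8859-1" codec stands in for the server's LATIN1 conversion:

```python
import unicodedata

decomposed = "a\u0301"    # 'a' + U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e1"    # U+00E1 LATIN SMALL LETTER A WITH ACUTE


def to_latin1(text: str) -> bytes:
    """Hypothetical helper: normalize to NFC, then encode as ISO-8859-1."""
    return unicodedata.normalize("NFC", text).encode("iso-8859-1")


# The combining accent encodes as 0xcc 0x81, the bytes named in the error.
print(decomposed.encode("utf-8").hex())   # 61cc81
print(precomposed.encode("utf-8").hex())  # c3a1

# Without normalization the conversion fails, much as convert_to() did.
try:
    decomposed.encode("iso-8859-1")
except UnicodeEncodeError as exc:
    print("unnormalized input rejected:", exc.reason)

# With NFC applied first, both spellings yield the same single byte.
print(to_latin1(decomposed).hex())   # e1
print(to_latin1(precomposed).hex())  # e1
```

NFC is used here because the target encoding only has the precomposed character; whether the server could afford this step on every conversion is a separate question.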

Comments?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Tom Lane

tgl@sss.pgh.pa.us

almost 12 years ago

In reply to: Craig Ringer (#1)

Re: PostgreSQL fails to convert decomposed utf-8 to other encodings

Craig Ringer <craig@2ndquadrant.com> writes:

There's a bug in encoding conversions from utf-8 to other encodings that
results in corrupt output if decomposed utf-8 is used.

We don't actually support "decomposed" utf8; if there is any bug here,
it's that the input you show isn't rejected. But I think there was
some intentional choice to not check \u escapes fully.

regards, tom lane

Craig Ringer

craig@2ndquadrant.com

almost 12 years ago

In reply to: Tom Lane (#2)

Re: PostgreSQL fails to convert decomposed utf-8 to other encodings

On 08/06/2014 09:14 AM, Tom Lane wrote:

We don't actually support "decomposed" utf8; if there is any bug here,
it's that the input you show isn't rejected. But I think there was
some intentional choice to not check \u escapes fully.

Combining characters (i.e. decomposed utf-8 form, for chars where there
is a combined equivalent) are part of utf-8. They're not an optional add-on.

So if Pg doesn't support them, it doesn't fully support utf-8. Which is
fine as far as it goes, but must be documented as a limitation at
minimum. (I'll deal with that).

It also means that you get fun anomalies like:

regress=> SELECT 'á' = 'á';
?column?
----------
f
(1 row)

which is IMO insane.
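
The same anomaly can be reproduced outside the database. The following is a small Python sketch, not a description of how PostgreSQL compares text: a code-point-wise comparison treats the two spellings as different, while comparing after normalization treats them as equal. The function name nfc_equal is made up for this example.

```python
import unicodedata

decomposed = "a\u0301"   # two code points: 'a' + combining acute
precomposed = "\u00e1"   # one code point: a-acute


def nfc_equal(left: str, right: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return (unicodedata.normalize("NFC", left)
            == unicodedata.normalize("NFC", right))


print(decomposed == precomposed)           # False
print(nfc_equal(decomposed, precomposed))  # True
print(len(decomposed), len(precomposed))   # 2 1
```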

Not only that, but we can't reject decomposed forms, because they will
already exist in live installs. That'd break dump and reload of such
installs and cause exciting problems with pg_upgrade.

The "we'll just reject part of utf-8" opportunity has flown. It needs to
be documented as a bug in existing versions, and I guess given that I'm
the one complaining I get to see if I can find a sane fix for 9.5...

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Craig Ringer

craig@2ndquadrant.com

almost 12 years ago

In reply to: Craig Ringer (#3)

Re: PostgreSQL fails to convert decomposed utf-8 to other encodings

On 08/06/2014 11:54 AM, Craig Ringer wrote:

On 08/06/2014 09:14 AM, Tom Lane wrote:

We don't actually support "decomposed" utf8; if there is any bug here,
it's that the input you show isn't rejected. But I think there was
some intentional choice to not check \u escapes fully.

Combining characters (i.e. decomposed utf-8 form, for chars where there
is a combined equivalent) are part of utf-8. They're not an optional add-on.

... though we can advertise partial Unicode support, saying that we
support UTF-8 for UCS (ISO 10646-1:2000 Annex D / RFC 3629)
implementation level 1 only, requiring Normalization Form C (NFC) input.

Given that Pg doesn't seem to understand \xf8 or \xfc utf-8 chars, so it
doesn't cover the full utf-8 range, it doesn't look like it meets Level
1 either. So it supports "mostly-utf8".
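
For reference, RFC 3629 limits UTF-8 to code points up to U+10FFFF, which need at most four bytes, so lead bytes 0xf8 and 0xfc do not occur in well-formed input under that RFC. A quick Python check (illustrative only; is_valid_utf8 is a hypothetical helper and the byte strings are hand-built samples) shows a strict decoder rejecting them as well:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The highest code point permitted by RFC 3629 fits in four bytes.
highest = chr(0x10FFFF).encode("utf-8")
print(len(highest), highest.hex())  # 4 f48fbfbf

# Hand-built five- and six-byte sequences in the older, longer scheme.
five_byte = b"\xf8\x88\x80\x80\x80"
six_byte = b"\xfc\x84\x80\x80\x80\x80"
print(is_valid_utf8(five_byte))  # False
print(is_valid_utf8(six_byte))   # False
```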

With level 1 we should really _reject_ combining chars, but can't do
that w/o breaking BC.

I guess I should turn this:

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

into a regression test.

Possibly also parts of this:

http://www.columbia.edu/~fdc/utf8/

though it's more oriented toward rendering.

It's worth noting that Konsole and Thunderbird had no issues with
combining chars when I was testing this.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Tatsuo Ishii

t-ishii@sra.co.jp

almost 12 years ago

In reply to: Craig Ringer (#3)

Re: PostgreSQL fails to convert decomposed utf-8 to other encodings

On 08/06/2014 09:14 AM, Tom Lane wrote:

We don't actually support "decomposed" utf8; if there is any bug here,
it's that the input you show isn't rejected. But I think there was
some intentional choice to not check \u escapes fully.

Combining characters (i.e. decomposed utf-8 form, for chars where there
is a combined equivalent) are part of utf-8. They're not an optional add-on.

So if Pg doesn't support them, it doesn't fully support utf-8. Which is
fine as far as it goes, but must be documented as a limitation at
minimum. (I'll deal with that).

It also means that you get fun anomalies like:

regress=> SELECT 'á' = 'á';
?column?
----------
f
(1 row)

which is IMO insane.

Not only that, but we can't reject decomposed forms, because they will
already exist in live installs. That'd break dump and reload of such
installs and cause exciting problems with pg_upgrade.

The "we'll just reject part of utf-8" opportunity has flown. It needs to
be documented as a bug in existing versions, and I guess given that I'm
the one complaining I get to see if I can find a sane fix for 9.5...

I'm not sure what you mean by decomposed utf8, because there's no such
thing in the Unicode standard. Maybe you mean "composite character"
or "precomposed character"?

Anyway, in my understanding, to handle composite characters we should do
"Unicode normalization" in the first place. There are 4 types of
normalization:

NFD (Normalization Form Canonical Decomposition)
NFC (Normalization Form Canonical Composition)
NFKD (Normalization Form Compatibility Decomposition)
NFKC (Normalization Form Compatibility Composition)

I don't know how we could implement one of these without major
performance degradation.
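
To make the four forms concrete, here is a short Python sketch using the standard unicodedata module. It says nothing about performance inside the server; the helper names and the two sample strings are chosen only for illustration.

```python
import unicodedata

FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def codepoints(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def all_forms(text: str) -> dict:
    """Map each normalization form name to the normalized code points."""
    return {form: codepoints(unicodedata.normalize(form, text))
            for form in FORMS}


a_acute = "\u00e1"   # precomposed a-acute
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI (compatibility character)

for sample in (a_acute, ligature):
    print(codepoints(sample), "->", all_forms(sample))
```

The canonical forms (NFD, NFC) leave the ligature alone, while the compatibility forms (NFKD, NFKC) replace it with the plain letters "f" and "i".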

Also, some composite characters can be decomposed but, when composed
again, do not return to their original form (round-trip conversion is
impossible). Such characters are listed as "Composition Exclusions" (see
http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt).
I have no idea how to deal with this issue.
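
A small Python sketch of the round-trip problem, assuming U+0958 (DEVANAGARI LETTER QA, one of the characters in the exclusion list) as the sample:

```python
import unicodedata

original = "\u0958"  # DEVANAGARI LETTER QA, a composition-excluded character


def codepoints(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


decomposed = unicodedata.normalize("NFD", original)
recomposed = unicodedata.normalize("NFC", decomposed)

print(codepoints(original))    # U+0958
print(codepoints(decomposed))  # U+0915 U+093C
print(codepoints(recomposed))  # U+0915 U+093C
print(recomposed == original)  # False
```

NFC leaves the sequence decomposed, so a stored U+0958 would not survive a normalize-on-input policy unchanged.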

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
