chr() is still too loose about UTF8 code points

Started by Tom Lane over 11 years ago · 8 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

Quite some time ago, we made the chr() function accept Unicode code points
up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8
string. It was pointed out to me though that RFC3629 restricted the
original definition of UTF8 to only allow code points up to U+10FFFF (for
compatibility with UTF16). While that might not be something we feel we
need to follow exactly, pg_utf8_islegal implements the checking algorithm
specified by RFC3629, and will therefore reject points above U+10FFFF.

This means you can use chr() to create values that will be rejected on
dump and reload:

u8=# create table tt (f1 text);
CREATE TABLE
u8=# insert into tt values(chr('x001fffff'::bit(32)::int));
INSERT 0 1
u8=# select * from tt;
f1
----

(1 row)

u8=# \copy tt to 'junk'
COPY 1
u8=# \copy tt from 'junk'
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
CONTEXT: COPY tt, line 1
LOCATION: report_invalid_encoding, wchar.c:2011
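[Editorial aside: the byte sequence in that error can be reproduced outside the database. This is an illustrative Python sketch of the old RFC 2279 4-byte encoding scheme, not PostgreSQL's actual code path; it shows that encoding U+1FFFFF yields exactly the bytes 0xf7 0xbf 0xbf 0xbf that the RFC 3629-style check then rejects.]

```python
# Encode code point 0x1FFFFF with the original RFC 2279 4-byte UTF-8 scheme.
cp = 0x1FFFFF
seq = bytes([
    0xF0 | (cp >> 18),           # leading byte: 11110xxx
    0x80 | ((cp >> 12) & 0x3F),  # continuation bytes: 10xxxxxx
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
])
print(seq.hex(' '))  # f7 bf bf bf -- the bytes COPY complains about

# An RFC 3629 decoder (such as Python's) rejects the sequence,
# just as pg_utf8_islegal does on reload.
try:
    seq.decode('utf-8')
except UnicodeDecodeError as e:
    print('rejected:', e.reason)
```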

I think this probably means we need to change chr() to reject code points
above 10ffff. Should we back-patch that, or just do it in HEAD?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#1)
Re: chr() is still too loose about UTF8 code points

On 05/16/2014 06:05 PM, Tom Lane wrote:

Quite some time ago, we made the chr() function accept Unicode code points
up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8
string. It was pointed out to me though that RFC3629 restricted the
original definition of UTF8 to only allow code points up to U+10FFFF (for
compatibility with UTF16). While that might not be something we feel we
need to follow exactly, pg_utf8_islegal implements the checking algorithm
specified by RFC3629, and will therefore reject points above U+10FFFF.

This means you can use chr() to create values that will be rejected on
dump and reload:

u8=# create table tt (f1 text);
CREATE TABLE
u8=# insert into tt values(chr('x001fffff'::bit(32)::int));
INSERT 0 1
u8=# select * from tt;
f1
----

(1 row)

u8=# \copy tt to 'junk'
COPY 1
u8=# \copy tt from 'junk'
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
CONTEXT: COPY tt, line 1
LOCATION: report_invalid_encoding, wchar.c:2011

I think this probably means we need to change chr() to reject code points
above 10ffff. Should we back-patch that, or just do it in HEAD?

+1 for back-patching. A value that cannot be restored is bad, and I
can't imagine any legitimate use case for producing a Unicode character
larger than U+10FFFF with chr(x), when the rest of the system doesn't
handle it. Fully supporting such values might be useful, but that's a
different story.

- Heikki

#3 Andrew Dunstan
andrew@dunslane.net
In reply to: Heikki Linnakangas (#2)
Re: chr() is still too loose about UTF8 code points

On 05/16/2014 12:43 PM, Heikki Linnakangas wrote:

On 05/16/2014 06:05 PM, Tom Lane wrote:

Quite some time ago, we made the chr() function accept Unicode code points
up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8
string. It was pointed out to me though that RFC3629 restricted the
original definition of UTF8 to only allow code points up to U+10FFFF (for
compatibility with UTF16). While that might not be something we feel we
need to follow exactly, pg_utf8_islegal implements the checking algorithm
specified by RFC3629, and will therefore reject points above U+10FFFF.

This means you can use chr() to create values that will be rejected on
dump and reload:

u8=# create table tt (f1 text);
CREATE TABLE
u8=# insert into tt values(chr('x001fffff'::bit(32)::int));
INSERT 0 1
u8=# select * from tt;
f1
----

(1 row)

u8=# \copy tt to 'junk'
COPY 1
u8=# \copy tt from 'junk'
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
CONTEXT: COPY tt, line 1
LOCATION: report_invalid_encoding, wchar.c:2011

I think this probably means we need to change chr() to reject code points
above 10ffff. Should we back-patch that, or just do it in HEAD?

+1 for back-patching. A value that cannot be restored is bad, and I
can't imagine any legitimate use case for producing a Unicode
character larger than U+10FFFF with chr(x), when the rest of the
system doesn't handle it. Fully supporting such values might be
useful, but that's a different story.

My understanding is that U+10FFFF is the highest legal Unicode code
point anyway. So this is really just tightening our routines to make
sure we don't produce an invalid value. We won't be disallowing anything
that is legal Unicode.
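[Editorial aside: Andrew's point can be checked in any Unicode-aware language; for instance, Python's own chr() already enforces the same ceiling. Illustrative only, not PostgreSQL code.]

```python
# Python's chr() follows the Unicode / RFC 3629 limit: U+10FFFF is the
# largest valid code point, and anything above it is an error.
print(len(chr(0x10FFFF)))  # 1 -- the highest legal code point is fine
try:
    chr(0x110000)          # one past the limit
except ValueError as e:
    print('rejected:', e)
```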

cheers

andrew

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#2)
Re: chr() is still too loose about UTF8 code points

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 05/16/2014 06:05 PM, Tom Lane wrote:

I think this probably means we need to change chr() to reject code points
above 10ffff. Should we back-patch that, or just do it in HEAD?

+1 for back-patching. A value that cannot be restored is bad, and I
can't imagine any legitimate use case for producing a Unicode character
larger than U+10FFFF with chr(x), when the rest of the system doesn't
handle it. Fully supporting such values might be useful, but that's a
different story.

Well, AFAICT "the rest of the system" does handle any code point up to
U+1FFFFF. It's only pg_utf8_islegal that's being picky. So another
possible answer is to weaken the check in pg_utf8_islegal. However,
that could create interoperability concerns with other software, and
as you say the use-case for larger values seems pretty thin.

Actually, after re-reading the spec there's more to it than this:
chr() will allow creating utf8 sequences that correspond to the
surrogate-pair codes, which are expressly disallowed in UTF8 by
the RFCs. Maybe we should apply pg_utf8_islegal to the result
string rather than duplicating its checks?
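[Editorial aside: the surrogate range U+D800..U+DFFF that Tom mentions is reserved for UTF-16 pairs and is expressly forbidden in UTF-8 by RFC 3629. An illustrative Python sketch, not PostgreSQL code, showing that a strict decoder rejects an encoded surrogate just like an over-range code point:]

```python
# Encode surrogate code point U+D800 as a 3-byte UTF-8-style sequence.
cp = 0xD800
seq = bytes([
    0xE0 | (cp >> 12),          # leading byte: 1110xxxx
    0x80 | ((cp >> 6) & 0x3F),  # continuation bytes: 10xxxxxx
    0x80 | (cp & 0x3F),
])
print(seq.hex(' '))  # ed a0 80

# RFC 3629 forbids surrogates in UTF-8, so strict decoders reject
# the sequence even though it is structurally well-formed.
try:
    seq.decode('utf-8')
except UnicodeDecodeError as e:
    print('rejected:', e.reason)
```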

BTW, there are various places that have comments or ifdefd-out code
anticipating possible future support of 5- or 6-byte UTF8 sequences,
which were specified in RFC2279 but then rescinded by RFC3629.
I guess as a matter of cleanup we should think about removing that
stuff.

regards, tom lane

#5 Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#1)
Re: chr() is still too loose about UTF8 code points

On Fri, May 16, 2014 at 11:05:08AM -0400, Tom Lane wrote:

I think this probably means we need to change chr() to reject code points
above 10ffff. Should we back-patch that, or just do it in HEAD?

The compatibility risks resemble those associated with the fixes for bug
#9210, so I recommend HEAD only:

/messages/by-id/20140220043940.GA3064539@tornado.leadboat.com

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#5)
Re: chr() is still too loose about UTF8 code points

Noah Misch <noah@leadboat.com> writes:

On Fri, May 16, 2014 at 11:05:08AM -0400, Tom Lane wrote:

I think this probably means we need to change chr() to reject code points
above 10ffff. Should we back-patch that, or just do it in HEAD?

The compatibility risks resemble those associated with the fixes for bug
#9210, so I recommend HEAD only:

/messages/by-id/20140220043940.GA3064539@tornado.leadboat.com

While I'd be willing to ignore that risk so far as code points above
10ffff go, if we want pg_utf8_islegal to be happy then we will also
have to reject surrogate-pair code points. It's not beyond the realm
of possibility that somebody is intentionally generating such code
points with chr(), despite the dump/reload hazard. So now I agree
that this is sounding more like a major-version-only behavioral change.

regards, tom lane

#7 David G Johnston
david.g.johnston@gmail.com
In reply to: Tom Lane (#6)
Re: chr() is still too loose about UTF8 code points

Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

On Fri, May 16, 2014 at 11:05:08AM -0400, Tom Lane wrote:

I think this probably means we need to change chr() to reject code points
above 10ffff. Should we back-patch that, or just do it in HEAD?

The compatibility risks resemble those associated with the fixes for bug
#9210, so I recommend HEAD only:

/messages/by-id/20140220043940.GA3064539@tornado.leadboat.com

While I'd be willing to ignore that risk so far as code points above
10ffff go, if we want pg_utf8_islegal to be happy then we will also
have to reject surrogate-pair code points. It's not beyond the realm
of possibility that somebody is intentionally generating such code
points with chr(), despite the dump/reload hazard. So now I agree
that this is sounding more like a major-version-only behavioral change.

I would tend to agree on principle - though since this falls in a
grey area, does 9.4 qualify for this bug fix?

David J.


#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: David G Johnston (#7)
Re: chr() is still too loose about UTF8 code points

David G Johnston <david.g.johnston@gmail.com> writes:

Tom Lane wrote:

While I'd be willing to ignore that risk so far as code points above
10ffff go, if we want pg_utf8_islegal to be happy then we will also
have to reject surrogate-pair code points. It's not beyond the realm
of possibility that somebody is intentionally generating such code
points with chr(), despite the dump/reload hazard. So now I agree
that this is sounding more like a major-version-only behavioral change.

I would tend to agree on principle - though since this falls in a
grey area, does 9.4 qualify for this bug fix?

I don't think it's too late to change this in 9.4. The discussion was
about whether to back-patch.

regards, tom lane
