UTF16 surrogate pairs in UTF8 encoding

Started by Tom Lane · over 15 years ago · 12 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

I just noticed that we are now advertising the ability to insert UTF16
surrogate pairs in strings and identifiers (see section 4.1.2.2 in
current docs, in particular). Is this really wise? I thought that
surrogate pairs were specifically prohibited in UTF8 strings, because
of the security hazards implicit in having more than one way to
represent the same code point.

regards, tom lane

#2 Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#1)
Re: UTF16 surrogate pairs in UTF8 encoding

On Sun, 2010-08-22 at 14:29 -0400, Tom Lane wrote:

I just noticed that we are now advertising the ability to insert UTF16
surrogate pairs in strings and identifiers (see section 4.1.2.2 in
current docs, in particular). Is this really wise? I thought that
surrogate pairs were specifically prohibited in UTF8 strings, because
of the security hazards implicit in having more than one way to
represent the same code point.

We combine the surrogate pair components to a single code point and
encode that in UTF-8. We don't encode the components separately; that
would be wrong.
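
For illustration, the combining step amounts to the standard UTF-16 arithmetic (a minimal Python sketch, not the server's actual C code; the pair \D835\DD65 is just an example):

```python
def combine_surrogates(hi, lo):
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    assert 0xD800 <= hi <= 0xDBFF, "not a high surrogate"
    assert 0xDC00 <= lo <= 0xDFFF, "not a low surrogate"
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# U&'\D835\DD65' combines to U+1D565, which is then encoded in UTF-8
# as a single four-byte sequence (never as two three-byte sequences):
cp = combine_surrogates(0xD835, 0xDD65)
assert cp == 0x1D565
assert chr(cp).encode("utf-8") == b"\xf0\x9d\x95\xa5"
```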

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#2)
Re: UTF16 surrogate pairs in UTF8 encoding

Peter Eisentraut <peter_e@gmx.net> writes:

On Sun, 2010-08-22 at 14:29 -0400, Tom Lane wrote:

I just noticed that we are now advertising the ability to insert UTF16
surrogate pairs in strings and identifiers (see section 4.1.2.2 in
current docs, in particular). Is this really wise? I thought that
surrogate pairs were specifically prohibited in UTF8 strings, because
of the security hazards implicit in having more than one way to
represent the same code point.

We combine the surrogate pair components to a single code point and
encode that in UTF-8. We don't encode the components separately; that
would be wrong.

Oh, OK. Should the docs make that a bit clearer?

regards, tom lane

#4 Florian Weimer
fweimer@bfk.de
In reply to: Tom Lane (#1)
Re: UTF16 surrogate pairs in UTF8 encoding

* Tom Lane:

I just noticed that we are now advertising the ability to insert UTF16
surrogate pairs in strings and identifiers (see section 4.1.2.2 in
current docs, in particular). Is this really wise? I thought that
surrogate pairs were specifically prohibited in UTF8 strings, because
of the security hazards implicit in having more than one way to
represent the same code point.

There is relatively little risk because surrogate pairs cannot encode
characters in the BMP, and presumably, most of the critical characters
are located there.

However, if this is converted to regular UTF-8, I really question the
sense of this. Usually, people want CESU-8 to preserve ordering
between languages such as C# and Java and their database, and
conversion destroys this property.
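
The ordering property in question can be demonstrated with a toy CESU-8 encoder (illustrative Python, assuming the straightforward per-code-unit encoding; CESU-8 encodes each UTF-16 code unit, including surrogate halves, as a separate three-byte sequence):

```python
def cesu8(s: str) -> bytes:
    """Toy CESU-8 encoder: encode each UTF-16 code unit separately."""
    be = s.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(be), 2):
        u = (be[i] << 8) | be[i + 1]  # one 16-bit code unit
        if u < 0x80:
            out.append(u)
        elif u < 0x800:
            out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
        else:
            # surrogate halves take this branch too, which is exactly
            # what makes CESU-8 byte order match UTF-16 code unit order
            out += bytes([0xE0 | (u >> 12),
                          0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)])
    return bytes(out)

a, b = "\ue000", "\U00010000"
# UTF-8 byte order follows code point order: U+E000 < U+10000 ...
assert a.encode("utf-8") < b.encode("utf-8")
# ... but CESU-8 byte order follows UTF-16 order, where the surrogate
# range makes U+10000 sort before U+E000:
assert cesu8(b) < cesu8(a)
```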

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

#5 Marko Kreen
markokr@gmail.com
In reply to: Peter Eisentraut (#2)
Re: UTF16 surrogate pairs in UTF8 encoding

On 8/22/10, Peter Eisentraut <peter_e@gmx.net> wrote:

On Sun, 2010-08-22 at 14:29 -0400, Tom Lane wrote:

I just noticed that we are now advertising the ability to insert UTF16
surrogate pairs in strings and identifiers (see section 4.1.2.2 in
current docs, in particular). Is this really wise? I thought that
surrogate pairs were specifically prohibited in UTF8 strings, because
of the security hazards implicit in having more than one way to
represent the same code point.

We combine the surrogate pair components to a single code point and
encode that in UTF-8. We don't encode the components separately; that
would be wrong.

AFAICS our UTF8 validator (pg_utf8_islegal) detects and rejects
such sequences if they are inserted via any means, e.g. \x

Although it's not very obvious...
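
A minimal sketch of what that check catches (a Python stand-in for the C validator; a strict UTF-8 decoder likewise rejects directly-encoded surrogate halves):

```python
def is_legal_utf8(data: bytes) -> bool:
    """Reject byte strings that are not well-formed UTF-8, including
    three-byte sequences that encode surrogates (U+D800..U+DFFF)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True

# A surrogate pair encoded as two three-byte sequences (CESU-8 style)
# is rejected as invalid UTF-8:
assert not is_legal_utf8(b"\xed\xa0\xb5\xed\xb5\xa5")
# The properly combined four-byte encoding of the same character passes:
assert is_legal_utf8(b"\xf0\x9d\x95\xa5")
```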

--
marko

#6 Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#3)
Re: UTF16 surrogate pairs in UTF8 encoding

On Sun, 2010-08-22 at 15:15 -0400, Tom Lane wrote:

We combine the surrogate pair components to a single code point and
encode that in UTF-8. We don't encode the components separately; that
would be wrong.

Oh, OK. Should the docs make that a bit clearer?

Done.

#7 Marko Kreen
markokr@gmail.com
In reply to: Peter Eisentraut (#6)
Re: UTF16 surrogate pairs in UTF8 encoding

On 9/7/10, Peter Eisentraut <peter_e@gmx.net> wrote:

On Sun, 2010-08-22 at 15:15 -0400, Tom Lane wrote:

We combine the surrogate pair components to a single code point and
encode that in UTF-8. We don't encode the components separately; that
would be wrong.

Oh, OK. Should the docs make that a bit clearer?

Done.

This is confusing:

(When surrogate
pairs are used when the server encoding is <literal>UTF8</>, they
are first combined into a single code point that is then encoded
in UTF-8.)

So something else happens if encoding is not UTF8?

I think this part can be simply removed, it does not add anything.

Or say that surrogate pairs are only allowed in UTF8 encoding.
Reason is that you cannot encode 0..7F codepoints with them,
and only those are allowed to be given numerically. But this is
already mentioned before.

--
marko

#8 Peter Eisentraut
peter_e@gmx.net
In reply to: Marko Kreen (#7)
Re: UTF16 surrogate pairs in UTF8 encoding

On Wed, 2010-09-08 at 10:18 +0300, Marko Kreen wrote:

On 9/7/10, Peter Eisentraut <peter_e@gmx.net> wrote:

On sön, 2010-08-22 at 15:15 -0400, Tom Lane wrote:

We combine the surrogate pair components to a single code point and
encode that in UTF-8. We don't encode the components separately; that
would be wrong.

Oh, OK. Should the docs make that a bit clearer?

Done.

This is confusing:

(When surrogate
pairs are used when the server encoding is <literal>UTF8</>, they
are first combined into a single code point that is then encoded
in UTF-8.)

So something else happens if encoding is not UTF8?

Then you can't specify surrogate pairs because they are outside of the
ASCII range, per constraint mentioned earlier in the paragraph.

I think this part can be simply removed, it does not add anything.

Or say that surrogate pairs are only allowed in UTF8 encoding.
Reason is that you cannot encode 0..7F codepoints with them,
and only those are allowed to be given numerically. But this is
already mentioned before.

Well, Tom wanted an additional explanation. I personally agree with
you; this is not the place to explain encoding and Unicode internals,
when really the code only does what it's supposed to.

#9 Marko Kreen
markokr@gmail.com
In reply to: Peter Eisentraut (#8)
Re: UTF16 surrogate pairs in UTF8 encoding

On 9/8/10, Peter Eisentraut <peter_e@gmx.net> wrote:

On Wed, 2010-09-08 at 10:18 +0300, Marko Kreen wrote:

On 9/7/10, Peter Eisentraut <peter_e@gmx.net> wrote:

On Sun, 2010-08-22 at 15:15 -0400, Tom Lane wrote:

We combine the surrogate pair components to a single code point and
encode that in UTF-8. We don't encode the components separately; that
would be wrong.

Oh, OK. Should the docs make that a bit clearer?

Done.

This is confusing:

(When surrogate
pairs are used when the server encoding is <literal>UTF8</>, they
are first combined into a single code point that is then encoded
in UTF-8.)

So something else happens if encoding is not UTF8?

Then you can't specify surrogate pairs because they are outside of the
ASCII range, per constraint mentioned earlier in the paragraph.

I think this part can be simply removed, it does not add anything.

Or say that surrogate pairs are only allowed in UTF8 encoding.
Reason is that you cannot encode 0..7F codepoints with them,
and only those are allowed to be given numerically. But this is
already mentioned before.

Well, Tom wanted an additional explanation. I personally agree with
you; this is not the place to explain encoding and Unicode internals,
when really the code only does what it's supposed to.

Ah OK, I had the impression you had changed the wording before that
too, so this addition seemed unnecessary. But it seems you only
changed the formatting.

Anyway, this "when" makes it read awkwardly. Maybe a more concise version:

To repeat: surrogate pairs are combined into a single character and
then encoded; they are not stored separately.

Although it does seem unnecessary.

--
marko

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Kreen (#9)
Re: UTF16 surrogate pairs in UTF8 encoding

Marko Kreen <markokr@gmail.com> writes:

Although it does seem unnecessary.

The reason I asked for this to be spelled out is that ordinarily,
a backslash escape \nnn is a very low-level thing that will insert
exactly what you say. To me it's quite unexpected that the system
would editorialize on that to the extent of replacing two UTF16
surrogate characters by a single code point. That's necessary for
correctness because our underlying storage is UTF8, but it's not
obvious that it will happen. (As a counterexample, if our underlying
storage were UTF16, then very different things would need to happen
for the exact same SQL input.)

I think a lot of people will have this same question when reading
this para, which is why I asked for an explanation there.

regards, tom lane

#11 Marko Kreen
markokr@gmail.com
In reply to: Tom Lane (#10)
Re: UTF16 surrogate pairs in UTF8 encoding

On 9/8/10, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Marko Kreen <markokr@gmail.com> writes:

Although it does seem unnecessary.

The reason I asked for this to be spelled out is that ordinarily,
a backslash escape \nnn is a very low-level thing that will insert
exactly what you say. To me it's quite unexpected that the system
would editorialize on that to the extent of replacing two UTF16
surrogate characters by a single code point. That's necessary for
correctness because our underlying storage is UTF8, but it's not
obvious that it will happen. (As a counterexample, if our underlying
storage were UTF16, then very different things would need to happen
for the exact same SQL input.)

I think a lot of people will have this same question when reading
this para, which is why I asked for an explanation there.

Ok, but I still don't like the "when"s. How about:

-    6-digit form technically makes this unnecessary.  (When surrogate
-    pairs are used when the server encoding is <literal>UTF8</>, they
-    are first combined into a single code point that is then encoded
-    in UTF-8.)
+    6-digit form technically makes this unnecessary.  (Surrogate
+    pairs are not stored directly, but combined into a single
+    code point that is then encoded in UTF-8.)

--
marko

#12 Bruce Momjian
bruce@momjian.us
In reply to: Marko Kreen (#11)
Re: UTF16 surrogate pairs in UTF8 encoding

Marko Kreen wrote:

On 9/8/10, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Marko Kreen <markokr@gmail.com> writes:

Although it does seem unnecessary.

The reason I asked for this to be spelled out is that ordinarily,
a backslash escape \nnn is a very low-level thing that will insert
exactly what you say. To me it's quite unexpected that the system
would editorialize on that to the extent of replacing two UTF16
surrogate characters by a single code point. That's necessary for
correctness because our underlying storage is UTF8, but it's not
obvious that it will happen. (As a counterexample, if our underlying
storage were UTF16, then very different things would need to happen
for the exact same SQL input.)

I think a lot of people will have this same question when reading
this para, which is why I asked for an explanation there.

Ok, but I still don't like the "when"s. How about:

-    6-digit form technically makes this unnecessary.  (When surrogate
-    pairs are used when the server encoding is <literal>UTF8</>, they
-    are first combined into a single code point that is then encoded
-    in UTF-8.)
+    6-digit form technically makes this unnecessary.  (Surrogate
+    pairs are not stored directly, but combined into a single
+    code point that is then encoded in UTF-8.)

Applied, thanks.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +