Surrogate pairs in UTF-8

Started by Dave Rosckes about 11 years ago, 3 messages, general
#1 Dave Rosckes
dave.rosckes@gmail.com

I have written a test program using postgres that creates a string with a
surrogate pair. I then insert that string into a varchar property in a
table.

I then execute a select statement to pull the string out. But when I
evaluate the string the lead char of the pair is correct, but the following
pair value is mangled. I run this exact same code using DB2 and it works
just fine.

Is this a postgres limitation, or is there a specific way surrogate pairs
need to be handled?

Thanks,

#2 Martijn van Oosterhout
kleptog@svana.org
In reply to: Dave Rosckes (#1)
Re: Surrogate pairs in UTF-8

On Fri, Jan 16, 2015 at 08:16:47AM -0600, Dave Rosckes wrote:

> I have written a test program using postgres that creates a string with a
> surrogate pair. I then insert that string into a varchar property in a
> table.
>
> I then execute a select statement to pull the string out. But when I
> evaluate the string the lead char of the pair is correct, but the following
> pair value is mangled. I run this exact same code using DB2 and it works
> just fine.
>
> Is this a postgres limitation, or is there a specific way surrogate pairs
> need to be handled?

Sounds odd. Can you provide actual queries showing the problem (and
server version)?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#2)
Re: Surrogate pairs in UTF-8

Martijn van Oosterhout <kleptog@svana.org> writes:

> On Fri, Jan 16, 2015 at 08:16:47AM -0600, Dave Rosckes wrote:
>
>> I have written a test program using postgres that creates a string with a
>> surrogate pair. I then insert that string into a varchar property in a
>> table.
>>
>> I then execute a select statement to pull the string out. But when I
>> evaluate the string the lead char of the pair is correct, but the following
>> pair value is mangled. I run this exact same code using DB2 and it works
>> just fine.
>>
>> Is this a postgres limitation, or is there a specific way surrogate pairs
>> need to be handled?
>
> Sounds odd. Can you provide actual queries showing the problem (and
> server version)?

Surrogate pairs are illegal in UTF-8, per its specification at
http://www.faqs.org/rfcs/rfc3629.html
You're supposed to encode the underlying code point, not a surrogate pair
(those are a UTF-16ism). So if what you passed in was actually a
surrogate pair, it should have failed the encoding validity check, or
possibly have been converted to the underlying single Unicode character,
depending on exactly what code path is involved.

regards, tom lane
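
As a rough illustration of the point in message #3, here is a minimal Python sketch (not taken from the thread, and not PostgreSQL's own validation code). It shows how a UTF-16 surrogate pair maps to a single code point, what the correct UTF-8 bytes for that code point look like, and why encoding each surrogate on its own yields bytes that a strict UTF-8 decoder rejects. The helper names `combine_surrogates` and `is_valid_utf8` and the sample character U+1F600 are assumptions chosen for the example.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into the code point it represents."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Sample character (an assumption for illustration): U+1F600,
# which UTF-16 represents as the surrogate pair D83D DE00.
code_point = combine_surrogates(0xD83D, 0xDE00)

# Correct UTF-8: encode the underlying code point as one 4-byte sequence.
correct_utf8 = chr(code_point).encode("utf-8")

# Incorrect: encode each surrogate separately as its own 3-byte sequence.
# "surrogatepass" is used here only to force Python to emit these bytes.
surrogates_as_bytes = "\ud83d\ude00".encode("utf-8", "surrogatepass")

print(hex(code_point))                     # 0x1f600
print(correct_utf8.hex(" "))               # f0 9f 98 80
print(surrogates_as_bytes.hex(" "))        # ed a0 bd ed b8 80
print(is_valid_utf8(correct_utf8))         # True
print(is_valid_utf8(surrogates_as_bytes))  # False
```

If a client sends the six-byte form, a server that validates UTF-8 strictly would be expected to reject it. A client library that converts from UTF-16 properly would send the four-byte form instead.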
