pgsql: Handle Unicode surrogate pairs correctly when processing JSON.

Started by Andrew Dunstan | about 13 years ago | 3 messages | committers
#1 Andrew Dunstan
andrew@dunslane.net

Handle Unicode surrogate pairs correctly when processing JSON.

In 9.2, Unicode escape sequences are not analysed at all other than
to make sure that they are in the form \uXXXX. But in 9.3 many of the
new operators and functions try to turn JSON text values into text in
the server encoding, and this includes de-escaping Unicode escape
sequences. This processing had not taken into account the possibility
that this might contain a surrogate pair to designate a character
outside the BMP. That is now handled correctly.

This also enforces correct use of surrogate pairs, something that is not
done by the type's input routines. This fact is noted in the docs.
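
To make the surrogate-pair arithmetic concrete, here is a minimal sketch of how two \uXXXX escapes can be folded into one code point outside the BMP. The helper names are hypothetical and are not taken from json.c; returning 0 for an invalid pair is a simplification chosen for this illustration, not a description of how the server reports the error.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not the names used in json.c. */
static int is_high_surrogate(uint32_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint32_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/*
 * Combine a UTF-16 surrogate pair into a single Unicode code point.
 * Each half contributes 10 bits, and the result is offset by 0x10000,
 * the first code point above the BMP.
 * Returns 0 if the arguments are not a high surrogate followed by a
 * low surrogate (a simplification made for this sketch).
 */
static uint32_t combine_surrogates(uint32_t hi, uint32_t lo)
{
    if (!is_high_surrogate(hi) || !is_low_surrogate(lo))
        return 0;
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

For example, the JSON escape sequence \ud83d\ude04 corresponds to combine_surrogates(0xD83D, 0xDE04), which yields 0x1F604.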

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/94e3311b97448324d67ba9a527854271373329d9

Modified Files
--------------
doc/src/sgml/func.sgml | 9 +++++++
src/backend/utils/adt/json.c | 52 ++++++++++++++++++++++++++++++++++++
src/test/regress/expected/json.out | 23 ++++++++++++++++
src/test/regress/sql/json.sql | 8 ++++++
4 files changed, 92 insertions(+)

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

#2 Bruce Momjian
bruce@momjian.us
In reply to: Andrew Dunstan (#1)
Re: pgsql: Handle Unicode surrogate pairs correctly when processing JSON.

On Sat, Jun 8, 2013 at 01:21:20PM +0000, Andrew Dunstan wrote:

> Handle Unicode surrogate pairs correctly when processing JSON.
> [...]

Does this affect any data already stored in PG 9.3 beta? Is it
something that should require a catalog bump?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


#3 Andrew Dunstan
andrew@dunslane.net
In reply to: Bruce Momjian (#2)
Re: pgsql: Handle Unicode surrogate pairs correctly when processing JSON.

On 06/24/2013 11:50 AM, Bruce Momjian wrote:

> Does this affect any data already stored in PG 9.3 beta? Is it
> something that should require a catalog bump?

No and no. All it means is that where we previously extracted data
encoded with surrogate pairs incorrectly, we now do it correctly. Only
the processing functions enforce this. For legacy reasons the input
routines don't enforce correct use of surrogate pairs, or indeed of any
Unicode escapes, as long as they are in the form \uXXXX.

cheers

andrew
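
The difference between the two levels of checking described in this reply can be sketched as follows. This is an illustration only: both function names are hypothetical, and the code is not the logic in json.c. The first function performs the purely syntactic \uXXXX check attributed to the input routines; the second performs the stricter pairing check attributed to the processing functions, applied here to a list of already-decoded 16-bit escape values.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Lax check: does s begin with a backslash, 'u', and four hex digits? */
static int has_uxxxx_form(const char *s)
{
    if (s[0] != '\\' || s[1] != 'u')
        return 0;
    for (int i = 2; i < 6; i++)
        if (!isxdigit((unsigned char) s[i]))   /* also stops at '\0' */
            return 0;
    return 1;
}

/*
 * Strict check: every high surrogate must be followed immediately by a
 * low surrogate, and a low surrogate must never appear on its own.
 */
static int surrogates_well_formed(const uint32_t *units, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (units[i] >= 0xD800 && units[i] <= 0xDBFF)
        {
            if (i + 1 >= n || units[i + 1] < 0xDC00 || units[i + 1] > 0xDFFF)
                return 0;
            i++;                    /* skip the low half of the pair */
        }
        else if (units[i] >= 0xDC00 && units[i] <= 0xDFFF)
            return 0;               /* low surrogate without a high one */
    }
    return 1;
}
```

Under this sketch, a lone \ud83d passes has_uxxxx_form but fails surrogates_well_formed, which mirrors the situation described above: such a value can be stored, but functions that de-escape it will reject it.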
