JSON and unicode surrogate pairs

Started by Andrew Dunstan · 34 messages · pgsql-hackers
#1 Andrew Dunstan
andrew@dunslane.net

In 9.2, the JSON parser didn't check the validity of the use of unicode
escapes other than that it required 4 hex digits to follow '\u'. In 9.3,
that is still the case. However, the JSON accessor functions and
operators also try to turn JSON strings into text in the server
encoding, and this includes de-escaping \u sequences. This works fine
except when there is a pair of sequences representing a UTF-16 type
surrogate pair, something that is explicitly permitted in the JSON spec.

The attached patch is an attempt to remedy that, and a surrogate pair is
turned into the correct code point before converting it to whatever the
server encoding is.
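For reference, combining a UTF-16 surrogate pair into a single code point is simple arithmetic; a sketch in Python (the function name is mine, not the patch's):

```python
def combine_surrogates(hi, lo):
    """Combine a UTF-16 surrogate pair into one Unicode code point.

    hi must be a high (lead) surrogate in 0xD800-0xDBFF,
    lo a low (trail) surrogate in 0xDC00-0xDFFF.
    """
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# "\ud83d\ude04" is the escaped form of U+1F604 (a smiling-face emoji)
assert combine_surrogates(0xD83D, 0xDE04) == 0x1F604
```

The resulting code point is then what gets converted to the server encoding.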

Note that this would mean we can still put JSON with incorrect use of
surrogates into the database, as now (9.2 and later), and they will
cause almost all the accessor functions to raise an error, as now (9.3).
All this does is allow JSON that uses surrogates correctly not to fail
when applying the accessor functions and operators. That's a possible
violation of POLA, and at least worthy of a note in the docs, but I'm not
sure what else we can do now - adding this check to the input lexer
would possibly cause restores to fail, which users might not thank us for.

cheers

andrew

Attachments:

json-surrogate.patch (text/x-patch, +52 -0)
#2 Robert Haas
robertmhaas@gmail.com
In reply to: Andrew Dunstan (#1)
Re: JSON and unicode surrogate pairs

On Wed, Jun 5, 2013 at 10:46 AM, Andrew Dunstan <andrew@dunslane.net> wrote:

In 9.2, the JSON parser didn't check the validity of the use of unicode
escapes other than that it required 4 hex digits to follow '\u'. In 9.3,
that is still the case. However, the JSON accessor functions and operators
also try to turn JSON strings into text in the server encoding, and this
includes de-escaping \u sequences. This works fine except when there is a
pair of sequences representing a UTF-16 type surrogate pair, something that
is explicitly permitted in the JSON spec.

The attached patch is an attempt to remedy that, and a surrogate pair is
turned into the correct code point before converting it to whatever the
server encoding is.

Note that this would mean we can still put JSON with incorrect use of
surrogates into the database, as now (9.2 and later), and they will cause
almost all the accessor functions to raise an error, as now (9.3). All this
does is allow JSON that uses surrogates correctly not to fail when applying
the accessor functions and operators. That's a possible violation of POLA,
and at least worthy of a note in the docs, but I'm not sure what else we can
do now - adding this check to the input lexer would possibly cause restores
to fail, which users might not thank us for.

I think the approach you've proposed here is a good one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#2)
Re: JSON and unicode surrogate pairs

On 06/06/2013 12:53 PM, Robert Haas wrote:

On Wed, Jun 5, 2013 at 10:46 AM, Andrew Dunstan <andrew@dunslane.net> wrote:

In 9.2, the JSON parser didn't check the validity of the use of unicode
escapes other than that it required 4 hex digits to follow '\u'. In 9.3,
that is still the case. However, the JSON accessor functions and operators
also try to turn JSON strings into text in the server encoding, and this
includes de-escaping \u sequences. This works fine except when there is a
pair of sequences representing a UTF-16 type surrogate pair, something that
is explicitly permitted in the JSON spec.

The attached patch is an attempt to remedy that, and a surrogate pair is
turned into the correct code point before converting it to whatever the
server encoding is.

Note that this would mean we can still put JSON with incorrect use of
surrogates into the database, as now (9.2 and later), and they will cause
almost all the accessor functions to raise an error, as now (9.3). All this
does is allow JSON that uses surrogates correctly not to fail when applying
the accessor functions and operators. That's a possible violation of POLA,
and at least worthy of a note in the docs, but I'm not sure what else we can
do now - adding this check to the input lexer would possibly cause restores
to fail, which users might not thank us for.

I think the approach you've proposed here is a good one.

I did that, but it's evident from the buildfarm that there's more work
to do. The problem is that we do the de-escaping as we lex the json to
construct the look ahead token, and at that stage we don't know whether
or not it's really going to be needed. That means we can cause errors to
be raised in far too many places. It's failing on this line:

converted = pg_any_to_server(utf8str, utf8len, PG_UTF8);

even though the operator in use ("->") doesn't even use the de-escaped
value.

The real solution is going to be to delay the de-escaping of the string
until it is known to be wanted. That's unfortunately going to be a bit
invasive, but I can't see a better solution. I'll work on it ASAP.
Getting it to work well without a small API change might be pretty hard,
though.
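The delayed de-escaping described above can be modeled as lazy evaluation; a toy sketch in Python (names are mine, not the server's code, which of course works on C strings):

```python
class JsonToken:
    """Keep the raw lexeme; only de-escape when text is demanded."""

    def __init__(self, raw):
        self.raw = raw          # e.g. '\\u00a9' exactly as lexed
        self._text = None

    def as_text(self):
        if self._text is None:
            # unicode_escape handles \uXXXX sequences; real code must
            # also validate surrogate pairing and encoding conversion
            self._text = self.raw.encode("ascii").decode("unicode_escape")
        return self._text

tok = JsonToken("\\u00a9")
# Lexing alone never triggers a conversion error...
assert tok._text is None
# ...only an explicit request for the text form does.
assert tok.as_text() == "\u00a9"
```

An operator like "->" would then simply never call as_text(), so escapes it does not need can never raise.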

cheers

andrew


#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#3)
Re: JSON and unicode surrogate pairs

Andrew Dunstan <andrew@dunslane.net> writes:

I did that, but it's evident from the buildfarm that there's more work
to do. The problem is that we do the de-escaping as we lex the json to
construct the look ahead token, and at that stage we don't know whether
or not it's really going to be needed. That means we can cause errors to
be raised in far too many places. It's failing on this line:
converted = pg_any_to_server(utf8str, utf8len, PG_UTF8);
even though the operator in use ("->") doesn't even use the de-escaped
value.

The real solution is going to be to delay the de-escaping of the string
until it is known to be wanted. That's unfortunately going to be a bit
invasive, but I can't see a better solution. I'll work on it ASAP.

Not sure that this idea isn't a dead end. IIUC, you're proposing to
jump through hoops in order to avoid complaining about illegal JSON
data, essentially just for backwards compatibility with 9.2's failure to
complain about it. If we switch over to a pre-parsed (binary) storage
format for JSON values, won't we be forced to throw these errors anyway?
If so, maybe we should just take the compatibility hit now while there's
still a relatively small amount of stored JSON data in the wild.

regards, tom lane


#5 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#4)
Re: JSON and unicode surrogate pairs

On 06/09/2013 07:47 PM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

I did that, but it's evident from the buildfarm that there's more work
to do. The problem is that we do the de-escaping as we lex the json to
construct the look ahead token, and at that stage we don't know whether
or not it's really going to be needed. That means we can cause errors to
be raised in far too many places. It's failing on this line:
converted = pg_any_to_server(utf8str, utf8len, PG_UTF8);
even though the operator in use ("->") doesn't even use the de-escaped
value.
The real solution is going to be to delay the de-escaping of the string
until it is known to be wanted. That's unfortunately going to be a bit
invasive, but I can't see a better solution. I'll work on it ASAP.

Not sure that this idea isn't a dead end. IIUC, you're proposing to
jump through hoops in order to avoid complaining about illegal JSON
data, essentially just for backwards compatibility with 9.2's failure to
complain about it. If we switch over to a pre-parsed (binary) storage
format for JSON values, won't we be forced to throw these errors anyway?
If so, maybe we should just take the compatibility hit now while there's
still a relatively small amount of stored JSON data in the wild.

No, I probably haven't explained it very well. Here is the regression
diff from jacana:

       ERROR:  cannot call json_populate_recordset on a nested object
       -- handling of unicode surrogate pairs
       select json '{ "a":  "\ud83d\ude04\ud83d\udc36" }' -> 'a' as correct;
    !           correct
    ! ----------------------------
    !  "\ud83d\ude04\ud83d\udc36"
    ! (1 row)
    !
       select json '{ "a":  "\ud83d\ud83d" }' -> 'a'; -- 2 high surrogates in a row
       ERROR:  invalid input syntax for type json
       DETAIL:  high order surrogate must not follow a high order surrogate.
    --- 922,928 ----
       ERROR:  cannot call json_populate_recordset on a nested object
       -- handling of unicode surrogate pairs
       select json '{ "a":  "\ud83d\ude04\ud83d\udc36" }' -> 'a' as correct;
    ! ERROR:  character with byte sequence 0xf0 0x9f 0x98 0x84 in encoding "UTF8" has no equivalent in encoding "WIN1252"
       select json '{ "a":  "\ud83d\ud83d" }' -> 'a'; -- 2 high surrogates in a row
       ERROR:  invalid input syntax for type json
       DETAIL:  high order surrogate must not follow a high order surrogate.

The sequence in question is two perfectly valid surrogate pairs.

...

After thinking about this some more I have come to the conclusion that
we should only do any de-escaping of \uxxxx sequences, whether or not
they are for BMP characters, when the server encoding is utf8. For any
other encoding (already a violation of the JSON standard anyway, and
best avoided if you're dealing with JSON) we should just pass them
through even in text output. This will be a simple and very localized fix.
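That rule can be modeled in a few lines; a hedged Python sketch (hypothetical names, not the server's API):

```python
def process_escape(codepoint, server_encoding):
    """De-escape \\uXXXX only in UTF-8 databases; elsewhere,
    emit the escape sequence unchanged in text output."""
    if server_encoding.lower() in ("utf8", "utf-8"):
        return chr(codepoint)
    return "\\u%04x" % codepoint

assert process_escape(0x20AC, "utf8") == "\u20ac"     # real Euro sign
assert process_escape(0x20AC, "latin1") == "\\u20ac"  # escape passed through
```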

We'll still have to deal with this issue when we get to binary storage
of JSON, but that's not something we need to confront today.

cheers

andrew


#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#5)
Re: JSON and unicode surrogate pairs

Andrew Dunstan <andrew@dunslane.net> writes:

After thinking about this some more I have come to the conclusion that
we should only do any de-escaping of \uxxxx sequences, whether or not
they are for BMP characters, when the server encoding is utf8. For any
other encoding, which is already a violation of the JSON standard
anyway, and should be avoided if you're dealing with JSON, we should
just pass them through even in text output. This will be a simple and
very localized fix.

Hmm. I'm not sure that users will like this definition --- it will seem
pretty arbitrary to them that conversion of \u sequences happens in some
databases and not others.

We'll still have to deal with this issue when we get to binary storage
of JSON, but that's not something we need to confront today.

Well, if we have to break backwards compatibility when we try to do
binary storage, we're not going to be happy either. So I think we'd
better have a plan in mind for what will happen then.

regards, tom lane


#7 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#6)
Re: JSON and unicode surrogate pairs

On 06/10/2013 10:18 AM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

After thinking about this some more I have come to the conclusion that
we should only do any de-escaping of \uxxxx sequences, whether or not
they are for BMP characters, when the server encoding is utf8. For any
other encoding, which is already a violation of the JSON standard
anyway, and should be avoided if you're dealing with JSON, we should
just pass them through even in text output. This will be a simple and
very localized fix.

Hmm. I'm not sure that users will like this definition --- it will seem
pretty arbitrary to them that conversion of \u sequences happens in some
databases and not others.

Then what should we do when there is no matching codepoint in the
database encoding? First we'll have to delay the evaluation so it's not
done over-eagerly, and then we'll have to try the conversion and throw
an error if it doesn't work. The second part is what's happening now,
but the delayed evaluation is not.

Or we could abandon the conversion altogether, but that doesn't seem
very friendly either. I suspect the biggest case for people to use these
sequences is where the database is UTF8 but the client encoding is not.

Frankly, if you want to use Unicode escapes, you should really be using
a UTF8 encoded database if at all possible.

We'll still have to deal with this issue when we get to binary storage
of JSON, but that's not something we need to confront today.

Well, if we have to break backwards compatibility when we try to do
binary storage, we're not going to be happy either. So I think we'd
better have a plan in mind for what will happen then.

I don't see any reason why we couldn't store the JSON strings with the
Unicode escape sequences intact in the binary format. What the binary
format buys us is that it has decomposed the JSON into a tree structure,
so instead of parsing the JSON we can just walk the tree, but the leaf
nodes of the tree are still (in the case of the nodes under discussion)
text-like objects.

cheers

andrew


#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#7)
Re: JSON and unicode surrogate pairs

Andrew Dunstan <andrew@dunslane.net> writes:

Or we could abandon the conversion altogether, but that doesn't seem
very friendly either. I suspect the biggest case for people to use these
sequences is where the database is UTF8 but the client encoding is not.

Well, if that's actually the biggest use-case, then maybe we should just
say we're *not* in the business of converting those escapes. That would
make things nice and consistent regardless of the DB encoding, and it
would avoid the problem of being able to input a value and then not
being able to output it again.

It's legal, is it not, to just write the equivalent Unicode character in
the JSON string and not use the escapes? If so I would think that that
would be the most common usage. If someone's writing an escape, they
probably had a reason for doing it that way, and might not appreciate
our overriding their decision.
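It is indeed legal: the spec treats a literal character and its \u escape as the same string value, which Python's json module also shows:

```python
import json

# A literal © and the escaped \u00a9 parse to the same string value.
assert json.loads('"\u00a9"') == "\u00a9"    # literal character in the JSON
assert json.loads('"\\u00a9"') == "\u00a9"   # \u escape in the JSON
```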

regards, tom lane


#9 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#8)
Re: JSON and unicode surrogate pairs

On 06/10/2013 11:43 AM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Or we could abandon the conversion altogether, but that doesn't seem
very friendly either. I suspect the biggest case for people to use these
sequences is where the database is UTF8 but the client encoding is not.

Well, if that's actually the biggest use-case, then maybe we should just
say we're *not* in the business of converting those escapes. That would
make things nice and consistent regardless of the DB encoding, and it
would avoid the problem of being able to input a value and then not
being able to output it again.

It's legal, is it not, to just write the equivalent Unicode character in
the JSON string and not use the escapes? If so I would think that that
would be the most common usage. If someone's writing an escape, they
probably had a reason for doing it that way, and might not appreciate
our overriding their decision.

We never store the converted values in the JSON object, nor do we return
them from functions that return JSON. But many of the functions and
operators that process the JSON have variants that return text instead
of JSON, and in those cases, when the value returned is a JSON string,
we do the following to it:

* strip the outside quotes
* de-escape the various escaped characters (i.e. everything preceded
by a backslash in the railroad diagram for string at
<http://www.json.org/>)

Here's an example of the difference:

andrew=# select '{ "a": "\u00a9"}'::json -> 'a';
 ?column?
----------
 "\u00a9"
(1 row)

andrew=# select '{ "a": "\u00a9"}'::json ->> 'a';
 ?column?
----------
 ©
(1 row)

It's the process of producing the latter that is giving us a headache in
non-UTF8 databases.

... [ more caffeine is consumed ] ...

I have just realized that the problem is actually quite a lot bigger
than that. We also use this value for field name comparison. So, let us
suppose that we have a LATIN1 database and a piece of JSON with a field
name containing the Euro sign ("\u20ac"), a character that is not in
LATIN1. Making that processable so it doesn't blow up would be mighty
tricky and error prone. The non-orthogonality I suggested as a solution
upthread is, by contrast, very small and easy to manage, and not
terribly hard to explain - see attached.

cheers

andrew

Attachments:

json-unicode-esc.patch (text/x-patch, +36 -8)
#10 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: JSON and unicode surrogate pairs

On Mon, Jun 10, 2013 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, if we have to break backwards compatibility when we try to do
binary storage, we're not going to be happy either. So I think we'd
better have a plan in mind for what will happen then.

Who says we're ever going to do any such thing? This was extensively
debated when we added the original type, and I thought that it was
agreed that we might ultimately need both a type that stored JSON as
text and another that stored it as binary. And we might need an
XML-binary type as well. But there are also cases where storing the
data as text is *better*, and I don't see us ever getting rid of that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#11 Hannu Krosing
hannu@tm.ee
In reply to: Robert Haas (#10)
Re: JSON and unicode surrogate pairs

On 06/11/2013 12:07 AM, Robert Haas wrote:

On Mon, Jun 10, 2013 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, if we have to break backwards compatibility when we try to do
binary storage, we're not going to be happy either. So I think we'd
better have a plan in mind for what will happen then.

Who says we're ever going to do any such thing? This was extensively
debated when we added the original type, and I thought that it was
agreed that we might ultimately need both a type that stored JSON as
text and another that stored it as binary.

This is where the compatibility comes in - we do want both to
accept the same textual format.

And we might need an
XML-binary type as well. But there are also cases where storing the
data as text is *better*,

Then use text :)

and I don't see us ever getting rid of that.

While JSON is a "serialisation format", most things people want
to use it for are actually structured types, not their serialisation
to text. The serialisation should happen automatically.

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


#12 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#10)
Re: JSON and unicode surrogate pairs

On 06/10/2013 06:07 PM, Robert Haas wrote:

On Mon, Jun 10, 2013 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, if we have to break backwards compatibility when we try to do
binary storage, we're not going to be happy either. So I think we'd
better have a plan in mind for what will happen then.

Who says we're ever going to do any such thing? This was extensively
debated when we added the original type, and I thought that it was
agreed that we might ultimately need both a type that stored JSON as
text and another that stored it as binary. And we might need an
XML-binary type as well. But there are also cases where storing the
data as text is *better*, and I don't see us ever getting rid of that.

It was discussed at Pgcon as a result of Oleg and Teodor's talk, and at
the Unconference.

But in any case it's moot here. None of what I'm suggesting has anything
to do with the storage representation of JSON, only with how we process
it in whatever form. And none of it will break backwards compatibility
at all.

So, please, let's concentrate on the problem that's actually at hand.

cheers

andrew


#13 Noah Misch
noah@leadboat.com
In reply to: Andrew Dunstan (#7)
Re: JSON and unicode surrogate pairs

On Mon, Jun 10, 2013 at 11:20:13AM -0400, Andrew Dunstan wrote:

On 06/10/2013 10:18 AM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

After thinking about this some more I have come to the conclusion that
we should only do any de-escaping of \uxxxx sequences, whether or not
they are for BMP characters, when the server encoding is utf8. For any
other encoding, which is already a violation of the JSON standard
anyway, and should be avoided if you're dealing with JSON, we should
just pass them through even in text output. This will be a simple and
very localized fix.

Hmm. I'm not sure that users will like this definition --- it will seem
pretty arbitrary to them that conversion of \u sequences happens in some
databases and not others.

Yep. Suppose you have a LATIN1 database. Changing it to a UTF8 database
where everyone uses client_encoding = LATIN1 should not change the semantics
of successful SQL statements. Some statements that fail with one database
encoding will succeed in the other, but a user should not witness a changed
non-error result. (Except functions like decode() that explicitly expose byte
representations.) Having "SELECT '["\u00e4"]'::json ->> 0" emit 'ä' in the
UTF8 database and '\u00e4' in the LATIN1 database would move PostgreSQL in the
wrong direction relative to that ideal.

Then what should we do when there is no matching codepoint in the
database encoding? First we'll have to delay the evaluation so it's not
done over-eagerly, and then we'll have to try the conversion and throw
an error if it doesn't work. The second part is what's happening now,
but the delayed evaluation is not.

+1 for doing it that way.

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


#14 Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#9)
Re: JSON and unicode surrogate pairs

On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote:

It's legal, is it not, to just write the equivalent Unicode character in
the JSON string and not use the escapes? If so I would think that that
would be the most common usage. If someone's writing an escape, they
probably had a reason for doing it that way, and might not appreciate
our overriding their decision.

We never store the converted values in the JSON object, nor do we return
them from functions that return JSON. But many of the functions and
operators that process the JSON have variants that return text instead of
JSON, and in those cases, when the value returned is a JSON string, we do
the following to it:

I have just realized that the problem is actually quite a lot bigger than
that. We also use this value for field name comparison. So, let us suppose
that we have a LATIN1 database and a piece of JSON with a field name
containing the Euro sign ("\u20ac"), a character that is not in LATIN1.
Making that processable so it doesn't blow up would be mighty tricky and
error prone. The non-orthogonality I suggested as a solution upthread is, by
contrast, very small and easy to manage, and not terribly hard to explain -
see attached.

I think this all shows pretty clearly that it was a mistake allowing
json data in the database that we cannot entirely display with the
database's encoding. All the proposed ugly workarounds are only
necessary because we don't throw an error when originally validating the
json.
Even in a utf-8 database you can get errors due to \u unescaping (at
attribute access time, *NOT* at json_in() time) due to invalid
surrogate pairs.

I think this goes contrary to the normal postgres approach of validating
data as strictly as necessary. And I think we are going to regret not
fixing this while there are still relatively few users out there.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#15 Hannu Krosing
hannu@tm.ee
In reply to: Andres Freund (#14)
Re: JSON and unicode surrogate pairs

On 06/11/2013 10:47 AM, Andres Freund wrote:

On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote:

It's legal, is it not, to just write the equivalent Unicode character in
the JSON string and not use the escapes? If so I would think that that
would be the most common usage. If someone's writing an escape, they
probably had a reason for doing it that way, and might not appreciate
our overriding their decision.

We never store the converted values in the JSON object, nor do we return
them from functions that return JSON. But many of the functions and
operators that process the JSON have variants that return text instead of
JSON, and in those cases, when the value returned is a JSON string, we do
the following to it:

I have just realized that the problem is actually quite a lot bigger than
that. We also use this value for field name comparison. So, let us suppose
that we have a LATIN1 database and a piece of JSON with a field name
containing the Euro sign ("\u20ac"), a character that is not in LATIN1.
Making that processable so it doesn't blow up would be mighty tricky and
error prone. The non-orthogonality I suggested as a solution upthread is, by
contrast, very small and easy to manage, and not terribly hard to explain -
see attached.

I think this all shows pretty clearly that it was a mistake allowing
json data in the database that we cannot entirely display with the
database's encoding. All the proposed ugly workarounds are only
necessary because we don't throw an error when originally validating the
json.
Even in a utf-8 database you can get errors due to \u unescaping (at
attribute access time, *NOT* at json_in() time) due to invalid
surrogate pairs.

I think this goes contrary to the normal postgres approach of validating
data as strictly as necessary. And I think we are going to regret not
fixing this while there are still relatively few users out there.

Exactly -

* Allow in only valid JSON.
* Validate all utf8 strings for valid unicode.
* Have one canonical way of outputting unicode: utf8 for utf8 databases,
escaped for all other encodings.
* If you need to store anything else, use text.

Requiring preserving "original text" in a json data field is Not Good!

I fully expect '{"a":1, "a":null, "a":true, "a":"b"}'::json to come out
as '{"a":"b"}'

(I know that currently this is not true and will happen only once I
read in the json value in the client)

For anything else - don't use json, use any text type

If you really need a simple text-validated-as-valid-json-input datatype
then add this.

Call it jsontext or somesuch :)

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


#16 Andrew Dunstan
andrew@dunslane.net
In reply to: Hannu Krosing (#15)
Re: JSON and unicode surrogate pairs

On 06/11/2013 06:53 AM, Hannu Krosing wrote:

On 06/11/2013 10:47 AM, Andres Freund wrote:

On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote:

It's legal, is it not, to just write the equivalent Unicode character in
the JSON string and not use the escapes? If so I would think that that
would be the most common usage. If someone's writing an escape, they
probably had a reason for doing it that way, and might not appreciate
our overriding their decision.

We never store the converted values in the JSON object, nor do we return
them from functions that return JSON. But many of the functions and
operators that process the JSON have variants that return text instead of
JSON, and in those cases, when the value returned is a JSON string, we do
the following to it:

I have just realized that the problem is actually quite a lot bigger than
that. We also use this value for field name comparison. So, let us suppose
that we have a LATIN1 database and a piece of JSON with a field name
containing the Euro sign ("\u20ac"), a character that is not in LATIN1.
Making that processable so it doesn't blow up would be mighty tricky and
error prone. The non-orthogonality I suggested as a solution upthread is, by
contrast, very small and easy to manage, and not terribly hard to explain -
see attached.

I think this all shows pretty clearly that it was a mistake allowing
json data in the database that we cannot entirely display with the
database's encoding. All the proposed ugly workarounds are only
necessary because we don't throw an error when originally validating the
json.
Even in a utf-8 database you can get errors due to \u unescaping (at
attribute access time, *NOT* at json_in() time) due to invalid
surrogate pairs.

I think this goes contrary to the normal postgres approach of validating
data as strictly as necessary. And I think we are going to regret not
fixing this while there are still relatively few users out there.

Exactly -

* Allow in only valid JSON.
* Validate all utf8 strings for valid unicode.
* Have one canonical way of outputting unicode: utf8 for utf8 databases,
escaped for all other encodings.
* If you need to store anything else, use text.

Requiring preserving "original text" in a json data field is Not Good!

I fully expect '{"a":1, "a":null, "a":true, "a":"b"}'::json to come out
as '{"a":"b"}'

(I know that currently this is not true and will happen only once I
read in the json value in the client)

For anything else - don't use json, use any text type

If you really need a simple text-validated-as-valid-json-input datatype
then add this.

Call it jsontext or somesuch :)

Most of this is irrelevant to solving the current problem. We don't have
a greenfields development of JSON - we have an existing type with which
we have to work. Note too that your "utf8 for utf8 databases, escaped
for all other encodings" is exactly the opposite of what Noah and Andres
and Tom are arguing, namely that the database encoding should not affect
the behaviour, and is very close to what I proposed yesterday.

It's a pity that we don't have a non-error producing conversion function
(or if we do that I haven't found it). Then we might adopt a rule for
processing unicode escapes that said "convert unicode escapes to the
database encoding if possible, and if not then emit them unchanged."
which might be a reasonable compromise.
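That compromise rule can be sketched in a few lines of Python (illustrative only; PostgreSQL's real encoding-conversion machinery works differently, and the encoding names here are Python codec names): try to represent the escaped code point in the target encoding, and fall back to re-emitting the escape text verbatim when it cannot be represented:

```python
def render_escape(codepoint, encoding):
    """Convert a \\uXXXX escape to the target encoding if possible;
    otherwise re-emit the original escape text unchanged."""
    ch = chr(codepoint)
    try:
        ch.encode(encoding)            # representable in this encoding?
        return ch
    except UnicodeEncodeError:
        return '\\u%04x' % codepoint   # fall back to the escape itself
```

Under this rule `render_escape(0x20AC, 'utf-8')` gives the Euro sign, while `render_escape(0x20AC, 'latin-1')` gives back the literal text `\u20ac` instead of raising an error.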

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Stefan Drees
stefan@drees.name
In reply to: Hannu Krosing (#15)
Re: JSON and unicode surrogate pairs

On 2013-06-11 12:53 CEST, Hannu Krosing wrote:

On 06/11/2013 10:47 AM, Andres Freund wrote:

On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote:

It's legal, is it not, to just write the equivalent Unicode character in
the JSON string and not use the escapes? If so I would think that that
would be the most common usage. If someone's writing an escape, they
probably had a reason for doing it that way, and might not appreciate
our overriding their decision.

We never store the converted values in the JSON object, nor do we return
them from functions that return JSON. But many of the functions and
operators that process the JSON have variants that return text instead of
JSON, and in those cases, when the value returned is a JSON string, we do
the following to it:

I have just realized that the problem is actually quite a lot bigger than
that. We also use this value for field name comparison. So, let us suppose
that we have a LATIN1 database and a piece of JSON with a field name
containing the Euro sign ("\u20ac"), a character that is not in LATIN1.
Making that processable so it doesn't blow up would be mighty tricky and
error prone. The non-orthogonality I suggested as a solution upthread is, by
contrast, very small and easy to manage, and not terribly hard to explain -
see attached.

I think this all shows pretty clearly that it was a mistake allowing
json data in the database that we cannot entirely display with the
database's encoding. All the proposed ugly workarounds are only
necessary because we don't throw an error when originally validating the
json.
Even in a utf-8 database you can get errors due to \u unescaping (at
attribute access time, *NOT* at json_in() time) due to invalid
surrogate pairs.

I think this goes contrary to the normal postgres approach of validating
data as strictly as necessary. And I think we are going to regret not
fixing this while there are still relatively few users out there.

Exactly -

* allow in only valid JSON.
* Validate all utf8 strings for valid unicode.
* have one canonical way of outputting unicode - utf8 for utf8 databases,
escaped for all other encodings
* If you need to store anything else, use text.

Requiring preserving "original text" in json data field is Not Good!

I fully expect '{"a":1, "a":none, "a":true, "a":"b"}'::json to come out
as '{"a":b"}'

ahem, do you mean instead to give (none -> null and missing '"' inserted
in "answer"):
=# SELECT '{"a":1, "a":null, "a":true, "a":"b"}'::json;
json
--------------------------------------
{"a":"b"}

or only when "stored" in database and subsequently retrieved? The
"original text" in this case was perfectly valid JSON text.

(I know that currently this is not true and will happen only once I
read in the json value in client)

Isn't this a good situation and doesn't this also depend on the storage
representation in the client?

What about this:
=# SELECT '{"measure":"seconds", "measure":42}'::json;
json
--------------------------------------
{"measure":42}

I presume people who are used to storing metadata in "preceding" json object
members with duplicate names would want to decide in the client
requesting the data what to do with the metadata information, and at what
point to "drop" it, wouldn't they :-?)

For anything else - don't use json, use any text type

If you really need a simple text-validated-as-valid-json-input datatype
then add this.

Call it jsontext or somesuch :)

All the best,
Stefan.


#18Hannu Krosing
hannu@tm.ee
In reply to: Andrew Dunstan (#16)
Re: JSON and unicode surrogate pairs

On 06/11/2013 02:41 PM, Andrew Dunstan wrote:

On 06/11/2013 06:53 AM, Hannu Krosing wrote:

On 06/11/2013 10:47 AM, Andres Freund wrote:

On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote:

It's legal, is it not, to just write the equivalent Unicode
character in
the JSON string and not use the escapes? If so I would think that
that
would be the most common usage. If someone's writing an escape, they
probably had a reason for doing it that way, and might not appreciate
our overriding their decision.

We never store the converted values in the JSON object, nor do we
return
them from functions that return JSON. But many of the functions and
operators that process the JSON have variants that return text
instead of
JSON, and in those cases, when the value returned is a JSON string,
we do
the following to it:

I have just realized that the problem is actually quite a lot
bigger than
that. We also use this value for field name comparison. So, let us
suppose
that we have a LATIN1 database and a piece of JSON with a field name
containing the Euro sign ("\u20ac"), a character that is not in
LATIN1.
Making that processable so it doesn't blow up would be mighty
tricky and
error prone. The non-orthogonality I suggested as a solution
upthread is, by
contrast, very small and easy to manage, and not terribly hard to
explain -
see attached.

I think this all shows pretty clearly that it was a mistake allowing
json data in the database that we cannot entirely display with the
database's encoding. All the proposed ugly workarounds are only
necessary because we don't throw an error when originally validating
the
json.
Even in a utf-8 database you can get errors due to \u unescaping (at
attribute access time, *NOT* at json_in() time) due to invalid
surrogate pairs.

I think this goes contrary to the normal postgres approach of validating
data as strictly as necessary. And I think we are going to regret not
fixing this while there are still relatively few users out there.

Exactly -

* allow in only valid JSON.
* Validate all utf8 strings for valid unicode.
* have one canonical way of outputting unicode - utf8 for utf8
databases,
escaped for all other encodings
* If you need to store anything else, use text.

Requiring preserving "original text" in json data field is Not Good!

I fully expect '{"a":1, "a":none, "a":true, "a":"b"}'::json to come out
as '{"a":b"}'

(I know that currently this is not true and will happen only once I
read in the json value in client)

For anything else - don't use json, use any text type

If you really need a simple text-validated-as-valid-json-input datatype
then add this.

Call it jsontext or somesuch :)

Most of this is irrelevant to solving the current problem. We don't
have a
greenfields development of JSON - we have an existing type with which
we have to work.

We do have the current (PostgreSQL v9.2) "json" data type, which is loosely
defined as
"a string which can be fed to a JSON deserializer without errors".

But then we also have a standard.

Sadly the standard is really not about a type, but about serialising a
subset of JavaScript
structures to a utf8 string.

This kind of implies an underlying type which is a structure consisting of
a few basic types and lists and dictionaries/classes based on these but
does not really
specify any type.

Note too that your "utf8 for utf8 databases, escaped for all other
encodings"
is exactly the opposite of what Noah and Andres and Tom are arguing,
namely that the database encoding should not affect the behaviour,

It does not affect behaviour. It makes it possible to encode valid JSON
(utf8) in a client
encoding which may not have the full set of unicode characters.

We can do it here, as we know that by definition JSON _is_ unicode, so
clients have
to be able to process any unicode character. We cannot do this for plain
text, which
has no defined charset outside what client_encoding postulates.

and is very close to what I proposed yesterday.

It's a pity that we don't have a non-error producing conversion function
(or if we do that I haven't found it). Then we might adopt a rule for
processing
unicode escapes that said "convert unicode escapes to the database
encoding

Only when extracting JSON keys or values to text does it make sense to
unescape to the database encoding.

strings inside JSON itself are by definition utf8

if possible, and if not then emit them unchanged." which might be a
reasonable
compromise.

I'd opt for "... and if not then emit them quoted". The default should
be not losing
any data.

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


#19Hannu Krosing
hannu@tm.ee
In reply to: Stefan Drees (#17)
Re: JSON and unicode surrogate pairs

On 06/11/2013 03:08 PM, Stefan Drees wrote:

quiring preserving "original text" in json data field is Not Good!

I fully expect '{"a":1, "a":none, "a":true, "a":"b"}'::json to come out
as '{"a":b"}'

ahem, do you mean instead to give (none -> null and missing '"'
inserted in "answer"):

yeah, mixed up none and null

=# SELECT '{"a":1, "a":null, "a":true, "a":"b"}'::json;
json
--------------------------------------
{"a":"b"}

or only when "stored" in database and subsequently retrieved? The
"original text" in this case was perfectly valid JSON text.

(I know that currently this is not true and will happen only once I
read in the json value in client)

Isn't this a good situation and doesn't this also depend on the
storage representation in the client?

What about this:
=# SELECT '{"measure":"seconds", "measure":42}'::json;
json
--------------------------------------
{"measure":42}

I presume people who are used to storing metadata in "preceding" json
object members with duplicate names would want to decide in the
client requesting the data what to do with the metadata information,
and at what point to "drop" it, wouldn't they :-?)

Seems like blatant misuse of JSON format :)

I assume that as JSON is a _serialisation_ format, it should represent a
data structure, not processing instructions.

I can see no possible JavaScript structure which could produce a duplicate
key when serialised.

And I don't think that any standard JSON reader supports this either.

If you want to store any JavaScript snippets in the database, use text.

Or perhaps pl/v8 :)

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


#20Andrew Dunstan
andrew@dunslane.net
In reply to: Hannu Krosing (#18)
Re: JSON and unicode surrogate pairs

On 06/11/2013 09:16 AM, Hannu Krosing wrote:

It's a pity that we don't have a non-error producing conversion function
(or if we do that I haven't found it). Then we might adopt a rule for
processing
unicode escapes that said "convert unicode escapes to the database
encoding

Only when extracting JSON keys or values to text does it make sense to
unescape to the database encoding.

That's exactly the scenario we are talking about. When emitting JSON the
functions have always emitted unicode escapes as they are in the text,
and will continue to do so.

strings inside JSON itself are by definition utf8

We have deliberately extended that to allow JSON strings to be in any
database server encoding. That was argued back in the 9.2 timeframe and
I am not interested in re-litigating it.

The only issue at hand is how to handle unicode escapes (which in their
string form are pure ASCII) when emitting text strings.

if possible, and if not then emit them unchanged." which might be a
reasonable
compromise.

I'd opt for "... and if not then emit them quoted". The default should
be not losing
any data.

I don't know what this means at all. Quoted how? Let's say I have a
Latin1 database and have the following JSON string: "\u20AC2.00". In a
UTF8 database the text representation of this is €2.00 - what are you
saying it should be in the Latin1 database?
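The dilemma is easy to reproduce outside the server (Python used here purely for illustration): the code point behind \u20AC simply has no representation in Latin-1, so a straight conversion must either fail or substitute something:

```python
s = '\u20ac2.00'   # the de-escaped form of the JSON string "\u20AC2.00"

# In a UTF8 database the conversion is trivial.
print(s.encode('utf-8'))   # b'\xe2\x82\xac2.00'

# In a Latin1 database there is no Euro sign to convert to.
try:
    s.encode('latin-1')
except UnicodeEncodeError as e:
    print('cannot represent:', hex(ord(e.object[e.start])))
```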

cheers

andrew


#21Andrew Dunstan
andrew@dunslane.net
In reply to: Hannu Krosing (#19)
#22Hannu Krosing
hannu@tm.ee
In reply to: Andrew Dunstan (#20)
#23Stefan Drees
stefan@drees.name
In reply to: Hannu Krosing (#19)
#24Hannu Krosing
hannu@tm.ee
In reply to: Andrew Dunstan (#21)
#25Hannu Krosing
hannu@tm.ee
In reply to: Stefan Drees (#23)
#26Andrew Dunstan
andrew@dunslane.net
In reply to: Noah Misch (#13)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#26)
#28Noah Misch
noah@leadboat.com
In reply to: Andrew Dunstan (#26)
#29Andrew Dunstan
andrew@dunslane.net
In reply to: Noah Misch (#28)
#30Noah Misch
noah@leadboat.com
In reply to: Andrew Dunstan (#29)
#31Andrew Dunstan
andrew@dunslane.net
In reply to: Noah Misch (#30)
#32Noah Misch
noah@leadboat.com
In reply to: Andrew Dunstan (#31)
#33Craig Ringer
craig@2ndquadrant.com
In reply to: Andrew Dunstan (#31)
#34Andrew Dunstan
andrew@dunslane.net
In reply to: Craig Ringer (#33)