jsonb, unicode escapes and escaped backslashes
The following case has just been brought to my attention (look at the
differing number of backslashes):
andrew=# select jsonb '"\\u0000"';
jsonb
----------
"\u0000"
(1 row)
andrew=# select jsonb '"\u0000"';
jsonb
----------
"\u0000"
(1 row)
andrew=# select json '"\u0000"';
json
----------
"\u0000"
(1 row)
andrew=# select json '"\\u0000"';
json
-----------
"\\u0000"
(1 row)
The problem is that jsonb uses the parsed, unescaped value of the
string, while json does not. When the string parser sees the input with
the two backslashes, it outputs a single backslash, and then it encounters
the remaining characters and emits them as is, resulting in a token of
'\u0000'. When it encounters the input with one backslash, it recognizes
a unicode escape, and because it's for U+0000 emits '\u0000'. All other
unicode escapes are resolved, so the only ambiguity on input concerns
this case.
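The partial unescaping described above can be sketched in miniature. This is a hypothetical Python model, not PostgreSQL's actual json_lex_string() (which is C); it exists only to show how resolving every escape except \u0000 makes two distinct inputs collide:

```python
# Toy model of the partial unescaping: every \uXXXX escape is resolved
# EXCEPT \u0000, which is kept in escaped form. Hypothetical sketch,
# not PostgreSQL's json_lex_string().

def lex_string(body):
    """De-escape a JSON string body, special-casing \\u0000."""
    simple = {'\\': '\\', '"': '"', '/': '/', 'b': '\b', 'f': '\f',
              'n': '\n', 'r': '\r', 't': '\t'}
    out, i = [], 0
    while i < len(body):
        if body[i] == '\\' and body[i + 1] == 'u':
            code = body[i + 2:i + 6]
            # the exception that creates the ambiguity:
            out.append('\\u0000' if code == '0000' else chr(int(code, 16)))
            i += 6
        elif body[i] == '\\':
            out.append(simple[body[i + 1]])
            i += 2
        else:
            out.append(body[i])
            i += 1
    return ''.join(out)

# One backslash (a real unicode escape) and two backslashes (a literal
# backslash followed by the text u0000) collapse to the same token:
print(lex_string(r'\u0000'))    # prints \u0000
print(lex_string(r'\\u0000'))   # prints \u0000 -- indistinguishable
```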
Things get worse, though. On output, '\uabcd' for any four hex digits is
recognized as a unicode escape, and thus the backslash is not escaped,
so that we get:
andrew=# select jsonb '"\\uabcd"';
jsonb
----------
"\uabcd"
(1 row)
We could probably fix this fairly easily for non-U+0000 cases by having
jsonb_to_cstring use a different escape_json routine.
But it's a mess, sadly, and I'm not sure what a good fix for the U+0000
case would look like. Maybe we should detect such input and emit a
warning of ambiguity? It's likely to be rare enough, but clearly not as
rare as we'd like, since this is a report from the field.
cheers
andrew
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Jan 21, 2015 at 06:51:34PM -0500, Andrew Dunstan wrote:
The following case has just been brought to my attention (look at the
differing number of backslashes):
andrew=# select jsonb '"\\u0000"';
jsonb
----------
"\u0000"
(1 row)
andrew=# select jsonb '"\u0000"';
jsonb
----------
"\u0000"
(1 row)
A mess indeed. The input is unambiguous, but the jsonb representation can't
distinguish "\u0000" from "\\u0000". Some operations on the original json
type have similar problems, since they use an in-memory binary representation
with the same shortcoming:
[local] test=# select json_array_element_text($$["\u0000"]$$, 0) =
test-# json_array_element_text($$["\\u0000"]$$, 0);
?column?
----------
t
(1 row)
Things get worse, though. On output, '\uabcd' for any four hex digits is
recognized as a unicode escape, and thus the backslash is not escaped, so
that we get:
andrew=# select jsonb '"\\uabcd"';
jsonb
----------
"\uabcd"
(1 row)
We could probably fix this fairly easily for non-U+0000 cases by having
jsonb_to_cstring use a different escape_json routine.
Sounds reasonable. For 9.4.1, before more people upgrade?
But it's a mess, sadly, and I'm not sure what a good fix for the U+0000 case
would look like.
Agreed. When a string unescape algorithm removes some kinds of backslash
escapes and not others, it's nigh inevitable that two semantically-distinct
inputs can yield the same output. json_lex_string() fell into that trap by
making an exception for \u0000. To fix this, the result needs to be fully
unescaped (\u0000 converted to the NUL byte) or retain all backslash escapes.
(Changing that either way is no fun now that an on-disk format is at stake.)
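The point that complete unescaping removes the ambiguity can be seen with any fully-unescaping parser; here is a small check using Python's json module as a stand-in:

```python
import json

# Python's json module, standing in for any fully-unescaping parser,
# keeps the two inputs distinct: complete unescaping leaves no ambiguity.
a = json.loads(r'"\u0000"')    # the escape: one NUL character
b = json.loads(r'"\\u0000"')   # escaped backslash: the six chars \u0000
assert a == '\x00'
assert b == '\\u0000'
assert a != b                  # distinguishable, unlike the jsonb case
```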
Maybe we should detect such input and emit a warning of
ambiguity? It's likely to be rare enough, but clearly not as rare as we'd
like, since this is a report from the field.
Perhaps. Something like "WARNING: jsonb cannot represent \\u0000; reading as
\u0000"? Alas, but I do prefer that to silent data corruption.
Thanks,
nm
On 01/23/2015 02:18 AM, Noah Misch wrote:
On Wed, Jan 21, 2015 at 06:51:34PM -0500, Andrew Dunstan wrote:
The following case has just been brought to my attention (look at the
differing number of backslashes):
andrew=# select jsonb '"\\u0000"';
jsonb
----------
"\u0000"
(1 row)
andrew=# select jsonb '"\u0000"';
jsonb
----------
"\u0000"
(1 row)
A mess indeed. The input is unambiguous, but the jsonb representation can't
distinguish "\u0000" from "\\u0000". Some operations on the original json
type have similar problems, since they use an in-memory binary representation
with the same shortcoming:
[local] test=# select json_array_element_text($$["\u0000"]$$, 0) =
test-# json_array_element_text($$["\\u0000"]$$, 0);
?column?
----------
t
(1 row)
Things get worse, though. On output, '\uabcd' for any four hex digits is
recognized as a unicode escape, and thus the backslash is not escaped, so
that we get:
andrew=# select jsonb '"\\uabcd"';
jsonb
----------
"\uabcd"
(1 row)
We could probably fix this fairly easily for non-U+0000 cases by having
jsonb_to_cstring use a different escape_json routine.
Sounds reasonable. For 9.4.1, before more people upgrade?
But it's a mess, sadly, and I'm not sure what a good fix for the U+0000 case
would look like.
Agreed. When a string unescape algorithm removes some kinds of backslash
escapes and not others, it's nigh inevitable that two semantically-distinct
inputs can yield the same output. json_lex_string() fell into that trap by
making an exception for \u0000. To fix this, the result needs to be fully
unescaped (\u0000 converted to the NUL byte) or retain all backslash escapes.
(Changing that either way is no fun now that an on-disk format is at stake.)
Maybe we should detect such input and emit a warning of
ambiguity? It's likely to be rare enough, but clearly not as rare as we'd
like, since this is a report from the field.
Perhaps. Something like "WARNING: jsonb cannot represent \\u0000; reading as
\u0000"? Alas, but I do prefer that to silent data corruption.
Maybe something like this patch. I have two concerns about it, though.
The first is the possible performance impact of looking for the
offending string in every jsonb input, and the second is that the test
isn't quite right, since input of \\\u0000 doesn't raise this issue -
i.e. the problem arises when u0000 is preceded by an even number of
backslashes.
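The even-number-of-backslashes condition described above could be expressed, for illustration only, as a regex over the raw input (hypothetical sketch; the actual patch works in C over jsonb input):

```python
import re

# Warn only when "u0000" is preceded by an even, nonzero number of
# backslashes: each "\\" pair is one literal backslash, so the "u0000"
# is plain text that will later collide with a real \u0000 escape.
# An odd count means the last backslash starts a genuine unicode escape.
AMBIGUOUS = re.compile(r'(?<!\\)(?:\\\\)+u0000')

assert AMBIGUOUS.search(r'"\\u0000"')        # two backslashes: ambiguous
assert not AMBIGUOUS.search(r'"\u0000"')     # one: a real unicode escape
assert not AMBIGUOUS.search(r'"\\\u0000"')   # three: escaped \ + real escape
assert AMBIGUOUS.search(r'"\\\\u0000"')      # four: ambiguous again
```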
For the moment, maybe I could commit the fix for the non-U+0000 case for
escape_json, and we could think some more about detecting and warning
about the problem strings.
cheers
andrew
Attachments:
jsonb-unicode1.patch (text/x-patch)
On Mon, Jan 26, 2015 at 09:20:54AM -0500, Andrew Dunstan wrote:
On 01/23/2015 02:18 AM, Noah Misch wrote:
On Wed, Jan 21, 2015 at 06:51:34PM -0500, Andrew Dunstan wrote:
We could probably fix this fairly easily for non-U+0000 cases by having
jsonb_to_cstring use a different escape_json routine.
Maybe we should detect such input and emit a warning of
ambiguity? It's likely to be rare enough, but clearly not as rare as we'd
like, since this is a report from the field.
Perhaps. Something like "WARNING: jsonb cannot represent \\u0000; reading as
\u0000"? Alas, but I do prefer that to silent data corruption.
Maybe something like this patch. I have two concerns about it, though. The
first is the possible performance impact of looking for the offending string
in every jsonb input, and the second is that the test isn't quite right,
since input of \\\u0000 doesn't raise this issue - i.e. the problem arises
when u0000 is preceded by an even number of backslashes.
I share your second concern. It is important that this warning not cry wolf;
it should never fire for an input that is actually unaffected.
For the moment, maybe I could commit the fix for the non-U+0000 case for
escape_json, and we could think some more about detecting and warning about
the problem strings.
+1 for splitting development that way. Fixing the use of escape_json() is
objective, but the choices around the warning are more subtle.
On 01/27/2015 12:24 AM, Noah Misch wrote:
For the moment, maybe I could commit the fix for the non-U+0000 case for
escape_json, and we could think some more about detecting and warning about
the problem strings.
+1 for splitting development that way. Fixing the use of escape_json() is
objective, but the choices around the warning are more subtle.
OK, so something like this patch? I'm mildly concerned that you and I
are the only ones commenting on this.
cheers
andrew
Attachments:
jsonb-unicode2.patch (text/x-patch)
Andrew Dunstan <andrew@dunslane.net> writes:
On 01/27/2015 12:24 AM, Noah Misch wrote:
+1 for splitting development that way. Fixing the use of escape_json() is
objective, but the choices around the warning are more subtle.
OK, so something like this patch? I'm mildly concerned that you and I
are the only ones commenting on this.
Doesn't seem to me like this fixes anything. If the content of a jsonb
value is correct, the output will be the same with or without this patch;
and if it's not, this isn't going to really improve matters.
I think coding anything is premature until we decide how we're going to
deal with the fundamental ambiguity.
regards, tom lane
On 01/27/2015 12:23 PM, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 01/27/2015 12:24 AM, Noah Misch wrote:
+1 for splitting development that way. Fixing the use of escape_json() is
objective, but the choices around the warning are more subtle.
OK, so something like this patch? I'm mildly concerned that you and I
are the only ones commenting on this.
Doesn't seem to me like this fixes anything. If the content of a jsonb
value is correct, the output will be the same with or without this patch;
and if it's not, this isn't going to really improve matters.
I think coding anything is premature until we decide how we're going to
deal with the fundamental ambiguity.
The input \\uabcd will be stored correctly as \uabcd, but this will in
turn be rendered as \uabcd, whereas it should be rendered as \\uabcd.
That's what the patch fixes.
There are two problems here and this addresses one of them. The other
problem is the ambiguity regarding \\u0000 and \u0000.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
On 01/27/2015 12:23 PM, Tom Lane wrote:
I think coding anything is premature until we decide how we're going to
deal with the fundamental ambiguity.
The input \\uabcd will be stored correctly as \uabcd, but this will in
turn be rendered as \uabcd, whereas it should be rendered as \\uabcd.
That's what the patch fixes.
There are two problems here and this addresses one of them. The other
problem is the ambiguity regarding \\u0000 and \u0000.
It's the same problem really, and until we have an answer about
what to do with \u0000, I think any patch is half-baked and possibly
counterproductive.
In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
band-aid around it. This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
there yet. If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.
The most obvious way to store such data unambiguously is to just go ahead
and store U+0000 as a NUL byte (\000). The only problem with that is that
then such a string cannot be considered to be a valid value of type TEXT,
which would mean that we'd need to throw an error if we were asked to
convert a JSON field containing such a character to text. I don't
particularly have a problem with that, except possibly for the time cost
of checking for \000 before allowing a conversion to occur. While a
memchr() check might be cheap enough, we could also consider inventing a
new JEntry type code for string-containing-null, so that there's a
distinction in the type system between strings that are coercible to text
and those that are not.
If we went down a path like that, the currently proposed patch would be
quite useless.
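The proposal above can be modeled in a few lines (invented function names; PostgreSQL's real code operates on JEntry/varlena structures, not Python bytes): storage accepts the NUL, and only the conversion to unescaped text performs the memchr()-style check and errors out.

```python
# Toy model of the proposed end state: jsonb stores U+0000 as a real
# NUL byte, and only the unescaped-text conversion path rejects it.

def jsonb_store_string(parsed):
    """After full de-escaping, NUL is stored like any other byte."""
    return parsed.encode('utf-8')

def jsonb_string_as_text(stored):
    """The ->>-style extraction: must reject NUL, which text can't hold."""
    if b'\x00' in stored:  # cheap scan, like memchr(ptr, 0, len)
        raise ValueError('jsonb string contains \\u0000 and cannot be '
                         'converted to text')
    return stored.decode('utf-8')

stored = jsonb_store_string('foo\x00bar')   # storage succeeds
try:
    jsonb_string_as_text(stored)            # only extraction fails
except ValueError as err:
    print(err)
```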
regards, tom lane
On 01/27/2015 01:40 PM, Tom Lane wrote:
In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
band-aid around it. This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
there yet. If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.
Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on
the table what I suggested is unnecessary.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
On 01/27/2015 01:40 PM, Tom Lane wrote:
In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
band-aid around it. This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
there yet. If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.
Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on
the table what I suggested is unnecessary.
Well, we can either fix it now or suffer with a broken representation
forever. I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.
The only other plausible answer seems to be to flat out reject \u0000.
But I assume nobody likes that.
regards, tom lane
On 01/27/2015 02:28 PM, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 01/27/2015 01:40 PM, Tom Lane wrote:
In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
band-aid around it. This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
there yet. If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.
Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on
the table what I suggested is unnecessary.
Well, we can either fix it now or suffer with a broken representation
forever. I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.
The only other plausible answer seems to be to flat out reject \u0000.
But I assume nobody likes that.
I don't think we can be in the business of rejecting valid JSON.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
On 01/27/2015 02:28 PM, Tom Lane wrote:
Well, we can either fix it now or suffer with a broken representation
forever. I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.
The only other plausible answer seems to be to flat out reject \u0000.
But I assume nobody likes that.
I don't think we can be in the business of rejecting valid JSON.
Actually, after studying the code a bit, I wonder if we wouldn't be best
off to do exactly that, at least for 9.4.x. At minimum we're talking
about an API change for JsonSemAction functions (which currently get the
already-de-escaped string as a C string; not gonna work for embedded
nulls). I'm not sure if there are already third-party extensions using
that API, but it seems possible, in which case changing it in a minor
release wouldn't be nice. Even ignoring that risk, making sure
we'd fixed everything seems like more than a day's work, which is as
much as I for one could spare before 9.4.1.
Also, while the idea of throwing error only when a \0 needs to be
converted to text seems logically clean, it looks like that might pretty
much cripple the usability of such values anyhow, because we convert to
text at the drop of a hat. So some investigation and probably additional
work would be needed to ensure you could do at least basic things with
such values. (A function for direct conversion to bytea might be useful
too.)
I think the "it would mean rejecting valid JSON" argument is utter
hogwash. We already reject, eg, "\u00A0" if you're not using a UTF8
encoding. And we reject "1e10000", not because that's invalid JSON
but because of an implementation restriction of our underlying numeric
type. I don't see any moral superiority of that over rejecting "\u0000"
because of an implementation restriction of our underlying text type.
So at this point I propose that we reject \u0000 when de-escaping JSON.
Anybody who's seriously unhappy with that can propose a patch to fix it
properly in 9.5 or later.
We probably need to rethink the re-escaping behavior as well; I'm not
sure if your latest patch is the right answer for that.
regards, tom lane
On Tue, Jan 27, 2015 at 12:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 01/27/2015 12:23 PM, Tom Lane wrote:
I think coding anything is premature until we decide how we're going to
deal with the fundamental ambiguity.
The input \\uabcd will be stored correctly as \uabcd, but this will in
turn be rendered as \uabcd, whereas it should be rendered as \\uabcd.
That's what the patch fixes.
There are two problems here and this addresses one of them. The other
problem is the ambiguity regarding \\u0000 and \u0000.
It's the same problem really, and until we have an answer about
what to do with \u0000, I think any patch is half-baked and possibly
counterproductive.
In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
band-aid around it. This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
there yet. If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.
The most obvious way to store such data unambiguously is to just go ahead
and store U+0000 as a NUL byte (\000). The only problem with that is that
then such a string cannot be considered to be a valid value of type TEXT,
which would mean that we'd need to throw an error if we were asked to
convert a JSON field containing such a character to text.
Hm, does this include text out operations for display purposes? If
so, then any query selecting jsonb objects with null bytes would fail.
How come we have to error out? How about a warning indicating the
string was truncated?
merlin
Merlin Moncure <mmoncure@gmail.com> writes:
On Tue, Jan 27, 2015 at 12:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
band-aid around it. This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
there yet. If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.
The most obvious way to store such data unambiguously is to just go ahead
and store U+0000 as a NUL byte (\000). The only problem with that is that
then such a string cannot be considered to be a valid value of type TEXT,
which would mean that we'd need to throw an error if we were asked to
convert a JSON field containing such a character to text.
Hm, does this include text out operations for display purposes? If
so, then any query selecting jsonb objects with null bytes would fail.
How come we have to error out? How about a warning indicating the
string was truncated?
No, that's not a problem, because jsonb_out would produce an escaped
textual representation, so any null would come out as \u0000 again.
The trouble comes up when you do something that's supposed to produce
a *non escaped* text equivalent of a JSON string value, such as
the ->> operator.
Arguably, ->> is broken already with the current coding, in that
these results are entirely inconsistent:
regression=# select '{"a":"foo\u0040bar"}'::jsonb ->> 'a';
?column?
----------
foo@bar
(1 row)
regression=# select '{"a":"foo\u0000bar"}'::jsonb ->> 'a';
?column?
--------------
foo\u0000bar
(1 row)
regression=# select '{"a":"foo\\u0000bar"}'::jsonb ->> 'a';
?column?
--------------
foo\u0000bar
(1 row)
regards, tom lane
On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 01/27/2015 02:28 PM, Tom Lane wrote:
Well, we can either fix it now or suffer with a broken representation
forever. I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.
So at this point I propose that we reject \u0000 when de-escaping JSON.
I would have agreed on 2014-12-09, and this release is the last chance to make
such a change. It is a bold wager that could pay off, but -1 from me anyway.
I can already envision the blog post from the DBA staying on 9.4.0 because
9.4.1 pulled his ability to store U+0000 in jsonb. jsonb was *the* top-billed
9.4 feature, and this thread started with Andrew conveying a field report of a
scenario more obscure than storing U+0000. Therefore, we have to assume many
users will notice the change. This move would also add to the growing
evidence that our .0 releases are really beta(N+1) releases in disguise.
Anybody who's seriously unhappy with that can propose a patch to fix it
properly in 9.5 or later.
Someone can still do that by introducing a V2 of the jsonb binary format and
preserving the ability to read both formats. (Too bad Andres's proposal to
include a format version didn't inform the final format, but we can wing it.)
I agree that storing U+0000 as 0x00 is the best end state.
We probably need to rethink the re-escaping behavior as well; I'm not
sure if your latest patch is the right answer for that.
Yes, we do. No change to the representation of U+0000 is going to fix the
following bug, but that patch does fix it:
[local] test=# select
test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb,
test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb::text::jsonb;
?column? | ?column?
----------+----------
t | f
(1 row)
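The round-trip failure shown above can be modeled outside PostgreSQL. Assuming the stored value is the six literal characters \u05e2, an output routine that fails to escape the backslash produces text that re-parses to a different value (Python used as a stand-in for jsonb's output and input paths):

```python
import json

# Stored value: literal backslash followed by u05e2 (six characters).
stored = '\\u05e2'

# Buggy output path: the backslash is not re-escaped, so the printed
# form reads as a real unicode escape on re-parse.
buggy_output = '"' + stored + '"'
reparsed = json.loads(buggy_output)
assert reparsed == '\u05e2'    # the Hebrew letter ayin, not the input
assert reparsed != stored      # jsonb -> text -> jsonb changed the value

# Fixed output path: proper escaping restores the round trip.
fixed_output = json.dumps(stored)          # produces "\\u05e2"
assert json.loads(fixed_output) == stored
```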
On 01/28/2015 12:50 AM, Noah Misch wrote:
On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 01/27/2015 02:28 PM, Tom Lane wrote:
Well, we can either fix it now or suffer with a broken representation
forever. I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.
So at this point I propose that we reject \u0000 when de-escaping JSON.
I would have agreed on 2014-12-09, and this release is the last chance to make
such a change. It is a bold wager that could pay off, but -1 from me anyway.
I can already envision the blog post from the DBA staying on 9.4.0 because
9.4.1 pulled his ability to store U+0000 in jsonb. jsonb was *the* top-billed
9.4 feature, and this thread started with Andrew conveying a field report of a
scenario more obscure than storing U+0000. Therefore, we have to assume many
users will notice the change. This move would also add to the growing
evidence that our .0 releases are really beta(N+1) releases in disguise.
Anybody who's seriously unhappy with that can propose a patch to fix it
properly in 9.5 or later.
Someone can still do that by introducing a V2 of the jsonb binary format and
preserving the ability to read both formats. (Too bad Andres's proposal to
include a format version didn't inform the final format, but we can wing it.)
I agree that storing U+0000 as 0x00 is the best end state.
We need to make up our minds about this pretty quickly. The more radical
move is likely to involve quite a bit of work, ISTM.
It's not clear to me how we should represent a unicode null. i.e. given
a json of '["foo\u0000bar"]', I get that we'd store the element as
'foo\x00bar', but what is the result of
(jsonb '["foo\u0000bar"]')->>0
It's defined to be text so we can't just shove a binary null in the
middle of it. Do we throw an error?
And I still want to hear more voices on the whole direction we want to
take this.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
It's not clear to me how we should represent a unicode null. i.e. given
a json of '["foo\u0000bar"]', I get that we'd store the element as
'foo\x00bar', but what is the result of
(jsonb '["foo\u0000bar"]')->>0
It's defined to be text so we can't just shove a binary null in the
middle of it. Do we throw an error?
Yes, that is what I was proposing upthread. Obviously, this needs some
thought to ensure that there's *something* useful you can do with a field
containing a nul, but we'd have little choice but to throw an error if
the user asks us to convert such a field to unescaped text.
I'd be a bit inclined to reject nuls in object field names even if we
allow them in field values, since just about everything you can usefully
do with a field name involves regarding it as text.
Another interesting implementation problem is what does indexing do with
such values --- ISTR there's an implicit conversion to C strings in there
too, at least in GIN indexes.
Anyway, there is a significant amount of work involved here, and there's
no way we're getting it done for 9.4.1, or probably 9.4.anything. I think
our only realistic choice right now is to throw error for \u0000 so that
we can preserve our options for doing something useful with it later.
regards, tom lane
Noah Misch <noah@leadboat.com> writes:
On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
So at this point I propose that we reject \u0000 when de-escaping JSON.
I would have agreed on 2014-12-09, and this release is the last chance to make
such a change. It is a bold wager that could pay off, but -1 from me anyway.
You only get to vote -1 if you have a credible alternative. I don't see
one.
I can already envision the blog post from the DBA staying on 9.4.0 because
9.4.1 pulled his ability to store U+0000 in jsonb.
Those will be more or less the same people who bitch about text not
storing NULs; the world has not fallen.
jsonb was *the* top-billed
9.4 feature, and this thread started with Andrew conveying a field report of a
scenario more obscure than storing U+0000.
There is a separate issue, which is that our earlier attempt to make the
world safe for \u0000 actually broke other things. We still need to fix
that, but I think the fix probably consists of reverting that patch and
instead disallowing \u0000.
Anybody who's seriously unhappy with that can propose a patch to fix it
properly in 9.5 or later.
Someone can still do that by introducing a V2 of the jsonb binary format and
preserving the ability to read both formats. (Too bad Andres's proposal to
include a format version didn't inform the final format, but we can wing it.)
I agree that storing U+0000 as 0x00 is the best end state.
We will not need a v2 format, merely values that contain NULs. Existing
data containing \u0000 will be read as those six ASCII characters, which
is not really wrong since that might well be what it was anyway.
We probably need to rethink the re-escaping behavior as well; I'm not
sure if your latest patch is the right answer for that.
Yes, we do. No change to the representation of U+0000 is going to fix the
following bug, but that patch does fix it:
[local] test=# select
test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb,
test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb::text::jsonb;
The cause of this bug is commit 0ad1a816320a2b539a51628e2a0b1e83ff096b1d,
which I'm inclined to think we need to simply revert, not render even
more squirrely.
regards, tom lane
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 01/27/2015 02:28 PM, Tom Lane wrote:
Well, we can either fix it now or suffer with a broken representation
forever. I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.
The only other plausible answer seems to be to flat out reject \u0000.
But I assume nobody likes that.
I don't think we can be in the business of rejecting valid JSON.
Actually, after studying the code a bit, I wonder if we wouldn't be best
off to do exactly that, at least for 9.4.x. At minimum we're talking
about an API change for JsonSemAction functions (which currently get the
already-de-escaped string as a C string; not gonna work for embedded
nulls). I'm not sure if there are already third-party extensions using
that API, but it seems possible, in which case changing it in a minor
release wouldn't be nice. Even ignoring that risk, making sure
we'd fixed everything seems like more than a day's work, which is as
much as I for one could spare before 9.4.1.
So at this point I propose that we reject \u0000 when de-escaping JSON.
Anybody who's seriously unhappy with that can propose a patch to fix it
properly in 9.5 or later.
The hybrid option is to reject the values for 9.4.1 but then commit to
removing that hack and fixing this properly in 9.4.2; we can always call
that release 9.5...
I think the "it would mean rejecting valid JSON" argument is utter
hogwash. We already reject, eg, "\u00A0" if you're not using a UTF8
encoding. And we reject "1e10000", not because that's invalid JSON
but because of an implementation restriction of our underlying numeric
type. I don't see any moral superiority of that over rejecting "\u0000"
because of an implementation restriction of our underlying text type.
Am I missing something or has there been no consideration in this "forbid"
plan on whether users will be able to retrieve, even if partially
incorrectly, any jsonb data that has already been stored? If we mangled
their data on input we should at least return the data and provide them a
chance to manually (or automatically depending on their data) fix our
mistake.
Given we already disallow NUL in text ISTM that allowing said data in other
text-like areas is asking for just the kind of trouble we are seeing here.
I'm OK with the proposition that those wishing to utilize NUL are relegated
to working with bytea.
From the commit Tom references down-thread:
"However, this led to some perverse results in the case of Unicode
sequences."
Given that said results are not documented in the commit, it's hard to judge
whether a complete revert is being overly broad...
Anyway, just some observations since I'm not currently a user of JSON.
Tom's arguments and counter-arguments ring true to me in the general sense.
The DBA staying on 9.4.0 because of this change probably just needs to be
told to go back to using "json" and then run the update. Their data has
issues even if they stay on 9.4.0 with the more accepting version of jsonb.
David J.
David G Johnston <david.g.johnston@gmail.com> writes:
Am I missing something or has there been no consideration in this "forbid"
plan on whether users will be able to retrieve, even if partially
incorrectly, any jsonb data that has already been stored?
Data that's already been stored would look like the six characters \u0000,
which indeed might have been what it was anyway, since part of the
problem here is that we fail to distinguish "\\u0000" from "\u0000".
regards, tom lane