Unicode escapes with any backend encoding

Started by Tom Lane · about 6 years ago · 16 messages · pgsql-hackers
#1 Tom Lane
tgl@sss.pgh.pa.us

I threatened to do this in another thread [1], so here it is.

This patch removes the restriction that the server encoding must
be UTF-8 in order to write any Unicode escape with a value outside
the ASCII range. Instead, we'll allow the notation and convert to
the server encoding if that's possible. (If it isn't, of course
you get an encoding conversion failure.)

In the cases that were already supported, namely ASCII characters
or UTF-8 server encoding, this should be only immeasurably slower
than before. Otherwise, it calls the appropriate encoding conversion
procedure, which of course will take a little time. But that's
better than failing, surely.
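As a rough model of the new behavior (not the patch itself, just a sketch in which Python's codecs stand in for the server's encoding conversion procedures, and the function name is made up for illustration):

```python
def unicode_escape_to_server(codepoint: int, server_encoding: str) -> bytes:
    """Model of resolving a \\uXXXX escape under an arbitrary server encoding.

    The server first converts the codepoint to UTF-8 and then, if the server
    encoding is not UTF-8, runs the usual encoding conversion; Python's codecs
    stand in for that conversion machinery here.
    """
    utf8 = chr(codepoint).encode("utf-8")   # unicode_to_utf8() analogue
    if server_encoding.lower() in ("utf-8", "utf8"):
        return utf8                         # previously-supported fast path
    # Otherwise convert; an unrepresentable character raises, which models
    # the server's encoding conversion failure.
    return utf8.decode("utf-8").encode(server_encoding)
```

For example, U+00E9 resolves to one byte under LATIN1 and two bytes under UTF-8, while a character outside LATIN1's repertoire fails with a conversion error rather than a blanket syntax error.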

One way in which this is slightly less good than before is that
you no longer get a syntax error cursor pointing at the problematic
escape when conversion fails. If we were really excited about that,
something could be done with setting up an errcontext stack entry.
But that would add a few cycles, so I wasn't sure whether to do it.

Grepping for other direct uses of unicode_to_utf8(), I notice that
there are a couple of places in the JSON code where we have a similar
restriction that you can only write a Unicode escape in UTF8 server
encoding. I'm not sure whether these same semantics could be
applied there, so I didn't touch that.

Thoughts?

regards, tom lane

[1]: /messages/by-id/CACPNZCvaoa3EgVWm5yZhcSTX6RAtaLgniCPcBVOCwm8h3xpWkw@mail.gmail.com

Attachments:

unicode-escapes-with-other-server-encodings-1.patch (text/x-diff; +193 −105)
#2 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#1)
Re: Unicode escapes with any backend encoding

On Tue, Jan 14, 2020 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Grepping for other direct uses of unicode_to_utf8(), I notice that
there are a couple of places in the JSON code where we have a similar
restriction that you can only write a Unicode escape in UTF8 server
encoding. I'm not sure whether these same semantics could be
applied there, so I didn't touch that.

Off the cuff I'd be inclined to say we should keep the text escape
rules the same. We've already extended the JSON standard by allowing
non-UTF8 encodings.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#2)
Re: Unicode escapes with any backend encoding

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

On Tue, Jan 14, 2020 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Grepping for other direct uses of unicode_to_utf8(), I notice that
there are a couple of places in the JSON code where we have a similar
restriction that you can only write a Unicode escape in UTF8 server
encoding. I'm not sure whether these same semantics could be
applied there, so I didn't touch that.

Off the cuff I'd be inclined to say we should keep the text escape
rules the same. We've already extended the JSON standard by allowing
non-UTF8 encodings.

Right. I'm just thinking though that if you can write "é" literally
in a JSON string, even though you're using LATIN1 not UTF8, then why
not allow writing that as "\u00E9" instead? The latter is arguably
truer to spec.

However, if JSONB collapses "\u00E9" to LATIN1 "é", that would be bad,
unless we have a way to undo it on printout. So there might be
some more moving parts here than I thought.

regards, tom lane

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#3)
Re: Unicode escapes with any backend encoding

I wrote:

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

On Tue, Jan 14, 2020 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Grepping for other direct uses of unicode_to_utf8(), I notice that
there are a couple of places in the JSON code where we have a similar
restriction that you can only write a Unicode escape in UTF8 server
encoding. I'm not sure whether these same semantics could be
applied there, so I didn't touch that.

Off the cuff I'd be inclined to say we should keep the text escape
rules the same. We've already extended the JSON standard by allowing
non-UTF8 encodings.

Right. I'm just thinking though that if you can write "é" literally
in a JSON string, even though you're using LATIN1 not UTF8, then why
not allow writing that as "\u00E9" instead? The latter is arguably
truer to spec.
However, if JSONB collapses "\u00E9" to LATIN1 "é", that would be bad,
unless we have a way to undo it on printout. So there might be
some more moving parts here than I thought.

On third thought, what would be so bad about that? Let's suppose
I write:

INSERT ... values('{"x": "\u00E9"}'::jsonb);

and the jsonb parsing logic chooses to collapse the backslash to
the represented character, i.e., "é". Why should it matter whether
the database encoding is UTF8 or LATIN1? If I am using UTF8
client encoding, I will see the "é" in UTF8 encoding either way,
because of output encoding conversion. If I am using LATIN1
client encoding, I will see the "é" in LATIN1 either way --- or
at least, I will if the database encoding is UTF8. Right now I get
an error for that when the database encoding is LATIN1 ... but if
I store the "é" as literal "é", it works, either way. So it seems
to me that this error is just useless pedantry. As long as the DB
encoding can represent the desired character, it should be transparent
to users.
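The transparency argument can be checked mechanically. A sketch with Python's codecs standing in for the server's conversions (the two encodings named are just the examples from above):

```python
# The character é (U+00E9) is representable in both UTF-8 and LATIN1.
ch = "\u00e9"

for db_encoding in ("utf-8", "latin-1"):
    # Bytes on disk if jsonb collapses \u00E9 to the represented character.
    stored = ch.encode(db_encoding)
    for client_encoding in ("utf-8", "latin-1"):
        # Output conversion: decode from the database encoding, re-encode
        # for the client.  The client-visible bytes depend only on the
        # client encoding, never on the database encoding.
        sent = stored.decode(db_encoding).encode(client_encoding)
        assert sent == ch.encode(client_encoding)
```

All four combinations yield the same client-visible bytes for a given client encoding, which is the sense in which the database encoding should be transparent.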

regards, tom lane

#5 Chapman Flack
chap@anastigmatix.net
In reply to: Tom Lane (#4)
Re: Unicode escapes with any backend encoding

On 1/14/20 10:10 AM, Tom Lane wrote:

to me that this error is just useless pedantry. As long as the DB
encoding can represent the desired character, it should be transparent
to users.

That's my position too.

Regards,
-Chap

#6 Andrew Dunstan
andrew@dunslane.net
In reply to: Chapman Flack (#5)
Re: Unicode escapes with any backend encoding

On Wed, Jan 15, 2020 at 4:25 AM Chapman Flack <chap@anastigmatix.net> wrote:

On 1/14/20 10:10 AM, Tom Lane wrote:

to me that this error is just useless pedantry. As long as the DB
encoding can represent the desired character, it should be transparent
to users.

That's my position too.

and mine.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#6)
Re: Unicode escapes with any backend encoding

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

On Wed, Jan 15, 2020 at 4:25 AM Chapman Flack <chap@anastigmatix.net> wrote:

On 1/14/20 10:10 AM, Tom Lane wrote:

to me that this error is just useless pedantry. As long as the DB
encoding can represent the desired character, it should be transparent
to users.

That's my position too.

and mine.

I'm confused --- yesterday you seemed to be against this idea.
Have you changed your mind?

I'll gladly go change the patch if people are on board with this.

regards, tom lane

#8 Chapman Flack
chap@anastigmatix.net
In reply to: Tom Lane (#7)
Re: Unicode escapes with any backend encoding

On 1/14/20 4:25 PM, Tom Lane wrote:

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

On Wed, Jan 15, 2020 at 4:25 AM Chapman Flack <chap@anastigmatix.net> wrote:

On 1/14/20 10:10 AM, Tom Lane wrote:

to me that this error is just useless pedantry. As long as the DB
encoding can represent the desired character, it should be transparent
to users.

That's my position too.

and mine.

I'm confused --- yesterday you seemed to be against this idea.
Have you changed your mind?

I'll gladly go change the patch if people are on board with this.

Hmm, well, let me clarify for my own part what I think I'm agreeing
with ... perhaps it's misaligned with something further upthread.

In an ideal world (which may be ideal in more ways than are in scope
for the present discussion) I would expect to see these principles:

1. On input, whether a Unicode escape is or isn't allowed should
not depend on any encoding settings. It should be lexically
allowed always, and if it represents a character that exists
in the server encoding, it should mean that character. If it's
not representable in the storage format, it should produce an
error that says that.

2. If it happens that the character is representable in both the
storage encoding and the client encoding, it shouldn't matter
whether it arrives literally as an é or as an escape. Either
should get stored on disk as the same bytes.

3. On output, as long as the character is representable in the client
encoding, there is nothing to worry about. It will be sent as its
representation in the client encoding (which may be different bytes
than its representation in the server encoding).

4. If a character to be output isn't in the client encoding, it
will be datatype-dependent whether there is any way to escape.
For example, xml_out could produce &#x????; forms, and json_out
could produce \u???? forms.

5. If the datatype being output has no escaping rules available
(as would be the case for an ordinary text column, say), then
the unrepresentable character has to be reported in an error.
(Encoding conversions often have the option of substituting
a replacement character like ? but I don't believe a DBMS has
any business making such changes to data, unless by explicit
opt-in. If it can't give you the data you wanted, it should
say "here's why I can't give you that.")

6. While 'text' in general provides no escaping mechanism, some
functions that produce text may still have that option. For
example, quote_literal and quote_ident could conceivably
produce the U&'...' or U&"..." forms, respectively, if
the argument contains characters that won't go in the client
encoding.
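For what it's worth, the datatype-specific fallback in points 4 and 6 is mechanical. A sketch of the JSON case (a hypothetical helper, not anything in the patch) that emits \uXXXX forms, using a UTF-16 surrogate pair for characters outside the Basic Multilingual Plane as RFC 8259 requires:

```python
def json_escape_char(ch: str) -> str:
    """Escape one character as a JSON \\uXXXX sequence.

    Characters beyond U+FFFF must be written as a UTF-16 surrogate pair,
    since JSON's \\u notation only carries four hex digits.
    """
    cp = ord(ch)
    if cp <= 0xFFFF:
        return "\\u%04X" % cp
    # Non-BMP: split into high and low surrogates.
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)
    low = 0xDC00 + (cp & 0x3FF)
    return "\\u%04X\\u%04X" % (high, low)
```

So é becomes \u00E9 and U+10348 becomes the pair \uD800\uDF48; an xml_out-style fallback would differ only in emitting &#x10348; directly.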

I understand that on the way from 1 to 6 I will have drifted
further from what's discussed in this thread; for example, I bet
that quote_literal/quote_ident never produce U& forms now, and
that no one is proposing to change that, and I'm pretending not
to notice the question of how astonishing such behavior could be.
(Not to mention, how would they know whether they are returning
a value that's destined to go across the client encoding, rather
than to be used in a purely server-side expression? Maybe distinct
versions of those functions could take an encoding argument, and
produce the U& forms when the content won't go in the specified
encoding. That would avoid astonishing changes to existing functions.)

Regards,
-Chap

#9 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#7)
Re: Unicode escapes with any backend encoding

On Wed, Jan 15, 2020 at 7:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

On Wed, Jan 15, 2020 at 4:25 AM Chapman Flack <chap@anastigmatix.net> wrote:

On 1/14/20 10:10 AM, Tom Lane wrote:

to me that this error is just useless pedantry. As long as the DB
encoding can represent the desired character, it should be transparent
to users.

That's my position too.

and mine.

I'm confused --- yesterday you seemed to be against this idea.
Have you changed your mind?

I'll gladly go change the patch if people are on board with this.

Perhaps I expressed myself badly. What I meant was that we should keep
the json and text escape rules in sync, as they are now. Since we're
changing the text rules to allow resolvable non-ascii unicode escapes
in non-utf8 locales, we should do the same for json.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#9)
Re: Unicode escapes with any backend encoding

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

Perhaps I expressed myself badly. What I meant was that we should keep
the json and text escape rules in sync, as they are now. Since we're
changing the text rules to allow resolvable non-ascii unicode escapes
in non-utf8 locales, we should do the same for json.

Got it. I'll make the patch do that in a little bit.

regards, tom lane

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#10)
Re: Unicode escapes with any backend encoding

I wrote:

Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:

Perhaps I expressed myself badly. What I meant was that we should keep
the json and text escape rules in sync, as they are now. Since we're
changing the text rules to allow resolvable non-ascii unicode escapes
in non-utf8 locales, we should do the same for json.

Got it. I'll make the patch do that in a little bit.

OK, here's v2, which brings JSONB into the fold and also makes some
effort to produce an accurate error cursor for invalid Unicode escapes.
As it's set up, we only pay the extra cost of setting up an error
context callback when we're actually processing a Unicode escape,
so I think that's an acceptable cost. (It's not much of a cost,
anyway.)

The callback support added here is pretty much a straight copy-and-paste
of the existing functions setup_parser_errposition_callback() and friends.
That's slightly annoying --- we could perhaps merge those into one.
But I didn't see a good common header to put such a thing into, so
I just did it like this.

Another note is that we could use the additional scanner infrastructure
to produce more accurate error pointers for other cases where we're
whining about a bad escape sequence, or some other sub-part of a lexical
token. I think that'd likely be a good idea, since the existing cursor
placement at the start of the token isn't too helpful if e.g. you're
dealing with a very long string constant. But to keep this focused,
I only touched the behavior for Unicode escapes. The rest could be
done as a separate patch.

This also mops up after 7f380c59 by making use of the new pg_wchar.c
exports is_utf16_surrogate_first() etc everyplace that they're relevant
(which is just the JSON code I was touching anyway, as it happens).
I also made a bit of an effort to ensure test coverage of all the
code touched in that patch and this one.

regards, tom lane

Attachments:

unicode-escapes-with-other-server-encodings-2.patch (text/x-diff; +578 −235)
#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#11)
Re: Unicode escapes with any backend encoding

I wrote:

[ unicode-escapes-with-other-server-encodings-2.patch ]

I see this patch got sideswiped by the recent refactoring of JSON
lexing. Here's an attempt at fixing it up. Since the frontend
code isn't going to have access to encoding conversion facilities,
this creates a difference between frontend and backend handling
of JSON Unicode escapes, which is mildly annoying but probably
isn't going to bother anyone in the real world. Outside of
jsonapi.c, there are no changes from v2.

regards, tom lane

Attachments:

unicode-escapes-with-other-server-encodings-3.patch (text/x-diff; +604 −222)
#13 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#12)
Re: Unicode escapes with any backend encoding

On Mon, Feb 24, 2020 at 11:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I see this patch got sideswiped by the recent refactoring of JSON
lexing. Here's an attempt at fixing it up. Since the frontend
code isn't going to have access to encoding conversion facilities,
this creates a difference between frontend and backend handling
of JSON Unicode escapes, which is mildly annoying but probably
isn't going to bother anyone in the real world. Outside of
jsonapi.c, there are no changes from v2.

For the record, as far as JSON goes, I think I'm responsible for the
current set of restrictions, and I'm not attached to them. I believe I
was uncertain of my ability to implement anything better than what we
have now and also slightly unclear on what the semantics ought to be.
I'm happy to see it improved, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14 John Naylor
john.naylor@enterprisedb.com
In reply to: Tom Lane (#12)
Re: Unicode escapes with any backend encoding

On Tue, Feb 25, 2020 at 1:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

[ unicode-escapes-with-other-server-encodings-2.patch ]

I see this patch got sideswiped by the recent refactoring of JSON
lexing. Here's an attempt at fixing it up. Since the frontend
code isn't going to have access to encoding conversion facilities,
this creates a difference between frontend and backend handling
of JSON Unicode escapes, which is mildly annoying but probably
isn't going to bother anyone in the real world. Outside of
jsonapi.c, there are no changes from v2.

With v3, I successfully converted escapes using a database with EUC-KR
encoding, from strings, json, and jsonpath expressions.

Then I ran a raw parsing microbenchmark with ASCII unicode escapes in
UTF-8 to verify no significant regression. I also tried the same with
EUC-KR, even though that's not really apples-to-apples since it
doesn't work on HEAD. It seems to give the same numbers. (median of 3,
done 3 times with postmaster restart in between)

master, UTF-8 ascii
1.390s
1.405s
1.406s

v3, UTF-8 ascii
1.396s
1.388s
1.390s

v3, EUC-KR non-ascii
1.382s
1.401s
1.394s

Not this patch's job perhaps, but now that check_unicode_value() only
depends on the input, maybe it can be put into pg_wchar.h with the other
static inline helper functions? That test is duplicated in
addunicode() and pg_unicode_to_server(). Maybe:

static inline bool
codepoint_is_valid(pg_wchar c)
{
	return (c > 0 && c <= 0x10FFFF);
}

Maybe Chapman has a use case in mind he can test with? Barring that,
the patch seems ready for commit.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Naylor (#14)
Re: Unicode escapes with any backend encoding

John Naylor <john.naylor@2ndquadrant.com> writes:

Not this patch's job perhaps, but now that check_unicode_value() only
depends on the input, maybe it can be put into pg_wchar.h with the other
static inline helper functions? That test is duplicated in
addunicode() and pg_unicode_to_server(). Maybe:

static inline bool
codepoint_is_valid(pg_wchar c)
{
	return (c > 0 && c <= 0x10FFFF);
}

Seems reasonable, done.

Maybe Chapman has a use case in mind he can test with? Barring that,
the patch seems ready for commit.

I went ahead and pushed this, just to get it out of my queue.
Chapman's certainly welcome to kibitz some more of course.

regards, tom lane

#16 Chapman Flack
chap@anastigmatix.net
In reply to: Tom Lane (#15)
Re: Unicode escapes with any backend encoding

On 3/6/20 2:19 PM, Tom Lane wrote:

Maybe Chapman has a use case in mind he can test with? Barring that,
the patch seems ready for commit.

I went ahead and pushed this, just to get it out of my queue.
Chapman's certainly welcome to kibitz some more of course.

Sorry, yeah, I don't think I had any kibitzing to do. My use case
was for an automated SQL generator to confidently emit Unicode-
escaped forms with few required assumptions about the database they'll
be loaded in, subject of course to the natural limitation that its
encoding contain the characters being used, but not to arbitrary
other limits. And unless I misunderstand the patch, it accomplishes
that, thereby depriving me of stuff to kibitz about.

Regards,
-Chap