Unicode escapes in literals

Started by Peter Eisentrautalmost 18 years ago7 messageshackers

peter_e@gmx.net

almost 18 years ago

I would like to add an escape mechanism to PostgreSQL for entering
arbitrary Unicode characters into string literals. We currently only
have the option of entering the character directly via the keyboard or
cut-and-paste, which is difficult for a number of reasons, such as when
the font doesn't have the character, and entering the UTF8-encoded bytes
using the E'...' strings, which is hardly usable.

SQL has the following escape syntax for it:

U&'special character: \xxxx' [ UESCAPE '\' ]

where xxxx is the hexadecimal Unicode codepoint. So this is pretty much
just another variant on what the E'...' syntax does.

The trick is that since we have user-definable encoding conversion
routines, we can't convert the Unicode codepoint to the server encoding
in the scanner stage. I imagine there are two ways to address this:

1. Only support this syntax when the server encoding is UTF8. This
would probably cover most use cases anyway. We could have limited
support for characters in the ASCII range for all server encodings.

2. Convert this syntax to a function call. But that would then create a
lot of inconsistencies, such as needing functional indexes for matches
against what should really be a literal.

I'd be happy to start with UTF8 support only. Other ideas?

Tom Lane

tgl@sss.pgh.pa.us

almost 18 years ago

In reply to: Peter Eisentraut (#1)

Re: Unicode escapes in literals

Peter Eisentraut <peter_e@gmx.net> writes:

SQL has the following escape syntax for it:
U&'special character: \xxxx' [ UESCAPE '\' ]

Man that's ugly. Why the ampersand? How do you propose to distinguish
this from a perfectly legitimate use of the & operator?

2. Convert this syntax to a function call. But that would then create a
lot of inconsistencies, such as needing functional indexes for matches
against what should really be a literal.

Uh, why do you think that? The function could surely be stable, even
immutable if you grant that a database's encoding can't change.

regards, tom lane

Peter Eisentraut

peter_e@gmx.net

almost 18 years ago

In reply to: Tom Lane (#2)

Re: Unicode escapes in literals

Tom Lane wrote:

Peter Eisentraut <peter_e@gmx.net> writes:

SQL has the following escape syntax for it:
U&'special character: \xxxx' [ UESCAPE '\' ]

Man that's ugly. Why the ampersand?

Yeah, excellent question. It seems completely unnecessary, but it is
surely there in the syntax diagram.

How do you propose to distinguish
this from a perfectly legitimate use of the & operator?

Well, technically, there is going to be some conflict, but the practical
impact should be minimal because:

- There are no spaces allowed between U&' . We typically suggest spaces
around binary operators.

- Naming a column "u" might not be terribly common.

- Binary-and with an undecorated string literal is not very common.

Of course, I have no data for these assertions. An inquiry on -general
might give more insight.

2. Convert this syntax to a function call. But that would then create a
lot of inconsistencies, such as needing functional indexes for matches
against what should really be a literal.

Uh, why do you think that? The function could surely be stable, even
immutable if you grant that a database's encoding can't change.

Yeah, true, that would work.

There are some other disadvantages for making a function call. You
couldn't use that kind of literal in any other place where the parser
calls for a string constant: role names, tablespace locations,
passwords, copy delimiters, enum values, function body, file names.

There is also a related feature for Unicode escapes in identifiers, and
it might be good to keep the door open on that.

We could to a dual approach: Convert in the scanner when server encoding
is UTF8, and pass on as function call otherwise. Surely ugly though.

Or pass it on as a separate token type to the analyze phase, but that is
a lot more work.

Others: What use cases do you envision, and what requirements would they
create for this feature?

Tom Lane

tgl@sss.pgh.pa.us

almost 18 years ago

In reply to: Peter Eisentraut (#3)

Re: Unicode escapes in literals

Peter Eisentraut <peter_e@gmx.net> writes:

There are some other disadvantages for making a function call. You
couldn't use that kind of literal in any other place where the parser
calls for a string constant: role names, tablespace locations,
passwords, copy delimiters, enum values, function body, file names.

Good point. I'm okay with supporting the feature only when database
encoding is UTF8.

regards, tom lane

Andrew Sullivan

ajs@commandprompt.com

almost 18 years ago

In reply to: Peter Eisentraut (#3)

Re: Unicode escapes in literals

On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote:

Man that's ugly. Why the ampersand?

Yeah, excellent question. It seems completely unnecessary, but it is
surely there in the syntax diagram.

Probably because many Unicode representations are done with "U+"
followed by 4-6 hexadecimal units, but "+" is problematic for other
reasons (in some vendor's implementation)?

--
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/

Tom Lane

tgl@sss.pgh.pa.us

almost 18 years ago

In reply to: Andrew Sullivan (#5)

Re: Unicode escapes in literals

Andrew Sullivan <ajs@commandprompt.com> writes:

On Thu, Oct 23, 2008 at 06:04:43PM +0300, Peter Eisentraut wrote:

Yeah, excellent question. It seems completely unnecessary, but it is
surely there in the syntax diagram.

Probably because many Unicode representations are done with "U+"
followed by 4-6 hexadecimal units, but "+" is problematic for other
reasons (in some vendor's implementation)?

They could hardly ignore the conflict with the operator interpretation
for +. The committee has now cut themselves off from ever having a
standard operator named &, but I suppose they didn't think ahead to that.

regards, tom lane

Peter Eisentraut

peter_e@gmx.net

over 17 years ago

In reply to: Peter Eisentraut (#1)

Re: Unicode escapes in literals

I wrote:

SQL has the following escape syntax for it:

U&'special character: \xxxx' [ UESCAPE '\' ]

Here is an in-progress patch for this. It still needs updates in the
psql scanner and possibly other scanners. But the server-side
functionality works.

Unicode escapes in literals

Attachments: