[rfc] unicode escapes for extended strings

Started by Marko Kreen almost 17 years ago | 26 messages | hackers
#1 Marko Kreen
markokr@gmail.com

Seems I'm bad at communicating in English, so here is a C variant of
my proposal to bring \u escaping into extended strings. Reasons:

- More people are familiar with \u escaping, as it's standard
in Java/C#/Python, probably more..
- U& strings will not work when stdstr=off (standard_conforming_strings = off).

Syntax:

\uXXXX - 16-bit value
\UXXXXXXXX - 32-bit value

Additionally, both \u and \U can be used to specify UTF-16 surrogate
pairs to encode characters with value > 0xFFFF. This is the exact behaviour
used by Java/C#/Python (except that Java does not have \U).
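
A minimal sketch of the surrogate-pair arithmetic this implies. It is not the code from the attached patch; the function names are hypothetical and only the standard UTF-16 ranges are assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if c is a UTF-16 high (leading) surrogate, U+D800..U+DBFF. */
static bool is_high_surrogate(uint32_t c)
{
    return c >= 0xD800 && c <= 0xDBFF;
}

/* True if c is a UTF-16 low (trailing) surrogate, U+DC00..U+DFFF. */
static bool is_low_surrogate(uint32_t c)
{
    return c >= 0xDC00 && c <= 0xDFFF;
}

/*
 * Combine a high/low surrogate pair into one code point in the range
 * U+10000..U+10FFFF. The caller is expected to have checked both halves
 * with the predicates above.
 */
static uint32_t combine_surrogates(uint32_t hi, uint32_t lo)
{
    return 0x10000 + (((hi & 0x3FF) << 10) | (lo & 0x3FF));
}
```

Under this scheme the two escapes \uD83D\uDE00 and the single escape \U0001F600 would denote the same character, since combine_surrogates(0xD83D, 0xDE00) yields 0x1F600.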

I'm ok with this patch left to 8.5.

--
marko

Attachments:

unicode.escape.diff (text/x-patch; charset=US-ASCII), +70 -0

#2 Sam Mason
sam@samason.me.uk
In reply to: Marko Kreen (#1)
Re: [rfc] unicode escapes for extended strings

On Thu, Apr 16, 2009 at 08:48:58PM +0300, Marko Kreen wrote:

Seems I'm bad at communicating in english,

I hope you're not saying this because of my misunderstandings!

so here is C variant of
my proposal to bring \u escaping into extended strings. Reasons:

- More people are familiar with \u escaping, as it's standard
in Java/C#/Python, probably more..
- U& strings will not work when stdstr=off.

Syntax:

\uXXXX - 16-bit value
\UXXXXXXXX - 32-bit value

Additionally, both \u and \U can be used to specify UTF-16 surrogate
pairs to encode characters with value > 0xFFFF. This is exact behaviour
used by Java/C#/Python. (except that Java does not have \U)

Are you sure that this handling of surrogates is correct? The best
answer I've managed to find on the Unicode consortium's site is:

http://unicode.org/faq/utf_bom.html#utf16-7

it says:

They are invalid in interchange, but may be freely used internal to an
implementation.

I think this means they consider the handling of them you noted above,
in other languages, to be an error.

--
Sam http://samason.me.uk/

#3 Andrew Dunstan
andrew@dunslane.net
In reply to: Sam Mason (#2)
Re: [rfc] unicode escapes for extended strings

Sam Mason wrote:

Are you sure that this handling of surrogates is correct? The best
answer I've managed to find on the Unicode consortium's site is:

http://unicode.org/faq/utf_bom.html#utf16-7

it says:

They are invalid in interchange, but may be freely used internal to an
implementation.

It says that about non-characters, not about the use of surrogate pairs,
unless I am misreading it.

cheers

andrew

#4 Sam Mason
sam@samason.me.uk
In reply to: Andrew Dunstan (#3)
Re: [rfc] unicode escapes for extended strings

On Thu, Apr 16, 2009 at 03:04:37PM -0400, Andrew Dunstan wrote:

Sam Mason wrote:

Are you sure that this handling of surrogates is correct? The best
answer I've managed to find on the Unicode consortium's site is:

http://unicode.org/faq/utf_bom.html#utf16-7

it says:

They are invalid in interchange, but may be freely used internal to an
implementation.

It says that about non-characters, not about the use of surrogate pairs,
unless I am misreading it.

No, I think you're probably right and I was misreading it. I went
back and forth several times to explicitly check I was interpreting
this correctly and still failed to get it right. Not sure what I was
thinking and sorry for the hassle Marko!

I've already asked on the Unicode list about this (no response yet), but
I have a feeling I'm getting worked up over nothing.

--
Sam http://samason.me.uk/

#5 Marko Kreen
markokr@gmail.com
In reply to: Sam Mason (#2)
Re: [rfc] unicode escapes for extended strings

On 4/16/09, Sam Mason <sam@samason.me.uk> wrote:

On Thu, Apr 16, 2009 at 08:48:58PM +0300, Marko Kreen wrote:

Seems I'm bad at communicating in english,

I hope you're not saying this because of my misunderstandings!

so here is C variant of
my proposal to bring \u escaping into extended strings. Reasons:

- More people are familiar with \u escaping, as it's standard
in Java/C#/Python, probably more..
- U& strings will not work when stdstr=off.

Syntax:

\uXXXX - 16-bit value
\UXXXXXXXX - 32-bit value

Additionally, both \u and \U can be used to specify UTF-16 surrogate
pairs to encode characters with value > 0xFFFF. This is exact behaviour
used by Java/C#/Python. (except that Java does not have \U)

Are you sure that this handling of surrogates is correct? The best
answer I've managed to find on the Unicode consortium's site is:

http://unicode.org/faq/utf_bom.html#utf16-7

it says:

They are invalid in interchange, but may be freely used internal to an
implementation.

I think this means they consider the handling of them you noted above,
in other languages, to be an error.

It's up to the UTF8 validator whether to treat non-characters as an error.

--
marko

#6 Marko Kreen
markokr@gmail.com
In reply to: Marko Kreen (#5)
Re: [rfc] unicode escapes for extended strings

On 4/16/09, Marko Kreen <markokr@gmail.com> wrote:

It's up to UTF8 validator whether to consider non-characters as error.

I checked, and it did not work well, as addunicode() did not set
the saw_high_bit variable when outputting UTF8. The attached patch fixes it.

Currently it would be a no-op, as pg_verifymbstr() only checks for invalid
UTF8 and addunicode() cannot output that, but in the future we may want to
reject some codes, so now it can.

Btw, is there any good reason why we don't reject \000, \x00
in text strings?

Currently I made addunicode() do it, because it seems sensible.
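
For illustration, a rough sketch of turning one code point into UTF-8 bytes while rejecting zero. This is not the patch's addunicode(); the function name encode_utf8 is hypothetical, and also rejecting lone surrogates and values above U+10FFFF is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Write code point c as UTF-8 into buf, which must hold at least 4 bytes.
 * Returns the number of bytes written, or 0 if c is rejected: zero,
 * a lone surrogate (U+D800..U+DFFF), or a value above U+10FFFF.
 */
static size_t encode_utf8(uint32_t c, unsigned char *buf)
{
    if (c == 0 || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    if (c <= 0x7F)
    {
        buf[0] = (unsigned char) c;
        return 1;
    }
    if (c <= 0x7FF)
    {
        buf[0] = (unsigned char) (0xC0 | (c >> 6));
        buf[1] = (unsigned char) (0x80 | (c & 0x3F));
        return 2;
    }
    if (c <= 0xFFFF)
    {
        buf[0] = (unsigned char) (0xE0 | (c >> 12));
        buf[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
        buf[2] = (unsigned char) (0x80 | (c & 0x3F));
        return 3;
    }
    buf[0] = (unsigned char) (0xF0 | (c >> 18));
    buf[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
    buf[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
    buf[3] = (unsigned char) (0x80 | (c & 0x3F));
    return 4;
}
```

Any result longer than one byte contains bytes above 0x7F, which is the case where a flag like saw_high_bit would need to be set so that the verifier runs.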

--
marko

Attachments:

unicode.escape.v2.diff (text/x-patch; charset=US-ASCII), +73 -0

#7 Martijn van Oosterhout
kleptog@svana.org
In reply to: Marko Kreen (#6)
Re: [rfc] unicode escapes for extended strings

On Fri, Apr 17, 2009 at 07:07:31PM +0300, Marko Kreen wrote:

Btw, is there any good reason why we don't reject \000, \x00
in text strings?

Why forbid nulls in text strings?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.

#8 Sam Mason
sam@samason.me.uk
In reply to: Martijn van Oosterhout (#7)
Re: [rfc] unicode escapes for extended strings

On Fri, Apr 17, 2009 at 07:01:47PM +0200, Martijn van Oosterhout wrote:

On Fri, Apr 17, 2009 at 07:07:31PM +0300, Marko Kreen wrote:

Btw, is there any good reason why we don't reject \000, \x00
in text strings?

Why forbid nulls in text strings?

As far as I know, PG assumes, like most C code, that strings don't
contain embedded NUL characters. The manual[1] has this to say:

The character with the code zero cannot be in a string constant.

I believe you're supposed to use values of type "bytea" when you're
expecting to deal with NUL characters.

--
Sam http://samason.me.uk/

[1]: http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS

#9 Andrew Dunstan
andrew@dunslane.net
In reply to: Marko Kreen (#6)
Re: [rfc] unicode escapes for extended strings

Marko Kreen wrote:

+	if (c > 0x7F)
+	{
+		if (GetDatabaseEncoding() != PG_UTF8)
+			yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8");
+		saw_high_bit = true;
+	}

Is that really what we want to do? ISTM that one of the uses of this is
to say "store the character that corresponds to this Unicode code point
in whatever the database encoding is", so that \u00a9 would become an
encoding independent way of designating the copyright symbol, for instance.

cheers

andrew

#10 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Andrew Dunstan (#9)
Re: [rfc] unicode escapes for extended strings

Andrew Dunstan <andrew@dunslane.net> wrote:

ISTM that one of the uses of this is to say "store the character
that corresponds to this Unicode code point in whatever the database
encoding is"

I would think you're right. As long as the given character is in the
user's character set, we should allow it. Presumably we've already
confirmed that they have an encoding scheme which allows them to store
everything in their character set.

-Kevin

#11 Marko Kreen
markokr@gmail.com
In reply to: Kevin Grittner (#10)
Re: [rfc] unicode escapes for extended strings

On 4/17/09, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:

Andrew Dunstan <andrew@dunslane.net> wrote:

ISTM that one of the uses of this is to say "store the character
that corresponds to this Unicode code point in whatever the database
encoding is"

I would think you're right. As long as the given character is in the
user's character set, we should allow it. Presumably we've already
confirmed that they have an encoding scheme which allows them to store
everything in their character set.

It is probably a good idea, but currently I just followed what the U&
strings do.

I can change my patch to do it, but it is probably more urgent in the U&
case to decide whether they should work in other encodings too.

--
marko

#12 Andrew Dunstan
andrew@dunslane.net
In reply to: Marko Kreen (#11)
Re: [rfc] unicode escapes for extended strings

Marko Kreen wrote:

On 4/17/09, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:

Andrew Dunstan <andrew@dunslane.net> wrote:

ISTM that one of the uses of this is to say "store the character
that corresponds to this Unicode code point in whatever the database
encoding is"

I would think you're right. As long as the given character is in the
user's character set, we should allow it. Presumably we've already
confirmed that they have an encoding scheme which allows them to store
everything in their character set.

It is probably good idea, but currently I just followed what the U&
strings do.

I can change my patch to do it, but it is probably more urgent in U&
case to decide whether they should work in other encodings too.

Indeed. What does the standard say about the behaviour of U&'' ?

cheers

andrew

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#10)
Re: [rfc] unicode escapes for extended strings

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Andrew Dunstan <andrew@dunslane.net> wrote:

ISTM that one of the uses of this is to say "store the character
that corresponds to this Unicode code point in whatever the database
encoding is"

I would think you're right. As long as the given character is in the
user's character set, we should allow it. Presumably we've already
confirmed that they have an encoding scheme which allows them to store
everything in their character set.

This is a good way to get your patch rejected altogether. The lexer
is *not* allowed to invoke any database operations (such as
pg_conversion lookups) so it cannot perform arbitrary encoding
conversions.

If this sort of facility is what you want, the previously suggested
approach via a decode-like runtime function is a better fit.

regards, tom lane

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Sam Mason (#8)
Re: [rfc] unicode escapes for extended strings

Sam Mason <sam@samason.me.uk> writes:

On Fri, Apr 17, 2009 at 07:01:47PM +0200, Martijn van Oosterhout wrote:

On Fri, Apr 17, 2009 at 07:07:31PM +0300, Marko Kreen wrote:

Btw, is there any good reason why we don't reject \000, \x00
in text strings?

Why forbid nulls in text strings?

As far as I know, PG assumes, like most C code, that strings don't
contain embedded NUL characters.

Yeah; we should reject them because nothing will behave very sensibly
with them, eg

regression=# select E'abc\000xyz';
?column?
----------
abc
(1 row)

The point has come up before, and I kinda thought we *had* changed the
lexer to reject \000. I see we haven't though. Curiously, this
does fail:

regression=# select U&'abc\0000xyz';
ERROR: invalid byte sequence for encoding "SQL_ASCII": 0x00
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

though that's not quite the message I'd have expected to see.

regards, tom lane

#15 Marko Kreen
markokr@gmail.com
In reply to: Tom Lane (#13)
Re: [rfc] unicode escapes for extended strings

On 4/18/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Andrew Dunstan <andrew@dunslane.net> wrote:

ISTM that one of the uses of this is to say "store the character
that corresponds to this Unicode code point in whatever the database
encoding is"

I would think you're right. As long as the given character is in the
user's character set, we should allow it. Presumably we've already
confirmed that they have an encoding scheme which allows them to store
everything in their character set.

This is a good way to get your patch rejected altogether. The lexer
is *not* allowed to invoke any database operations (such as
pg_conversion lookups) so it cannot perform arbitrary encoding
conversions.

Ok. I was just thinking that if such conversion can be provided easily,
it should be done. But if not, then no need to make things complex.

It seems the proper way to look at it is that Unicode escapes have a
straightforward meaning only in the UTF8 encoding. So it should be
fine to limit them to ASCII in other encodings.
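
As a rough sketch of that rule: the helper name and the boolean parameter below are hypothetical, whereas the patch itself compares GetDatabaseEncoding() against PG_UTF8.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a Unicode escape for code point c may be used.
 * ASCII values are accepted in every server encoding; anything above
 * 0x7F is accepted only when the server encoding is UTF8, so no
 * encoding conversion is ever needed in the lexer.
 */
static bool unicode_escape_allowed(uint32_t c, bool server_is_utf8)
{
    return c <= 0x7F || server_is_utf8;
}
```

With this check, \u0041 would be accepted in any database, while \u00a9 would be accepted only in a UTF8 database.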

If this sort of facility is what you want, the previously suggested
approach via a decode-like runtime function is a better fit.

I'm a UTF8-only kind of guy, so people who actually have experience
of using other encodings must comment on that one.

--
marko

#16 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#13)
Re: [rfc] unicode escapes for extended strings

Tom Lane <tgl@sss.pgh.pa.us> wrote:

The lexer is *not* allowed to invoke any database operations
(such as pg_conversion lookups)

I certainly hope it's not!

so it cannot perform arbitrary encoding conversions.

I was more questioning whether we should be looking at character
encodings at all at that point, rather than suggesting conversions
between different ones. If committing the escape sequence to a
particular encoding is unavoidable at that point, then I suppose the
code in question is about as good as it gets.

-Kevin

#17 Marko Kreen
markokr@gmail.com
In reply to: Tom Lane (#14)
Re: [rfc] unicode escapes for extended strings

On 4/18/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Sam Mason <sam@samason.me.uk> writes:

On Fri, Apr 17, 2009 at 07:01:47PM +0200, Martijn van Oosterhout wrote:

On Fri, Apr 17, 2009 at 07:07:31PM +0300, Marko Kreen wrote:

Btw, is there any good reason why we don't reject \000, \x00
in text strings?

Why forbid nulls in text strings?

As far as I know, PG assumes, like most C code, that strings don't
contain embedded NUL characters.

Yeah; we should reject them because nothing will behave very sensibly
with them, eg

regression=# select E'abc\000xyz';
?column?
----------
abc
(1 row)

The point has come up before, and I kinda thought we *had* changed the
lexer to reject \000. I see we haven't though. Curiously, this
does fail:

regression=# select U&'abc\0000xyz';
ERROR: invalid byte sequence for encoding "SQL_ASCII": 0x00
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

though that's not quite the message I'd have expected to see.

I think that's because our verifier actually *does* reject \0;
the only problem is that \0 does not set the saw_high_bit flag,
so the verifier simply does not get executed.
But U& always executes it.

unicode=# SELECT e'\xc3\xa4';
?column?
----------
ä
(1 row)

unicode=# SELECT e'\xc3\xa4\x00';
ERROR: invalid byte sequence for encoding "UTF8": 0x00
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".

Heh.
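
The interaction described above can be modelled with a small sketch. It is a simplification, not the lexer's code: the function names are hypothetical, the stand-in verifier checks only for zero bytes, and the flag is assumed to be set by any byte above 0x7F.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in verifier: rejects any zero byte within the first len bytes. */
static bool verify_bytes(const char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] == '\0')
            return false;
    return true;
}

/*
 * Model of the E'' behaviour: the verifier is consulted only when some
 * byte above 0x7F was seen, so a zero byte in an otherwise pure-ASCII
 * literal is never checked and slips through.
 */
static bool literal_accepted(const char *s, size_t len)
{
    bool saw_high_bit = false;

    for (size_t i = 0; i < len; i++)
        if ((unsigned char) s[i] > 0x7F)
            saw_high_bit = true;

    return saw_high_bit ? verify_bytes(s, len) : true;
}
```

In this model the seven bytes of abc\000xyz are accepted because nothing sets the flag, while the three bytes \xc3\xa4\x00 are rejected because the high bytes cause the verifier to run.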

--
marko

#18 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Kreen (#17)
Re: [rfc] unicode escapes for extended strings

Marko Kreen <markokr@gmail.com> writes:

On 4/18/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The point has come up before, and I kinda thought we *had* changed the
lexer to reject \000. I see we haven't though. Curiously, this
does fail:

regression=# select U&'abc\0000xyz';
ERROR: invalid byte sequence for encoding "SQL_ASCII": 0x00

I think that's because our verifier actually *does* reject \0,
only problem is that \0 does not set saw_high_bit flag,
so the verifier simply does not get executed.
But U& executes it always.

I fixed this in HEAD.

regards, tom lane

#19 Marko Kreen
markokr@gmail.com
In reply to: Marko Kreen (#1)
Re: [rfc] unicode escapes for extended strings

Unicode escapes for extended strings.

On 4/16/09, Marko Kreen <markokr@gmail.com> wrote:

Reasons:

- More people are familiar with \u escaping, as it's standard
in Java/C#/Python, probably more..
- U& strings will not work when stdstr=off.

Syntax:

\uXXXX - 16-bit value
\UXXXXXXXX - 32-bit value

Additionally, both \u and \U can be used to specify UTF-16 surrogate
pairs to encode characters with value > 0xFFFF. This is exact behaviour
used by Java/C#/Python. (except that Java does not have \U)

v3 of the patch:

- convert to new reentrant lexer API
- add lexer targets to avoid fallback to default
- completely disallow \U and \u without the proper number of hex digits
- fix logic bug in surrogate pair handling
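
To make the "proper number of hex digits" rule concrete, here is a hedged sketch of a strict reader. The v3 patch enforces this in the lexer rules rather than in a helper like this; the name parse_hex_exact is hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read exactly ndigits hex digits from s and store the value in *out.
 * Returns false if any of those characters is not a hex digit; this
 * includes the terminating NUL of a string that is too short, so the
 * loop never reads past the end of s.
 */
static bool parse_hex_exact(const char *s, size_t ndigits, uint32_t *out)
{
    uint32_t value = 0;

    for (size_t i = 0; i < ndigits; i++)
    {
        unsigned char ch = (unsigned char) s[i];

        if (!isxdigit(ch))
            return false;
        value = value * 16 +
            (uint32_t) (isdigit(ch) ? ch - '0' : tolower(ch) - 'a' + 10);
    }
    *out = value;
    return true;
}
```

A caller would pass ndigits = 4 after \u and ndigits = 8 after \U, and report a syntax error when the function returns false instead of falling back to treating the backslash sequence as ordinary text.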

--
marko

Attachments:

unicode-escapes-v3.diff (text/x-diff; charset=US-ASCII), +88 -0

#20 Peter Eisentraut
peter_e@gmx.net
In reply to: Marko Kreen (#19)
Re: [rfc] unicode escapes for extended strings

On Wed, 2009-09-09 at 18:26 +0300, Marko Kreen wrote:

Unicode escapes for extended strings.

On 4/16/09, Marko Kreen <markokr@gmail.com> wrote:

Reasons:

- More people are familiar with \u escaping, as it's standard
in Java/C#/Python, probably more..
- U& strings will not work when stdstr=off.

Syntax:

\uXXXX - 16-bit value
\UXXXXXXXX - 32-bit value

Additionally, both \u and \U can be used to specify UTF-16 surrogate
pairs to encode characters with value > 0xFFFF. This is exact behaviour
used by Java/C#/Python. (except that Java does not have \U)

v3 of the patch:

- convert to new reentrant lexer API
- add lexer targets to avoid fallback to default
- completely disallow \U\u without proper number of hex values
- fix logic bug in surrogate pair handling

This looks good to me. I'm implementing the surrogate pair handling for
the U& syntax for consistency. Then I'll apply this.

#21 Peter Eisentraut
peter_e@gmx.net
In reply to: Marko Kreen (#19)

#22 Marko Kreen
markokr@gmail.com
In reply to: Peter Eisentraut (#21)

#23 Peter Eisentraut
peter_e@gmx.net
In reply to: Marko Kreen (#22)

#24 tomas@tuxteam.de
tomas@tuxteam.de
In reply to: Peter Eisentraut (#23)

#25 Marko Kreen
markokr@gmail.com
In reply to: tomas@tuxteam.de (#24)

#26 Andrew Dunstan
andrew@dunslane.net
In reply to: Marko Kreen (#25)