Fix parsing of identifiers in jsonpath

Started by Nikita Glukhovover 6 years ago4 messageshackers

n.gluhov@postgrespro.ru

over 6 years ago

Hi!

Unfortunately, jsonpath lexer, in contrast to jsonpath parser, was written by
Teodor and me without a proper attention to the stanard. JSON path lexics is
is borrowed from the external ECMAScript [1]https://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-lexical-grammar, and we did not study it carefully.

There were numerous deviations from the ECMAScript standard in our jsonpath
implementation that were mostly fixed in the attached patch:

1. Identifiers (unquoted JSON key names) should start from the one of (see [2]https://www.ecma-international.org/ecma-262/10.0/index.html#sec-names-and-keywords):
- Unicode symbol having Unicode property "ID_Start" (see [3]https://unicode.org/reports/tr31/)
- Unicode escape sequence '\uXXXX' or '\u{X...}'
- '$'
- '_'

And they should continue with the one of:
- Unicode symbol having Unicode property "ID_Continue" (see [3]https://unicode.org/reports/tr31/)
- Unicode escape sequence
- '$'
- ZWNJ
- ZWJ

2. '$' is also allowed inside the identifiers, so it is possible to write
something like '$.a$$b'.

3. Variable references '$var' are regular identifiers simply starting from the
'$' sign, and there is no syntax like '$"var"', because quotes are not
allowed in identifiers.

4. Even if the Unicode escape sequence '\uXXXX' is used, it cannot produce
special symbols or whitespace, because the identifiers are displayed without
quoting (i.e. '$\u{20}' is not possible to display as '$" "' or even more as
string '"$ "').

5. All codepoints in '\u{XXXXXX}' greater than 0x10FFFF should be forbidden.

6. 6 single-character escape sequences (\b \t \r \f \n \v) should only be
supported inside quoted strings.

I don't know if it is possible to check Unicode properties "ID_Start" and
"ID_Continue" in Postgres, and what ZWNJ/ZWJ is. Now, identifier's starting
character set is simply determined by the exclusion of all recognized special
characters.

The patch is not so simple, but I believe that it's not too late to fix v12.

[1]: https://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-lexical-grammar
[2]: https://www.ecma-international.org/ecma-262/10.0/index.html#sec-names-and-keywords
[3]: https://unicode.org/reports/tr31/

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Chapman Flack

chap@anastigmatix.net

over 6 years ago

In reply to: Nikita Glukhov (#1)

Re: Fix parsing of identifiers in jsonpath

On 9/18/19 11:10 AM, Nikita Glukhov wrote:

4. Even if the Unicode escape sequence '\uXXXX' is used, it cannot produce
special symbols or whitespace, because the identifiers are displayed
...
I don't know if it is possible to check Unicode properties "ID_Start" and
"ID_Continue" in Postgres, and what ZWNJ/ZWJ is.

ZWNJ and ZWJ are U+200C and U+200D (mentioned in [1]).

Also, it's not just that a Unicode escape sequence can't make a
special symbol or whitespace; it can't make any character that's
not allowed there by the other rules:

"A UnicodeEscapeSequence cannot be used to put a code point into an
IdentifierName that would otherwise be illegal. In other words, if a \
UnicodeEscapeSequence sequence were replaced by the SourceCharacter it
contributes, the result must still be a valid IdentifierName that has
the exact same sequence of SourceCharacter elements as the original
IdentifierName. All interpretations of IdentifierName within this
specification are based upon their actual code points regardless of
whether or not an escape sequence was used to contribute any particular
code point."

A brief glance through src/backend/utils/mb/Unicode shows that the
Makefile does download a bunch of stuff, but maybe not the Unicode
character data that would allow testing ID_Start and ID_Continue?
I'm not sure.

Regards,
-Chap

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: Nikita Glukhov (#1)

Re: Fix parsing of identifiers in jsonpath

Nikita Glukhov <n.gluhov@postgrespro.ru> writes:

I don't know if it is possible to check Unicode properties "ID_Start" and
"ID_Continue" in Postgres, and what ZWNJ/ZWJ is. Now, identifier's starting
character set is simply determined by the exclusion of all recognized special
characters.

TBH, I think you should simply ignore any aspect of any of these standards
that is defined by reference to Unicode. We are not necessarily dealing
with a Unicode character set, so at best, references to things like ZWNJ
are unreachable no-ops in a lot of environments.

As a relevant example, modern SQL defines whitespace in terms of Unicode[1]cf 4.2.4 "Character repertoires" in SQL:2011,
a fact that we have ignored from the start and will likely continue to
do so.

You could do a lot worse than to just consider identifiers to be the same
strings as our SQL lexer would do (modulo things like "$" that have
special status in the path language).

regards, tom lane

[1]: cf 4.2.4 "Character repertoires" in SQL:2011

Nikita Glukhov

n.gluhov@postgrespro.ru

over 6 years ago

In reply to: Nikita Glukhov (#1)

Re: Fix parsing of identifiers in jsonpath

Attached v2 patch rebased onto current master.

On 18.09.2019 18:10, Nikita Glukhov wrote:

Unfortunately, jsonpath lexer, in contrast to jsonpath parser, was written by
Teodor and me without a proper attention to the stanard. JSON path lexics is
is borrowed from the external ECMAScript [1], and we did not study it carefully.

There were numerous deviations from the ECMAScript standard in our jsonpath
implementation that were mostly fixed in the attached patch:

1. Identifiers (unquoted JSON key names) should start from the one of (see [2]):
- Unicode symbol having Unicode property "ID_Start" (see [3])
- Unicode escape sequence '\uXXXX' or '\u{X...}'
- '$'
- '_'

And they should continue with the one of:
- Unicode symbol having Unicode property "ID_Continue" (see [3])
- Unicode escape sequence
- '$'
- ZWNJ
- ZWJ

2. '$' is also allowed inside the identifiers, so it is possible to write
something like '$.a$$b'.

3. Variable references '$var' are regular identifiers simply starting from the
'$' sign, and there is no syntax like '$"var"', because quotes are not
allowed in identifiers.

4. Even if the Unicode escape sequence '\uXXXX' is used, it cannot produce
special symbols or whitespace, because the identifiers are displayed without
quoting (i.e. '$\u{20}' is not possible to display as '$" "' or even more as
string '"$ "').

5. All codepoints in '\u{XXXXXX}' greater than 0x10FFFF should be forbidden.

6. 6 single-character escape sequences (\b \t \r \f \n \v) should only be
supported inside quoted strings.

I don't know if it is possible to check Unicode properties "ID_Start" and
"ID_Continue" in Postgres, and what ZWNJ/ZWJ is. Now, identifier's starting
character set is simply determined by the exclusion of all recognized special
characters.

The patch is not so simple, but I believe that it's not too late to fix v12.

[1]https://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-lexical-grammar
[2]https://www.ecma-international.org/ecma-262/10.0/index.html#sec-names-and-keywords
[3]https://unicode.org/reports/tr31/

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Fix parsing of identifiers in jsonpath

Attachments:

Attachments: