Define jsonpath functions as stable
Hi!
During my work on bringing jsonpath patchset to commit, I was always
keeping in mind that we need to make jsonb_path_*() functions
immutable. Having these functions immutable, users can build
expression indexes over them. Naturally, in majority of cases one
doesn't need to index whole json documents, but only some parts of
them. jsonpath provide great facilities to extract indexable parts of
document, much more powerful than our current operator set.
However, we've spotted some deviations between standard and our implementation.
* like_regex predicate uses our regular expression engine, which
deviates from standard.
* We always do numeric computations using numeric datatype. Even if
user explicitly calls .double() method. Probably, our current
implementation still fits standard. But in future we may like to use
floating point computation in some cases for performance optimization.
These deviations don't look critical by itself. But immutable
functions make problematic fixing them in future. Also, I'm not sure
this is complete list of deviations we have. We might have, for
example, hidden deviations in handling strict/lax modes, which are
hard to detect and understand.
Therefore, I'm going to mark jsonb_path_*() functions stable, not
immutable. Nevertheless users will still have multiple options for
indexing:
1) jsonb_path_ops supports jsonpath matching operators in some cases.
2) One can wrap jsonb_path_*() in pl/* function and mark it as
immutable on his own risk. This approach is widely used to build
indexes over to_date()/to_timestamp().
3) We're going to provide support of jsonpath operators in jsquery
extension before release of PostgreSQL 12.
I'd like to note I don't mean we wouldn't ever have immutable
functions for jsonpath evaluation. I think once we sure enough that
we know immutable subset of jsonpath, we may define immutable
functions for its evaluation.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
switch_jsonpath_functions_to_stable.patchapplication/octet-stream; name=switch_jsonpath_functions_to_stable.patchDownload+8-8
Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
During my work on bringing jsonpath patchset to commit, I was always
keeping in mind that we need to make jsonb_path_*() functions
immutable. Having these functions immutable, users can build
expression indexes over them.
Right.
However, we've spotted some deviations between standard and our implementation.
* like_regex predicate uses our regular expression engine, which
deviates from standard.
* We always do numeric computations using numeric datatype. Even if
user explicitly calls .double() method. Probably, our current
implementation still fits standard. But in future we may like to use
floating point computation in some cases for performance optimization.
...
Therefore, I'm going to mark jsonb_path_*() functions stable, not
immutable.
I dunno, I think you are applying a far more rigorous definition of
"immutable" than we ever have in the past. The possibility that we
might change the implementation in the future should not be enough
to disqualify a function from being immutable --- if that were the
criterion, nothing more complex than int4pl could be immutable.
Wouldn't it be better that, in the hypothetical major version where
we change the implementation, we tell users that they must reindex
any affected indexes?
As a comparison point, we allow people to build indexes on tsvector
results, which are *easy* to change just by adjusting configuration
files. The fact that this might force the need for reindexing hasn't
made it unworkable.
regards, tom lane
Hi,
On 7/29/19 10:25 AM, Alexander Korotkov wrote:
* like_regex predicate uses our regular expression engine, which
deviates from standard.
I still favor adding some element to the syntax (like a 'posix' or 'pg'
keyword in the grammar for like_regex) that identifies it as using
a different regexp flavor, so the way forward to a possible compliant
version later is not needlessly blocked (or consigned to a
standard_conforming_strings-like experience).
That would also resolve much of the case against calling that
predicate immutable.
It looks as if, in my first implementation of XQuery regexps, there
will have to be a "not-quite-standard" flag for those too, because
it turns out the SQL committee made some tweaks to XQuery regexps[1]https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL/XML_Standards#XML_Query_regular_expressions,
whereas any XQuery library one relies on is going to provide untweaked
XQuery regexps out of the box. (The differences only affect ^ $ . \s \S)
Regards,
-Chap
[1]: https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL/XML_Standards#XML_Query_regular_expressions
https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL/XML_Standards#XML_Query_regular_expressions
On Mon, Jul 29, 2019 at 5:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
However, we've spotted some deviations between standard and our implementation.
* like_regex predicate uses our regular expression engine, which
deviates from standard.
* We always do numeric computations using numeric datatype. Even if
user explicitly calls .double() method. Probably, our current
implementation still fits standard. But in future we may like to use
floating point computation in some cases for performance optimization.
...
Therefore, I'm going to mark jsonb_path_*() functions stable, not
immutable.I dunno, I think you are applying a far more rigorous definition of
"immutable" than we ever have in the past. The possibility that we
might change the implementation in the future should not be enough
to disqualify a function from being immutable --- if that were the
criterion, nothing more complex than int4pl could be immutable.Wouldn't it be better that, in the hypothetical major version where
we change the implementation, we tell users that they must reindex
any affected indexes?As a comparison point, we allow people to build indexes on tsvector
results, which are *easy* to change just by adjusting configuration
files. The fact that this might force the need for reindexing hasn't
made it unworkable.
Thank you for the explanation. Given that there is no need to mark
existing json_path_*() functions as stable. We can just advise users
to rebuild their indexes if we have incompatible changes.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Jul 29, 2019 at 5:55 PM Chapman Flack <chap@anastigmatix.net> wrote:
On 7/29/19 10:25 AM, Alexander Korotkov wrote:
* like_regex predicate uses our regular expression engine, which
deviates from standard.I still favor adding some element to the syntax (like a 'posix' or 'pg'
keyword in the grammar for like_regex) that identifies it as using
a different regexp flavor, so the way forward to a possible compliant
version later is not needlessly blocked (or consigned to a
standard_conforming_strings-like experience).
What do you think about renaming existing operator from like_regex to
pg_like_regex? Or introducing special flag indicating that PostgreSQL
regex engine is used ('p' for instance)?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 07/29/19 18:27, Alexander Korotkov wrote:
What do you think about renaming existing operator from like_regex to
pg_like_regex? Or introducing special flag indicating that PostgreSQL
regex engine is used ('p' for instance)?
Renaming the operator is simple and certainly solves the problem.
I don't have a strong technical argument for or against any of:
$.** ? (@ pg_like_regex "O(w|v)" flag "i")
$.** ? (@ pg_like_regex "O(w|v)")
$.** ? (@ like_regex "O(w|v)" pg flag "i")
$.** ? (@ like_regex "O(w|v)" pg)
$.** ? (@ like_regex "O(w|v)" flag "ip")
$.** ? (@ like_regex "O(w|v)" flag "p")
It seems more of an aesthetic judgment (on which I am no particular
authority).
I think I would be -0.3 on the third approach just because of the need
to still spell out ' flag "p"' when there is no other flag you want.
I assume the first two approaches would be about equally easy to
implement, assuming there's a parser that already has an optional
production for "flag" STRING.
Both of the first two seem pretty safe from colliding with a
future addition to the standard.
To my aesthetic sense, pg_like_regex feels like "another operator
to remember" while like_regex ... pg feels like "ok, a slight variant
on the operator from the spec".
Later on, if a conformant version is added, the grammar might be a bit
simpler with just one name and an optional pg.
Going with a flag, there is some question of the likelihood of
the chosen flag letter being usurped by the standard at some point.
I'm leaning toward a flag for now in my own effort to provide the five SQL
functions (like_regex, occurrences_regex, position_regex, substring_regex,
and translate_regex), as for the time being it will be as an extension,
so no custom grammar for me, and I don't really want to make five
pg_* variant function names, and have that expand to ten function names
someday if the real ones are implemented. (Hmm, I suppose I could add
an optional function argument, distinct from flags; that would be
analogous to adding a pg in the grammar ... avoids overloading the flags,
avoids renaming the functions.)
I see in the Saxon library there is already a convention where it
allows a few flags undefined by the standard, after a semicolon in the
flag string. That has no official status; the XQuery spec
defines [smixq] and requires an error for anything else. But it
does have the advantage that the flag string can just be chopped
at the semicolon to eliminate all but the standard flags, and the
advantage (?) that at least one thing is already doing it.
Regards,
-Chap
Hi,
On 7/29/19 8:33 PM, Chapman Flack wrote:
On 07/29/19 18:27, Alexander Korotkov wrote:
What do you think about renaming existing operator from like_regex to
pg_like_regex? Or introducing special flag indicating that PostgreSQL
regex engine is used ('p' for instance)?Renaming the operator is simple and certainly solves the problem.
I don't have a strong technical argument for or against any of:
$.** ? (@ pg_like_regex "O(w|v)" flag "i")
$.** ? (@ pg_like_regex "O(w|v)")$.** ? (@ like_regex "O(w|v)" pg flag "i")
$.** ? (@ like_regex "O(w|v)" pg)$.** ? (@ like_regex "O(w|v)" flag "ip")
$.** ? (@ like_regex "O(w|v)" flag "p")It seems more of an aesthetic judgment (on which I am no particular
authority).I think I would be -0.3 on the third approach just because of the need
to still spell out ' flag "p"' when there is no other flag you want.I assume the first two approaches would be about equally easy to
implement, assuming there's a parser that already has an optional
production for "flag" STRING.Both of the first two seem pretty safe from colliding with a
future addition to the standard.To my aesthetic sense, pg_like_regex feels like "another operator
to remember" while like_regex ... pg feels like "ok, a slight variant
on the operator from the spec".Later on, if a conformant version is added, the grammar might be a bit
simpler with just one name and an optional pg.Going with a flag, there is some question of the likelihood of
the chosen flag letter being usurped by the standard at some point.I'm leaning toward a flag for now in my own effort to provide the five SQL
functions (like_regex, occurrences_regex, position_regex, substring_regex,
and translate_regex), as for the time being it will be as an extension,
so no custom grammar for me, and I don't really want to make five
pg_* variant function names, and have that expand to ten function names
someday if the real ones are implemented. (Hmm, I suppose I could add
an optional function argument, distinct from flags; that would be
analogous to adding a pg in the grammar ... avoids overloading the flags,
avoids renaming the functions.)
Looking at this thread and[1]/messages/by-id/5CF28EA0.80902@anastigmatix.net and the current state of open items[2]https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items, a
few thoughts:
It sounds like the easiest path to completion without potentially adding
futures headaches pushing back the release too far would be that, e.g.
these examples:
$.** ? (@ like_regex "O(w|v)" pg flag "i")
$.** ? (@ like_regex "O(w|v)" pg)
If it's using POSIX regexp, I would +1 using "posix" instead of "pg"
That said, from a user standpoint, it's slightly annoying to have to
include that keyword every time, and could potentially mean changing /
testing quite a bit of code once we do support XQuery regexps. Based on
how we currently handle regular expressions, we've already condition
user's to expect a certain behavior, and it would be inconsistent if we
do one thing in one place, and another thing here, so I would like for
us to be cognizant of that.
Reading the XQuery spec that Chapman provided[3]https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL/XML_Standards#XML_Query_regular_expressions, it sounds like there
are some challenges present if we were to try to implement XQuery-based
regexps.
I do agree with Alvaro's comment ("We have an opportunity to do
better")[4]/messages/by-id/20190618154907.GA6049@alvherre.pgsql, but I think we have to weigh the likelihood of actually
supporting the XQuery behaviors before we add more burden to our users.
Based on what needs to be done, it does not sound like it is any time soon.
My first choice would be to leave it as is. We can make it abundantly
clear that if we make changes in a future version we advise our users on
what actions to take, and counsel on any behavior changes.
My second choice is to have a flag that makes it clear what kind of
regex's are being used, in which case "posix" -- this is abundantly
clearer to the user, but still default, at present, to using "posix"
expressions. If we ever do add the XQuery ones, we can debate whether we
default to the standard at that time, and if we do, we treat it like we
treat other deprecation issues and make abundantly clear what the
behavior is now.
Thanks,
Jonathan
[1]: /messages/by-id/5CF28EA0.80902@anastigmatix.net
/messages/by-id/5CF28EA0.80902@anastigmatix.net
[2]: https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items
[3]: https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL/XML_Standards#XML_Query_regular_expressions
https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL/XML_Standards#XML_Query_regular_expressions
[4]: /messages/by-id/20190618154907.GA6049@alvherre.pgsql
/messages/by-id/20190618154907.GA6049@alvherre.pgsql
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
It sounds like the easiest path to completion without potentially adding
futures headaches pushing back the release too far would be that, e.g.
these examples:
$.** ? (@ like_regex "O(w|v)" pg flag "i")
$.** ? (@ like_regex "O(w|v)" pg)
If it's using POSIX regexp, I would +1 using "posix" instead of "pg"
I agree that we'd be better off to say "POSIX". However, having just
looked through the references Chapman provided, it seems to me that
the regex language Henry Spencer's library provides is awful darn
close to what XPath is asking for. The main thing I see in the XML/XPath
specs that we don't have is a bunch of character class escapes that are
specifically tied to Unicode character properties. We could possibly
add code to implement those, but I'm not sure how it'd work in non-UTF8
database encodings. There may also be subtle differences in the behavior
of character class escapes that we do have in common, such as "\s" for
white space; but again I'm not sure that those are any different than
what you get naturally from encoding or locale variations.
I think we could possibly get away with not having any special marker
on regexes, but just explaining in the documentation that "features
so-and-so are not implemented". Writing that text would require closer
analysis than I've seen in this thread as to exactly what the differences
are.
regards, tom lane
On 9/16/19 11:20 AM, Tom Lane wrote:
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
It sounds like the easiest path to completion without potentially adding
futures headaches pushing back the release too far would be that, e.g.
these examples:$.** ? (@ like_regex "O(w|v)" pg flag "i")
$.** ? (@ like_regex "O(w|v)" pg)If it's using POSIX regexp, I would +1 using "posix" instead of "pg"
I agree that we'd be better off to say "POSIX". However, having just
looked through the references Chapman provided, it seems to me that
the regex language Henry Spencer's library provides is awful darn
close to what XPath is asking for. The main thing I see in the XML/XPath
specs that we don't have is a bunch of character class escapes that are
specifically tied to Unicode character properties. We could possibly
add code to implement those, but I'm not sure how it'd work in non-UTF8
database encodings.
Maybe taking a page from the pg_saslprep implementation. For some cases
where the string in question would issue a "reject" under normal
SASLprep[1]https://tools.ietf.org/html/rfc4013 considerations (really stringprep[2]https://www.ietf.org/rfc/rfc3454.txt), PostgreSQL just lets
the string passthrough to the next step, without alteration.
What's implied here is if the string is UTF-8, it goes through SASLprep,
but if not, it is just passed through.
So perhaps the answer is that if we implement XQuery, the escape for
UTF-8 character properties are only honored if the encoding is set to be
UTF-8, and ignored otherwise. We would have to document that said
escapes only work on UTF-8 encodings.
There may also be subtle differences in the behavior
of character class escapes that we do have in common, such as "\s" for
white space; but again I'm not sure that those are any different than
what you get naturally from encoding or locale variations.I think we could possibly get away with not having any special marker
on regexes, but just explaining in the documentation that "features
so-and-so are not implemented". Writing that text would require closer
analysis than I've seen in this thread as to exactly what the differences
are.
+1, and likely would need some example strings too that highlight the
difference in how they are processed.
And again, if we end up updating the behavior in the future, it becomes
a part of our standard deprecation notice at the beginning of the
release notes, though one that could require a lot of explanation.
Jonathan
[1]: https://tools.ietf.org/html/rfc4013
[2]: https://www.ietf.org/rfc/rfc3454.txt
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
On 9/16/19 11:20 AM, Tom Lane wrote:
I think we could possibly get away with not having any special marker
on regexes, but just explaining in the documentation that "features
so-and-so are not implemented". Writing that text would require closer
analysis than I've seen in this thread as to exactly what the differences
are.
+1, and likely would need some example strings too that highlight the
difference in how they are processed.
I spent an hour digging through these specs. I was initially troubled
by the fact that XML Schema regexps are implicitly anchored, ie must
match the whole string; that's a huge difference from POSIX. However,
19075-6 says that jsonpath like_regex works the same as the LIKE_REGEX
predicate in SQL; and SQL:2011 "9.18 XQuery regular expression matching"
defines LIKE_REGEX to work exactly like XQuery's fn:matches function,
except for some weirdness around newline matching; and that spec
clearly says that fn:matches treats its pattern argument as NOT anchored.
So it looks like we end up in the same place as POSIX for this.
Otherwise, the pattern language differences I could find are all details
of character class expressions (bracket expressions, such as "[a-z0-9]")
and escapes that are character class shorthands:
* We don't have "character class subtraction". I'd be pretty hesitant
to add that to our regexp language because it seems to change "-" into
a metacharacter, which would break an awful lot of regexps. I might
be misunderstanding their syntax for it, because elsewhere that spec
explicitly claims that "-" is not a metacharacter.
* Character class elements can be #xNN (NN being hex digits), which seems
equivalent to POSIX \xNN as long as you're using UTF8 encoding. Again,
the compatibility costs of allowing that don't seem attractive, since #
isn't a metacharacter today.
* Character class elements can be \p{UnicodeProperty} or
the complement \P{UnicodeProperty}, where there are a bunch of different
possible properties. Perhaps we could add that someday; since there's no
reason to escape "p" or "P" today, this doesn't seem like it'd be a huge
compatibility hit. But I'm content to document this as unimplemented
for now.
* XQuery adds character class shorthands \i (complement \I) for "initial
name characters" and \c (complement \C) for "NameChar". Same as above;
maybe add someday, but no hurry.
* It looks like XQuery's \w class might allow more characters than our
interpretation does, and hence \W allows fewer. But since \w devolves
to what libc thinks the "alnum" class is, it's at least possible that
some locales might do the same thing XQuery calls for.
* Likewise, any other discrepancies between the Unicode-centric character
class definitions in XQuery and what our stuff does are well within the
boundaries of locale variances. So I don't feel too bad about that.
* The SQL-spec newline business mentioned above is a possible exception:
it appears to require that when '.' is allowed to match newlines, a
single '.' should match a '\r\n' Windows newline. I think we can
document that and move on.
* The x flag in XQuery is defined as ignoring all whitespace in
the pattern except within character class expressions. Spencer's
x flag does mostly that, but it thinks that "\ " means a literal space
whereas XQuery explicitly says that the space is ignored and the
backslash applies to the next non-space character. (That's just
weird, in my book.) Also, Spencer's x mode causes # to begin
a comment extending to EOL, which is a nice thing XQuery hasn't
got, and it says you can't put spaces within multi-character
symbols like "(?:", which presumably is allowed with XQuery's "x".
I feel a bit uncomfortable with these inconsistencies in x-flag
rules. We could probably teach the regexp library to have an
alternate expanded mode that matches XQuery's rules, but that's
not a project to tackle for v12. I tentatively recommend that
we remove the jsonpath "x" flag for the time being.
Also, I noted some things that seem to be flat out sloppiness
in the XQuery flag conversions:
* The newline-matching flags (m and s flags) can be mapped to
features of Spencer's library, but jsonpath_gram.y does so
incorrectly.
* XQuery says that the q flag overrides m, s, and x flags, which is
exactly the opposite of what our code does; besides which the code
is flag-order-sensitive which is just wrong.
These last two are simple to fix and we should just go do it.
Otherwise, I think we're okay with regarding Spencer's library
as being a sufficiently close approximation to LIKE_REGEX.
We need some documentation work though.
regards, tom lane
On 9/16/19 5:10 PM, Tom Lane wrote:
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
On 9/16/19 11:20 AM, Tom Lane wrote:
I think we could possibly get away with not having any special marker
on regexes, but just explaining in the documentation that "features
so-and-so are not implemented". Writing that text would require closer
analysis than I've seen in this thread as to exactly what the differences
are.+1, and likely would need some example strings too that highlight the
difference in how they are processed.I spent an hour digging through these specs.
Thanks! That sounds like quite the endeavor...
I was initially troubled
by the fact that XML Schema regexps are implicitly anchored, ie must
match the whole string; that's a huge difference from POSIX. However,
19075-6 says that jsonpath like_regex works the same as the LIKE_REGEX
predicate in SQL; and SQL:2011 "9.18 XQuery regular expression matching"
defines LIKE_REGEX to work exactly like XQuery's fn:matches function,
except for some weirdness around newline matching; and that spec
clearly says that fn:matches treats its pattern argument as NOT anchored.
So it looks like we end up in the same place as POSIX for this.Otherwise, the pattern language differences I could find are all details
of character class expressions (bracket expressions, such as "[a-z0-9]")
and escapes that are character class shorthands:* We don't have "character class subtraction". I'd be pretty hesitant
to add that to our regexp language because it seems to change "-" into
a metacharacter, which would break an awful lot of regexps. I might
be misunderstanding their syntax for it, because elsewhere that spec
explicitly claims that "-" is not a metacharacter.
Using something I could understand[1]https://www.regular-expressions.info/charclasssubtract.html it looks like the syntax is like:
[a-z-[aeiou]
e.g. all the consonants of the alphabet. I don't believe that would
break many, if any, regexps. I also don't know what kind of effort it
would take to add that in given I had not looked at the regexp code
until today (and only at some of the amusing comments in the header
file, which seemed like it wasn't expected the code would be read 20
years later), but it would likely not be a v12 problem.
* Character class elements can be #xNN (NN being hex digits), which seems
equivalent to POSIX \xNN as long as you're using UTF8 encoding. Again,
the compatibility costs of allowing that don't seem attractive, since #
isn't a metacharacter today.
Seems reasonable.
* Character class elements can be \p{UnicodeProperty} or
the complement \P{UnicodeProperty}, where there are a bunch of different
possible properties. Perhaps we could add that someday; since there's no
reason to escape "p" or "P" today, this doesn't seem like it'd be a huge
compatibility hit. But I'm content to document this as unimplemented
for now.
+1.
* XQuery adds character class shorthands \i (complement \I) for "initial
name characters" and \c (complement \C) for "NameChar". Same as above;
maybe add someday, but no hurry.
+1.
* It looks like XQuery's \w class might allow more characters than our
interpretation does, and hence \W allows fewer. But since \w devolves
to what libc thinks the "alnum" class is, it's at least possible that
some locales might do the same thing XQuery calls for.
I'd still add this to the "to document" list.
* The SQL-spec newline business mentioned above is a possible exception:
it appears to require that when '.' is allowed to match newlines, a
single '.' should match a '\r\n' Windows newline. I think we can
document that and move on.
+1.
* The x flag in XQuery is defined as ignoring all whitespace in
the pattern except within character class expressions. Spencer's
x flag does mostly that, but it thinks that "\ " means a literal space
whereas XQuery explicitly says that the space is ignored and the
backslash applies to the next non-space character. (That's just
weird, in my book.) Also, Spencer's x mode causes # to begin
a comment extending to EOL, which is a nice thing XQuery hasn't
got, and it says you can't put spaces within multi-character
symbols like "(?:", which presumably is allowed with XQuery's "x".I feel a bit uncomfortable with these inconsistencies in x-flag
rules. We could probably teach the regexp library to have an
alternate expanded mode that matches XQuery's rules, but that's
not a project to tackle for v12.
That does not sound fun by any means. But likely that would be a part of
an overall effort to implement XQuery rules.
I tentatively recommend that
we remove the jsonpath "x" flag for the time being.
I would add an alternative suggestion of just removing that "x" is
supported in the documentation...but likely better to just remove the
flag + docs.
Also, I noted some things that seem to be flat out sloppiness
in the XQuery flag conversions:* The newline-matching flags (m and s flags) can be mapped to
features of Spencer's library, but jsonpath_gram.y does so
incorrectly
* XQuery says that the q flag overrides m, s, and x flags, which is
exactly the opposite of what our code does; besides which the code
is flag-order-sensitive which is just wrong.These last two are simple to fix and we should just go do it.
+1.
Otherwise, I think we're okay with regarding Spencer's library
as being a sufficiently close approximation to LIKE_REGEX.
We need some documentation work though.
My main question is "where" -- I'm thinking somewhere in the JSON
path[2]https://www.postgresql.org/docs/12/functions-json.html#FUNCTIONS-SQLJSON-PATH section. After reading your email 3 times, I may have enough
knowledge to attempt some documentation on the regexp in JSON path.
Jonathan
[1]: https://www.regular-expressions.info/charclasssubtract.html
[2]: https://www.postgresql.org/docs/12/functions-json.html#FUNCTIONS-SQLJSON-PATH
https://www.postgresql.org/docs/12/functions-json.html#FUNCTIONS-SQLJSON-PATH
On 09/16/19 17:10, Tom Lane wrote:
I was initially troubled
by the fact that XML Schema regexps are implicitly anchored, ie must
match the whole string; that's a huge difference from POSIX. However,
19075-6 says that jsonpath like_regex works the same as the LIKE_REGEX
predicate in SQL; and SQL:2011 "9.18 XQuery regular expression matching"
defines LIKE_REGEX to work exactly like XQuery's fn:matches function,
except for some weirdness around newline matching; and that spec
clearly says that fn:matches treats its pattern argument as NOT anchored.
Yeah, it's a layer cake. XML Schema regexps[1]https://www.w3.org/TR/xmlschema-2/#regexs are implicitly anchored and
don't have any metacharacters devoted to anchoring.
XQuery regexps layer onto[2]https://www.w3.org/TR/xpath-functions-31/#regex-syntax XML Schema regexps, adding ^ and $ anchors,
rescinding the implicit anchored-ness, adding reluctant quantifiers,
capturing groups, and back-references, and defining flags.
Then ISO SQL adds a third layer changing the newline semantics, affecting
^, $, ., \s, and \S.
Regards,
-Chap
[1]: https://www.w3.org/TR/xmlschema-2/#regexs
[2]: https://www.w3.org/TR/xpath-functions-31/#regex-syntax
On 9/16/19 6:39 PM, Jonathan S. Katz wrote:
My main question is "where" -- I'm thinking somewhere in the JSON
path[2] section. After reading your email 3 times, I may have enough
knowledge to attempt some documentation on the regexp in JSON path.
Here is said attempt to document. Notes:
- I centered it around the specification for LIKE_REGEX, which uses
XQuery, but primarily noted where our implementation of POSIX regex's
differs from what is specified for LIKE_REGEX vis-a-vis XQuery
- I put the pith of the documentation in a subsection off of "POSIX
regular expressions"
- I noted that LIKE_REGEX is specified in SQL:2008, which I read on the
Internet(tm) but was not able to confirm in the spec as I do not have a copy
- For my explanation about the "x" flag differences, I talked about how
we extended it, but I could not capture how Tom described the nuances above.
- From the SQL/JSON path docs, I added a section on regular expressions
stating what the behavior is, and referring back to the main regex docs
- I removed the "x" flag being supported for like_regex in JSON path
I also presume it needs a bit of wordsmithing / accuracy checks, but
hope it's a good start and does not require a massive rewrite.
Thanks,
Jonathan
Attachments:
regex.patchtext/plain; charset=UTF-8; name=regex.patch; x-mac-creator=0; x-mac-type=0Download+118-3
On 2019-09-17 17:38, Jonathan S. Katz wrote:
On 9/16/19 6:39 PM, Jonathan S. Katz wrote:
[regex.patch]
A few things/typos caught my eye:
1.
'implementation' seems the wrong word in sentence:
"Several other parts of the SQL standard
also define LIKE_REGEX equivalents that refer
to this implementation, including the
SQL/JSON path like_regex filter."
As I understand this text, 'concept' seems better.
I'd drop 'also', too.
2.
'whereas the POSIX will those' should be
'whereas POSIX will regard those'
or maybe 'read those'
3.
+ The SQL/JSON standard borrows its definition for how regular
expressions
+ from the <literal>LIKE_REGEX</literal> operator, which in turns
uses the
+ XQuery standard.
That sentence needs the verb 'work', no? 'for how regular expressions
work [..]'
Or alternatively drop 'how'.
thanks,
Erik Rijkers
On 9/17/19 12:09 PM, Erik Rijkers wrote:
On 2019-09-17 17:38, Jonathan S. Katz wrote:
[regex.patch]
Thanks for the review!
"Several other parts of the SQL standard
also define LIKE_REGEX equivalents that refer
to this implementation, including the
SQL/JSON path like_regex filter."As I understand this text, 'concept' seems better.
I'd drop 'also', too.
I rewrote this to be:
"Several other parts of the SQL standard refer to the LIKE_REGEX
specification to define similar operations, including..."
2.
'whereas the POSIX will those' should be
'whereas POSIX will regard those'
or maybe 'read those'
I used "treat those"
3. + The SQL/JSON standard borrows its definition for how regular expressions + from the <literal>LIKE_REGEX</literal> operator, which in turns uses the + XQuery standard. That sentence needs the verb 'work', no? 'for how regular expressions work [..]' Or alternatively drop 'how'.
I dropped the "how".
v2 attached. Thanks!
Jonathan
Attachments:
regex-v2.patchtext/plain; charset=UTF-8; name=regex-v2.patch; x-mac-creator=0; x-mac-type=0Download+118-3
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
v2 attached. Thanks!
I whacked this around some (well, quite a bit actually); notably,
I thought we'd better describe things that are in our engine but
not XQuery, as well as vice-versa.
After a re-read of the XQuery spec, it seems to me that the character
entry form that they have and we don't is actually "&#NNNN;" like
HTML, rather than just "#NN". Can anyone double-check that? Does
it work outside bracket expressions, or only inside?
regards, tom lane
Attachments:
jsonpath-regex-doc-v3.patchtext/x-diff; charset=us-ascii; name=jsonpath-regex-doc-v3.patchDownload+173-4
On 9/17/19 6:40 PM, Tom Lane wrote:
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
v2 attached. Thanks!
I whacked this around some (well, quite a bit actually);
So I see :) Thanks.
notably,
I thought we'd better describe things that are in our engine but
not XQuery, as well as vice-versa.
Yeah, that makes sense. Overall it reads really well. One question I had
in my head (and probably should have asked) was answered around the \w
character class wrt collation.
After a re-read of the XQuery spec, it seems to me that the character
entry form that they have and we don't is actually "&#NNNN;" like
HTML, rather than just "#NN". Can anyone double-check that?
Clicking through the XQuery spec eventual got me to here[1]https://www.w3.org/TR/2000/WD-xml-2e-20000814#dt-charref (which warns
me that its out of date, but that is what its "current" specs linked me
to), which describes being able to use "&#[0-9]+;" and "&#[0-9a-fA-F]+;"
to specify characters (which I recognize as a character escape from
HTML, XML et al.).
So based on that, my answer is "yes."
Does
it work outside bracket expressions, or only inside?
Looking at the parse tree (start with the "atom"[2]https://www.w3.org/TR/xmlschema-2/#nt-atom), I read it as being
able to use that syntax both inside and outside the bracket expressions.
Here is a v4. I added some more paragraphs the bullet point that
explains the different flags to make it feel a bit less dense.
Thanks,
Jonathan
[1]: https://www.w3.org/TR/2000/WD-xml-2e-20000814#dt-charref
[2]: https://www.w3.org/TR/xmlschema-2/#nt-atom
Attachments:
jsonpath-regex-doc-v4.patchtext/plain; charset=UTF-8; name=jsonpath-regex-doc-v4.patch; x-mac-creator=0; x-mac-type=0Download+177-4
On 09/17/19 21:13, Jonathan S. Katz wrote:
to), which describes being able to use "&#[0-9]+;" and "&#[0-9a-fA-F]+;"
Er, that is, "&#[0-9]+;" and "&#x[0-9a-fA-F]+;" (with x for the hex case).
Does
it work outside bracket expressions, or only inside?Looking at the parse tree (start with the "atom"[2]), I read it as being
able to use that syntax both inside and outside the bracket expressions.
Maybe I can plug a really handy environment for messin'-around-in-XQuery,
BaseX: http://basex.org/
All the buzzwords on the landing page make it seem as if it's going to be
some monstrous thing to download and set up, but on the downloads page,
the "Core Package" option is a single standalone 3.8 MB jar file:
http://files.basex.org/releases/9.2.4/BaseX924.jar
"java -jar BaseX924.jar" is all it takes to start up, and wham, you're
in a nice IDE-like environment where the editor pane is syntax-aware
for XQuery and will run your code and show results with a click of the
go button.
Regards,
-Chap
On 9/17/19 10:00 PM, Chapman Flack wrote:
On 09/17/19 21:13, Jonathan S. Katz wrote:
to), which describes being able to use "&#[0-9]+;" and "&#[0-9a-fA-F]+;"
Er, that is, "&#[0-9]+;" and "&#x[0-9a-fA-F]+;" (with x for the hex case).
Correct, I missed the "x".
Thanks,
Jonathan
On Wed, Sep 18, 2019 at 4:13 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:
Here is a v4. I added some more paragraphs the bullet point that
explains the different flags to make it feel a bit less dense.
Sorry that I didn't participate this discussion till now. FWIW, I
agree with selected approach to document differences with XQuery regex
and and forbid 'x' from jsonpath like_regex. Patch also looks good
for me at the first glance.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company