PostgreSQL 9.3.5 substring(text from pattern for escape) bug
I believe I have come across a bug in the substring(text from pattern
for escape) function.
What I am attempting to do is take a string like 'QMy NameQ' , strip off
leading/and/or/trailing Qs and return 'My Name'.
The substring() call that I have coded is substring(xxx from 'Q?#"%#"Q?'
FOR '#')
My understand of this is:
Q? -- matches 0 or 1 occurrences of Q (the leading one, if present)
#" -- starts data collection
% -- matches any number of characters
#" -- ends data collection
Q? -- matches 0 or 1 occurrences of Q (the trialing one, if present)
What appears to be happening is that the _Q? on the tail_ is being
matched, but _the Q is also being included in the collected data.__
_
The attached PSQL test.sql file creates a table, populates it with 7
test cases with my expected results, and then executes substring()
against it.
There is an attached test_output.txt file is the PSQL output that
illustrates the problem. The column "Error" is true when the actual
value disagrees with what I expected.
Robert Schreiber
410-392-9553
PostgreSQL 9.3.5, compiled by Visual C++ build 1600, 64-bit
Running under Windows 10 64-bit.
Import Notes
Reply to msg id not found: 08036370-e3d3-3a28-ab2a-da13db8be65e@charter.netReference msg id not found: 08036370-e3d3-3a28-ab2a-da13db8be65e@charter.net
Robert Schreiber wrote:
My understand of this is:
Q? -- matches 0 or 1 occurrences of Q (the leading one, if present)
#" -- starts data collection
% -- matches any number of characters
#" -- ends data collection
Q? -- matches 0 or 1 occurrences of Q (the trialing one, if present)What appears to be happening is that the _Q? on the tail_ is being
matched, but _the Q is also being included in the collected data.__
The expression as a whole is being matched, otherwise this form of
substring() would return NULL, as told in
https://www.postgresql.org/docs/current/functions-matching.html :
"As with SIMILAR TO, the specified pattern must match the entire
data string, or else the function fails and returns null. "
Considering your example, where 'My Q NameQ' is the string and
'Q?#"%#"Q?' the regexp. The engine can either see the ending
'Q' as part of the % match, or as a match for the 'Q?' that
follows. Both cases are valid matches.
You seem to expect that % must be non-greedy and let the final Q?
match 1 Q instead of 0, but there doesn't appear to be anything
in the doc that supports this interpretation.
In fact, it mentions that "%" is comparable to ".*" in POSIX
regular expressions, and the latter _is_ greedy.
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
"Daniel Verite" <daniel@manitou-mail.org> writes:
Robert Schreiber wrote:
What appears to be happening is that the _Q? on the tail_ is being
matched, but _the Q is also being included in the collected data.__
You seem to expect that % must be non-greedy and let the final Q?
match 1 Q instead of 0, but there doesn't appear to be anything
in the doc that supports this interpretation.
In fact, it mentions that "%" is comparable to ".*" in POSIX
regular expressions, and the latter _is_ greedy.
Right. You could get the behavior you want using a non-greedy quantifier,
but you'd have to use the POSIX regexp functions, not substring().
regards, tom lane
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
What appears to be happening is that the _Q? on the tail_ is being
matched, but _the Q is also being included in the collected data.__
You seem to expect that % must be non-greedy and let the final Q?
match 1 Q instead of 0, but there doesn't appear to be anything in
the doc that supports this interpretation. In fact, it mentions that
"%" is comparable to ".*" in POSIX regular expressions, and the
latter _is_ greedy.
Tom> Right. You could get the behavior you want using a non-greedy
Tom> quantifier, but you'd have to use the POSIX regexp functions, not
Tom> substring().
I looked up the spec on this point. As far as I can see, we're not
following it, but neither does the spec do what the OP wanted; in fact
the result should have included the _leading_ Q as well as the trailing
one.
The relevant part of SQL2016 seems to be this:
6.32 <string value function>
General Rules
6) If <regular expression substring function> is specified, then:
[...rules that split the pattern into 'R1#"R2#"R3' omitted...]
h) Otherwise, the result S of the <regular expression substring
function> is computed as follows:
i) Let S1 be the shortest initial substring of C such that there is
a substring S23 of C such that the value of the following <search
condition> is True:
'C' = 'S1' || 'S23' AND
'S1' SIMILAR TO 'R1' ESCAPE 'E' AND
'S23' SIMILAR TO '(R2R3)' ESCAPE 'E'
ii) Let S3 be the shortest final substring of S23 such that there is
a substring S2 of S23 such that the value of the following
<search condition> is True:
'S23' = 'S2' || 'S3' AND
'S2' SIMILAR TO 'R2' ESCAPE 'E' AND
'S3' SIMILAR TO 'R3' ESCAPE 'E'
iii) The result of the <regular expression substring function> is S2.
--
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
I looked up the spec on this point. As far as I can see, we're not
following it, but neither does the spec do what the OP wanted; in fact
the result should have included the _leading_ Q as well as the trailing
one.
Huh, interesting. So we should be translating the initial substring
to a non-greedy pattern. I believe Spencer's engine can handle that
by sticking (?:...){1,1}? around it.
Come to think of it, we probably need to be putting (?:...) around
the trailing substring as well. I suspect what we're doing today
produces non-spec results if "|" appears in the trailing part.
regards, tom lane
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
I looked up the spec on this point. As far as I can see, we're not
following it, but neither does the spec do what the OP wanted; in
fact the result should have included the _leading_ Q as well as the
trailing one.
Tom> Huh, interesting. So we should be translating the initial
Tom> substring to a non-greedy pattern. I believe Spencer's engine can
Tom> handle that by sticking (?:...){1,1}? around it.
Tom> Come to think of it, we probably need to be putting (?:...) around
Tom> the trailing substring as well. I suspect what we're doing today
Tom> produces non-spec results if "|" appears in the trailing part.
Digging into it more:
SUBSTRING(x FROM 'expr' FOR 'escape') is from sql92/sql99 and is gone by
sql2008, replaced by SUBSTRING(x SIMILAR 'expr' ESCAPE 'escape'). sql99
defines the matching rule using different language, but with the same
actual effect (requiring shortest matches for the leading and trailing
strings).
Your suggested fix doesn't seem to work. If the leading/trailing
substrings do not have | or parens in then it seems to work to wrap them
in (?:(?:)??...), thanks to the rule that the first quantified atom in a
subexpression sets the whole subexpression's greediness, but handling |
or parens correctly seems harder.
Are there any other dbs that implement this feature that we can compare
against?
--
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> Huh, interesting. So we should be translating the initial
Tom> substring to a non-greedy pattern. I believe Spencer's engine can
Tom> handle that by sticking (?:...){1,1}? around it.
Your suggested fix doesn't seem to work. If the leading/trailing
substrings do not have | or parens in then it seems to work to wrap them
in (?:(?:)??...), thanks to the rule that the first quantified atom in a
subexpression sets the whole subexpression's greediness, but handling |
or parens correctly seems harder.
[ pokes at that... ] Huh. That's a bug, which AFAICS is aboriginal in
Henry's code: it optimizes away a {1,1} quantifier without regard to
whether the quantifier is attempting to impose a different greediness
preference than its argument would have naturally. The attached
seems to fix it.
regards, tom lane
Attachments:
fix-regex-greediness-imposition.patchtext/x-diff; charset=us-ascii; name=fix-regex-greediness-imposition.patchDownload+63-1
Just to be clear here. It seems to me that I am right in that the
leading/trailing Q should NOT be returned...
From Section 9.7.2 in the manual:
The |substring| function with three parameters, |substring(/string/
from /pattern/ for /escape-character/)|, provides extraction of a
substring that matches an SQL regular expression pattern. As with
SIMILAR TO, the specified pattern must match the entire data string,
or else the function fails and returns null. To indicate the part of
the pattern that should be returned on success, the pattern must
contain two occurrences of the escape character followed by a double
quote ("). The text matching the portion of the pattern between
these markers is returned.
In my mind I see this:
QMy Q NAmeQ is interpreted as Q/My Q Name/Q rather than Q/My Q Name Q//.
bob
Show quoted text
On 5/12/2019 12:27 AM, Andrew Gierth wrote:
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
I looked up the spec on this point. As far as I can see, we're not
following it, but neither does the spec do what the OP wanted; in
fact the result should have included the _leading_ Q as well as the
trailing one.Tom> Huh, interesting. So we should be translating the initial
Tom> substring to a non-greedy pattern. I believe Spencer's engine can
Tom> handle that by sticking (?:...){1,1}? around it.Tom> Come to think of it, we probably need to be putting (?:...) around
Tom> the trailing substring as well. I suspect what we're doing today
Tom> produces non-spec results if "|" appears in the trailing part.Digging into it more:
SUBSTRING(x FROM 'expr' FOR 'escape') is from sql92/sql99 and is gone by
sql2008, replaced by SUBSTRING(x SIMILAR 'expr' ESCAPE 'escape'). sql99
defines the matching rule using different language, but with the same
actual effect (requiring shortest matches for the leading and trailing
strings).Your suggested fix doesn't seem to work. If the leading/trailing
substrings do not have | or parens in then it seems to work to wrap them
in (?:(?:)??...), thanks to the rule that the first quantified atom in a
subexpression sets the whole subexpression's greediness, but handling |
or parens correctly seems harder.Are there any other dbs that implement this feature that we can compare
against?
Robert Schreiber <bobschreiber@charter.net> writes:
Just to be clear here. It seems to me that I am right in that the
leading/trailing Q should NOT be returned...
Nope. The SQL spec's pretty clear that when the first and third
sub-expressions of the pattern can match variable amounts of text,
they're supposed to be matched to the minimum possible amount of text,
which is nothing for a pattern like "Q?". It is a bug that we're
not doing that as the spec says, but once we fix that, it still won't
act as you're hoping.
regards, tom lane
Here's a proposed patch for dealing with the non-spec greediness behavior.
It also fixes misbehavior with | in the pattern, as shown in the new
test cases (well, if you try them against HEAD, you'll see the
misbehavior).
I haven't done anything about most of the SQL-spec incompatibilities
mentioned in the other thread, but I did change the code to throw error
for more than two escape-double-quote separators, mainly because not
doing so would've required more complexity in the result string
allocation logic.
Also, rather than throw error for a single separator, I just let the
code ignore the case, because with this implementation a fairly reasonable
behavior falls out: it acts the same as if the last pattern section were
empty. I'm a little worried that that's turning an implementation
artifact into a feature, but it seems to line up well with our historical
behavior for no separators (namely, we act as though the first and last
sections were both empty).
I've included doc changes addressing these specific behaviors, but
not done anything about the other omissions in the docs.
Comments? Should we try to get more than this done for v12, or just
leave it for some other day?
regards, tom lane
Attachments:
fix-SQL-substring-pattern-greediness.patchtext/x-diff; charset=us-ascii; name=fix-SQL-substring-pattern-greediness.patchDownload+174-20
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> Here's a proposed patch for dealing with the non-spec greediness
Tom> behavior.
Looks good to me so far.
--
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> Here's a proposed patch for dealing with the non-spec greediness
Tom> behavior.
Looks good to me so far.
Thanks for looking! Pushed now.
regards, tom lane