9.5.3: substring: regex greedy operator not picking up chars as expected

Started by Foster, Russellover 9 years ago4 messagesbugs
Jump to latest
#1Foster, Russell
Russell.Foster@crl.com

Hello,

For the following query:

select substring('>772' from '.*?[0-9]+')

I would expect the output to be '>772', but it is '>7'. You can also see the expected result on https://regex101.com/, although I am aware not all regex processors work the same.

The following queries:

select substring('>772' from '^.*?[0-9]+$')

and:

select substring('>772' from '[0-9]+')

both return '>772', which is expected. Could the less greedy operator on the left (.*?) be affecting the more greedy right one (+)?

Thanks,
Russell Foster

#2David G. Johnston
david.g.johnston@gmail.com
In reply to: Foster, Russell (#1)
Re: 9.5.3: substring: regex greedy operator not picking up chars as expected

​Working as documented.​

https://www.postgresql.org/docs/9.5/static/functions-matching.html#POSIX-MATCHING-RULES

Specifically, this implementation considers greediness at a level higher
than just the atom/expression - and in a mixed "branch" if there is a
non-greedy quantifier in a branch the entire branch is non-greedy and can
in many situations cause greedy atoms to behave non-greedily.

In might help to consider that there aren't really any explicit "greedy"
operators like other engines have (i.e., ??, ?, ?+) but rather non-greedy
(lazy) and default. The default inherits the non-greedy trait from its
parent if applicable otherwise is behaves greedily.

On Mon, Aug 15, 2016 at 7:53 AM, Foster, Russell <Russell.Foster@crl.com>
wrote:

Hello,

For the following query:

select substring('>772' from '.*?[0-9]+')

​The pattern itself is non-greedy​ due to their only being a single branch
and it having a non-greedy quantifier within it.

.*? matches ">" and [0-9]+ only needs a single character to generate a
non-greedy match conforming match

I would expect the output to be ‘>772’, but it is ‘>7’. You can also see
the expected result on https://regex101.com/, although I am aware not all
regex processors work the same.

The following queries:

select substring('>772' from '^.*?[0-9]+$')

​This is treated exactly the same as the above but because of the ^$ the
shortest possible output string is the entire string​

and:

select substring('>772' from '[0-9]+')

both return ‘>772’, which is expected. Could the less greedy operator on
the left (.*?) be affecting the more greedy right one (+)?

Typo here? I'm not fluent with substring(regex).

Anyway, the entire RE (single branch) is now greedy so the greedy [0-9]+
atom matches as many numbers as possible.

David J.

#3Foster, Russell
Russell.Foster@crl.com
In reply to: David G. Johnston (#2)
Re: 9.5.3: substring: regex greedy operator not picking up chars as expected

Hi David,

Must have missed that in the manual, but makes sense now. Somewhat strange behavior that a non-greedy quantifier basically ruins the rest of the expression for the greedy ones, but at least it’s working as designed. Thanks!

Russell

From: David G. Johnston [mailto:david.g.johnston@gmail.com]
Sent: 15 August 2016 8:45 AM
To: Foster, Russell <Russell.Foster@crl.com>
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] 9.5.3: substring: regex greedy operator not picking up chars as expected

​Working as documented.​

https://www.postgresql.org/docs/9.5/static/functions-matching.html#POSIX-MATCHING-RULES&lt;https://na01.safelinks.protection.outlook.com/?url=https%3a%2f%2fwww.postgresql.org%2fdocs%2f9.5%2fstatic%2ffunctions-matching.html%23POSIX-MATCHING-RULES&amp;data=01%7c01%7cRussell.Foster%40crl.com%7cc1e71359ea8c4aa0a40008d3c509f7cf%7c374f8930e1504031bb35483215fe5900%7c0&amp;sdata=n4FmWZi0%2f%2bdgZ5KrY3Bfk1O0npbVGK%2bRCHWnNMMmXVo%3d&gt;

Specifically, this implementation considers greediness at a level higher than just the atom/expression - and in a mixed "branch" if there is a non-greedy quantifier in a branch the entire branch is non-greedy and can in many situations cause greedy atoms to behave non-greedily.

In might help to consider that there aren't really any explicit "greedy" operators like other engines have (i.e., ??, ?, ?+) but rather non-greedy (lazy) and default. The default inherits the non-greedy trait from its parent if applicable otherwise is behaves greedily.

On Mon, Aug 15, 2016 at 7:53 AM, Foster, Russell <Russell.Foster@crl.com<mailto:Russell.Foster@crl.com>> wrote:
Hello,

For the following query:

select substring('>772' from '.*?[0-9]+')

​The pattern itself is non-greedy​ due to their only being a single branch and it having a non-greedy quantifier within it.

.*? matches ">" and [0-9]+ only needs a single character to generate a non-greedy match conforming match

I would expect the output to be ‘>772’, but it is ‘>7’. You can also see the expected result on https://regex101.com/&lt;https://na01.safelinks.protection.outlook.com/?url=https%3a%2f%2fregex101.com%2f&amp;data=01%7c01%7cRussell.Foster%40crl.com%7cc1e71359ea8c4aa0a40008d3c509f7cf%7c374f8930e1504031bb35483215fe5900%7c0&amp;sdata=ye55TdPxGOB6NUoDn85l%2fEg8o9MgYPkbOv%2bg4mGaXw4%3d&gt;, although I am aware not all regex processors work the same.

The following queries:

select substring('>772' from '^.*?[0-9]+$')

​This is treated exactly the same as the above but because of the ^$ the shortest possible output string is the entire string​

and:

select substring('>772' from '[0-9]+')

both return ‘>772’, which is expected. Could the less greedy operator on the left (.*?) be affecting the more greedy right one (+)?

Typo here? I'm not fluent with substring(regex).

Anyway, the entire RE (single branch) is now greedy so the greedy [0-9]+ atom matches as many numbers as possible.

David J.

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Foster, Russell (#1)
Re: 9.5.3: substring: regex greedy operator not picking up chars as expected

"Foster, Russell" <Russell.Foster@crl.com> writes:

For the following query:
select substring('>772' from '.*?[0-9]+')
I would expect the output to be '>772', but it is '>7'.

As David pointed out, that's what you get because the RE as a whole is
considered to be non-greedy, ie you get the shortest overall match.
However, you can adjust that by decorating the RE:

# select substring('>772' from '(.*?[0-9]+){1,1}');
substring
-----------

772

(1 row)

Now it's longest-overall, but the .*? part is still shortest-match,
so it doesn't consume any digits. However, I suspect that still is
not quite what you want, because it consumes too much in cases like:

# select substring('>772foo444' from '(.*?[0-9]+){1,1}');
substring
------------

772foo444

(1 row)

There's probably really no way out of that except to be less lazy about
writing the pattern:

# select substring('>772foo444' from '([^0-9]*?[0-9]+){1,1}');
substring
-----------

772

(1 row)

and in that formulation, of course, greediness doesn't really matter
because there is only one way to match.

# select substring('>772foo444' from '[^0-9]*[0-9]+');
substring
-----------

772

(1 row)

See
https://www.postgresql.org/docs/9.5/static/functions-matching.html#POSIX-MATCHING-RULES

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs