BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

Started by Christian Mächlerover 10 years ago10 messagesbugs
Jump to latest
#1Christian Mächler
christian_maechler@hotmail.com

The following bug has been logged on the website:

Bug reference: 13538
Logged by: Chris Mächler
Email address: christian_maechler@hotmail.com
PostgreSQL version: 9.3.0
Operating system: ?
Description:

Here is an example to verify and reproduce the error (extract a number and
the things before and after it with 3 groups):

'(.*)([+-]?[0-9]*\.[0-9]+)(.*)'

Using regexü_matches this will produce an undesirable result (only one digit
in group 2), but everything behaves correctly, the third group matches until
the end.

'(.*?)([+-]?[0-9]*\.[0-9]+)(.*)'

If we change the first group to non-greedy to fix this, then the bug
appears: the third group becomes non-greedy too (it shouldn't!) and
therefore it is always empty instead of matching until the end of the line.
Also the first group is empty (should match from start!), it should find a
match at start position, whether it is non-greedy or not and not look ahead
if the non-greedy group can be reduced if starting to match at the next
index. Both are wrong behaviors.

(the workaround is anchoring, but the behavior of the regex is still wrong)

link: http://sqlfiddle.com/#!15/f0f14/14

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2David G. Johnston
david.g.johnston@gmail.com
In reply to: Christian Mächler (#1)
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

On Monday, August 3, 2015, <christian_maechler@hotmail.com> wrote:

The following bug has been logged on the website:

Bug reference: 13538
Logged by: Chris Mächler
Email address: christian_maechler@hotmail.com <javascript:;>
PostgreSQL version: 9.3.0
Operating system: ?
Description:

Here is an example to verify and reproduce the error (extract a number and
the things before and after it with 3 groups):

'(.*)([+-]?[0-9]*\.[0-9]+)(.*)'

Using regexü_matches this will produce an undesirable result (only one
digit
in group 2), but everything behaves correctly, the third group matches
until
the end.

'(.*?)([+-]?[0-9]*\.[0-9]+)(.*)'

If we change the first group to non-greedy to fix this, then the bug
appears: the third group becomes non-greedy too (it shouldn't!) and
therefore it is always empty instead of matching until the end of the line.
Also the first group is empty (should match from start!), it should find a
match at start position, whether it is non-greedy or not and not look ahead
if the non-greedy group can be reduced if starting to match at the next
index. Both are wrong behaviors.

(the workaround is anchoring, but the behavior of the regex is still wrong)

link: http://sqlfiddle.com/#!15/f0f14/14

Reading the documentation this seems to be working as intended.

http://www.postgresql.org/docs/9.3/static/functions-matching.html#POSIX-MATCHING-RULES

On what are you basing your concept of correctness? Specifically, what
language implementation do you consider "right"?

The TCL implementation used by PostgreSQL has some differences compared to
Java and Perl, the two I am most familiar with.

David J.

#3Christian Mächler
christian_maechler@hotmail.com
In reply to: David G. Johnston (#2)
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

Sorry probably sent my mail to the wrong address before.

Here some more detailed examples to show why the behavior of the 3rd group is clearly wrong also according to the specification:

abc0.123def applying the first regex (with optional dot)--> abc0.12 , 3, def

abc0.123def applying the 2nd regex (with optional dot) --> , 3 ,

Although the 3rd group wasn't touched it changed it's behavior from geedy to non-greedy. abc, 0.123, def would be correct.

The behavior of the non-greedy group isn't correct either, because a match should return the FIRST match found, greedyness is only about the FOLLOWING characters, but because it will already find a match at start position it should return that and not keep looking to find another match with reduced group size.

I was just trying to help, because I think this is a pretty big issue, although not using regex atm.

Chris

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christian Mächler (#3)
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

=?iso-8859-1?B?Q2hyaXN0aWFuIE3kY2hsZXI=?= <christian_maechler@hotmail.com> writes:

Here some more detailed examples to show why the behavior of the 3rd group is clearly wrong also according to the specification:

What specification are you reading? The POSIX standard for regular
expressions doesn't mention non-greedy quantifiers at all.

As David says, these examples appear to be following what's stated in
http://www.postgresql.org/docs/9.4/static/functions-matching.html#POSIX-MATCHING-RULES
The Spencer regex engine we use has a notion of greediness or
non-greediness of the entire regex, and further that that takes precedence
for determining the overall match length over greediness of individual
subexpressions. That behavior might be inconvenient for this particular
use-case, but that doesn't make it a bug.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#4)
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

I wrote:

As David says, these examples appear to be following what's stated in
http://www.postgresql.org/docs/9.4/static/functions-matching.html#POSIX-MATCHING-RULES
The Spencer regex engine we use has a notion of greediness or
non-greediness of the entire regex, and further that that takes precedence
for determining the overall match length over greediness of individual
subexpressions. That behavior might be inconvenient for this particular
use-case, but that doesn't make it a bug.

BTW, perhaps it would be worth adding an example to that section that
shows how to control this behavior. The trick is obvious once you've seen
it, but not so much otherwise: you add something to the start of the regex
that establishes the overall greediness you want, but can never actually
match any characters. "\0*" or "\0*?" will work fine in Postgres
use-cases since there can never be a NUL character in the data.

regression=# select regexp_matches('abc01234xyz', '(.*)(\d+)(.*)');
regexp_matches
-----------------
{abc0123,4,xyz}
(1 row)

regression=# select regexp_matches('abc01234xyz', '(.*?)(\d+)(.*)');
regexp_matches
----------------
{abc,0,""}
(1 row)

regression=# select regexp_matches('abc01234xyz', '\0*(.*?)(\d+)(.*)');
regexp_matches
-----------------
{abc,01234,xyz}
(1 row)

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#6Christian Mächler
christian_maechler@hotmail.com
In reply to: Tom Lane (#4)
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

You say it is okay that a greedy group suddenly becomes non-greedy if ANOTHER group is made non-greedy?

I've chosen a simple example, but I'm pretty sure I could construct several use-cases which can be solved easily if the regex behaves like in java, javaScript, perl etc. but not with how it is done here. It's clearly not a feature. Already simple things like ending a match with any amount of numbers will become difficult if non-greedy groups are present, e.g. instead of ...([0-9]+) you will have to write ...([0-9]+)(?![0-9]) makes things easier...

Seriously I didn't want to start a debate whether this is right or wrong, because I honestly can't understand how anyone could defend the behavior mentioned in the first sentence of this message. As I said, I just wanted to point out that there is a bug to help improve, but if you prefer it like this it is fine with me, I just think then you probably haven't used regex that much.

Chris

Show quoted text

From: tgl@sss.pgh.pa.us
To: christian_maechler@hotmail.com
CC: david.g.johnston@gmail.com; pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)
Date: Tue, 4 Aug 2015 10:58:57 -0400

=?iso-8859-1?B?Q2hyaXN0aWFuIE3kY2hsZXI=?= <christian_maechler@hotmail.com> writes:

Here some more detailed examples to show why the behavior of the 3rd group is clearly wrong also according to the specification:

What specification are you reading? The POSIX standard for regular
expressions doesn't mention non-greedy quantifiers at all.

As David says, these examples appear to be following what's stated in
http://www.postgresql.org/docs/9.4/static/functions-matching.html#POSIX-MATCHING-RULES
The Spencer regex engine we use has a notion of greediness or
non-greediness of the entire regex, and further that that takes precedence
for determining the overall match length over greediness of individual
subexpressions. That behavior might be inconvenient for this particular
use-case, but that doesn't make it a bug.

regards, tom lane

#7David G. Johnston
david.g.johnston@gmail.com
In reply to: Christian Mächler (#6)
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

On Tue, Aug 4, 2015 at 8:39 AM, Christian Mächler <
christian_maechler@hotmail.com> wrote:

You say it is okay that a greedy group suddenly becomes non-greedy if
ANOTHER group is made non-greedy?

I've chosen a simple example, but I'm pretty sure I could construct
several use-cases which can be solved easily if the regex behaves like in
java, javaScript, perl etc. but not with how it is done here. It's clearly
not a feature. Already simple things like ending a match with any amount of
numbers will become difficult if non-greedy groups are present, e.g.
instead of ...([0-9]+) you will have to write ...([0-9]+)(?![0-9]) makes
things easier...

Seriously I didn't want to start a debate whether this is right or wrong,
because I honestly can't understand how anyone could defend the behavior
mentioned in the first sentence of this message. As I said, I just wanted
to point out that there is a bug to help improve, but if you prefer it like
this it is fine with me, I just think then you probably haven't used regex
that much.

​I use RegEx quite a bit and while I will agree with the sentiment it would
be nearly impossible to replace the existing implementation. Fortunately,
PostgreSQL is quite extensible and so if you need a more flexible RegEx
implementation you can write a function in pl/perl or pl/v8 and use their
implementations.

About the only thing that would make sense would be to get the TCL
implementation to accept the "possessive" modifier (+ in Java:
https://docs.oracle.com/javase/tutorial/essential/regex/quant.html) and let
it override the "overall greediness" aspect of the matching region while
leaving the unadorned case to use the overall aspect.

There is probable a bit more consideration than the brief amount I've done
here - though I think the point has been made. Right or wrong it is a
design choice that has been made and in use for many years. It works well
enough, and options/hacks exist, that finding someone who wants to dedicate
resources to improving the situation is likely to be difficult.

David J.

#8David G. Johnston
david.g.johnston@gmail.com
In reply to: Tom Lane (#5)
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

On Tue, Aug 4, 2015 at 8:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

As David says, these examples appear to be following what's stated in

http://www.postgresql.org/docs/9.4/static/functions-matching.html#POSIX-MATCHING-RULES

The Spencer regex engine we use has a notion of greediness or
non-greediness of the entire regex, and further that that takes

precedence

for determining the overall match length over greediness of individual
subexpressions. That behavior might be inconvenient for this particular
use-case, but that doesn't make it a bug.

BTW, perhaps it would be worth adding an example to that section that
shows how to control this behavior. The trick is obvious once you've seen
it, but not so much otherwise: you add something to the start of the regex
that establishes the overall greediness you want, but can never actually
match any characters. "\0*" or "\0*?" will work fine in Postgres
use-cases since there can never be a NUL character in the data.

regression=# select regexp_matches('abc01234xyz', '(.*)(\d+)(.*)');
regexp_matches
-----------------
{abc0123,4,xyz}
(1 row)

regression=# select regexp_matches('abc01234xyz', '(.*?)(\d+)(.*)');
regexp_matches
----------------
{abc,0,""}
(1 row)

regression=# select regexp_matches('abc01234xyz', '\0*(.*?)(\d+)(.*)');
regexp_matches
-----------------
{abc,01234,xyz}
(1 row)

​+1

David J.​

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: David G. Johnston (#8)
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

"David G. Johnston" <david.g.johnston@gmail.com> writes:

On Tue, Aug 4, 2015 at 8:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, perhaps it would be worth adding an example to that section that
shows how to control this behavior.

+1

When I looked closer, I noticed that the docs already had a recommendation
about how to force overall greediness or non-greediness, and it was
cleaner than the \0* hack I'd come up with on the spur of the moment.
So I extended that with an example.

For-the-archives: it strikes me that if we ever did want to break
backwards compatibility here in order to make it act a bit more like
Perl's regexps, we could try making the concatenation rule be that the
overall RE inherits the greediness of its last quantified atom rather than
its first one. But I'm not sure how close an approximation that would
produce to Perl's engine's behavior; there would probably still be some
discrepancies. I doubt it's worth breaking backwards compatibility for,
if we'd still get complaints from Perl users that our regex engine is
broken because it's not bug-compatible with Perl's.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#10John R Pierce
pierce@hogranch.com
In reply to: Tom Lane (#9)
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

On 8/4/2015 6:20 PM, Tom Lane wrote:

if we'd still get complaints from Perl users that our regex engine is
broken because it's not bug-compatible with Perl's.

Some wag once said... "Some folks, when faced with a problem, go, 'I
know, I'll use Regular Expressions'. Now they have TWO problems!"

--
john r pierce, recycling bits in santa cruz

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs