BUG #17761: Questionable regular expression behavior

Started by PG Bug reporting formabout 3 years ago4 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 17761
Logged by: Konstantin Geordzhev
Email address: kosiodg@yahoo.com
PostgreSQL version: 11.10
Operating system: tested online
Description:

Executing:
select regexp_matches('a 1x1250x2500',
'(a).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?');
returns: {a,1,1,NULL}
while executing:
select regexp_matches('a 1x1250x2500',
'(a|b).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?');
returns: {a,1,1250,2500}

Shouldn't both results be equal?

Tested online at:
https://extendsclass.com/postgresql-online.html
and on ubuntu version 9.5

In reply to: PG Bug reporting form (#1)
Re: BUG #17761: Questionable regular expression behavior

On Fri, Jan 27, 2023 at 09:27:35AM +0000, PG Bug reporting form wrote:

The following bug has been logged on the website:

Bug reference: 17761
Logged by: Konstantin Geordzhev
Email address: kosiodg@yahoo.com
PostgreSQL version: 11.10
Operating system: tested online
Description:

Executing:
select regexp_matches('a 1x1250x2500',
'(a).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?');
returns: {a,1,1,NULL}
while executing:
select regexp_matches('a 1x1250x2500',
'(a|b).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?');
returns: {a,1,1250,2500}

Shouldn't both results be equal?

The problem is, afair, that there is some state in pg's regexp engine
that makes greedy/ungreedy decision once per regexp.

I don't recall details, but my take from back when I learned about it
(years ago) is to try to avoid things like .*?

Instead you can:

#v+
$ select regexp_matches('a 1x1250x2500', '(a)\D*([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?');
regexp_matches
─────────────────
{a,1,1250,2500}
(1 row)
#v-

depesz

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: hubert depesz lubaczewski (#2)
Re: BUG #17761: Questionable regular expression behavior

hubert depesz lubaczewski <depesz@depesz.com> writes:

On Fri, Jan 27, 2023 at 09:27:35AM +0000, PG Bug reporting form wrote:

Executing:
select regexp_matches('a 1x1250x2500',
'(a).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?');
returns: {a,1,1,NULL}
while executing:
select regexp_matches('a 1x1250x2500',
'(a|b).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?');
returns: {a,1,1250,2500}

Shouldn't both results be equal?

The problem is, afair, that there is some state in pg's regexp engine
that makes greedy/ungreedy decision once per regexp.

Yeah. Without having traced through it, I'm fairly sure that in the
first case, we have "(a)" which has no greediness, then ".*?" which
is non-greedy, and then that determines the overall greediness as
non-greedy, so it goes for the shortest overall match not the longest.

In the second case, "(a|b)" is greedy because anything involving "|"
is greedy, so we immediately decide we'll be greedy overall.

The fine manual explains how you can force greediness or non-greediness
when the engine's default rules for that don't do what you want.

regards, tom lane

#4Kosio Dimitrov
kosiodg@yahoo.com
In reply to: Tom Lane (#3)
Re: BUG #17761: Questionable regular expression behavior

Yes, greediness seems to be the case,One other solution I found to make it greedy is to add '.*$' at the end:'(a).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?.*$'

В петък, 27 януари 2023 г., 18:05:01 ч. Гринуич+2, Tom Lane <tgl@sss.pgh.pa.us> написа:

hubert depesz lubaczewski <depesz@depesz.com> writes:

On Fri, Jan 27, 2023 at 09:27:35AM +0000, PG Bug reporting form wrote:

Executing:
select regexp_matches('a 1x1250x2500',
'(a).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?');
returns: {a,1,1,NULL}
while executing:
select regexp_matches('a 1x1250x2500',
'(a|b).*?([1-9]\d*)\s*x\s*([1-9]\d*)(?:\s*x\s*([1-9]\d*))?');
returns: {a,1,1250,2500}

Shouldn't both results be equal?

The problem is, afair, that there is some state in pg's regexp engine
that makes greedy/ungreedy decision once per regexp.

Yeah.  Without having traced through it, I'm fairly sure that in the
first case, we have "(a)" which has no greediness, then ".*?" which
is non-greedy, and then that determines the overall greediness as
non-greedy, so it goes for the shortest overall match not the longest.

In the second case, "(a|b)" is greedy because anything involving "|"
is greedy, so we immediately decide we'll be greedy overall.

The fine manual explains how you can force greediness or non-greediness
when the engine's default rules for that don't do what you want.

            regards, tom lane