Regexp_replace bug / does not terminate on long strings

Started by Markhof, Ingolfalmost 5 years ago14 messagesgeneral

ingolf.markhof@de.verizon.com

almost 5 years ago

BRIEF:

regexp_replace(source,pattern,replacement,flags) needs very (!) long to
complete or does not complete at all (?!) for big input strings (a few k
characters). (Oracle SQL completes the same in a few ms)

VERBOSE

Given a comma-separated list of "words" (whereas a word is any sequence of
characters not containing a comma) I want to delete sequences of
duplicates, e.g.

'one,one,one,two,two,three' --> 'one,two,three'

I use this regexp_replace to delete duplicate "words" in a comma separated
string:

select regexp_replace(
'one,one,one,two,two,three', -- input string
'([^,]+)(,\1)*($|,)', -- pattern
'\1\3', -- replacement
'g' -- apply globally (all matches)
);

However, I found that this regexp_replace seemingly runs "endless" (I
cancelled the query after a few minutes) for bigger input strings. Such as:

'1/1,1/1,1/1,1/1,1/1,1/1,1/1,1/1,1/1,1/1,1/1,2/1,2/1,2/1,2/1,2/2,2/1,2/1,2/2,2/2,2/1,2/1,2/2,2/2,2/1,2/2,2/1,2/1,2/2,2/1,2/1,2/2,3/1,3/1,3/2,3/1,3/1,3/1,3/1,3/1,3/1,3/3,3/1,3/1,3/1,3/3,3/1,3/1,3/2,3/1,3/1,3/1,3/3,3/3,3/1,3/1,4/1,4/1,4/1,4/1,4/1,4/1,5/2,5/5,5/1,5/5,5/5,5/2,5/1,5/1,5/5,5/1,6/1,6/1,6/1,6/1,6/3,6/6,6/1,6/1,6/1,6/3,6/1,6/1,6/1,6/1,6/1,6/1,6/6,6/1,7/1,7/3,7/1,7/1,7/1,7/5,7/1,7/1,7/1,7/1,7/1,7/1,7/1,7/5,7/1,7/3,7/1,8/1,8/1,8/2,8/2,8/1,8/6,8/1,8/1,8/1,8/1,8/6,8/1,8/1,8/1,9/2,9/1,9/1,9/2,10/4,10/2,10/2,10/2,10/2,10/1,10/4,10/10,10/10,10/2,10/1,10/1,10/2,10/1,10/8,10/1,10/3,10/2,10/5,10/10,10/2,10/10,10/2,10/3,10/1,10/1,10/1,10/1,10/8,10/5,12/5,12/3,12/5,12/5,12/1,12/5,12/1,12/3,12/1,12/1,12/5,12/2,12/1,12/0.768,12/1,12/2,12/2,12/2,12/2,12/2,12/1,12/1,14/3,15/1,15/10,15/1,15/2,15/3,15/2,15/1,15/15,15/1,15/2,15/4,15/15,15/5,15/1,15/2,15/15,15/1,15/5,15/1,15/3,15/5,15/5,15/1,15/10,15/4,15/2,15/2,15/15,15/3,15/2,15/2,15/3,15/3,16/3,16/3,18/4,18/3,18/1,18/2,18/2,18/2,18/4,18/2,18/2,18/1,18/3,20/20,20/5,20/0.896,20/5,20/1,20/2,20/1,20/3,20/4,20/5,20/10,20/20,20/10,20/5,20/1,20/1,20/4,20/2,20/1,20/1,20/3,20/4,20/20,20/2,20/20,20/2,20/20,20/20,24/3,24/4,24/2,24/4,24/3,24/3,24/3,24/3,25/3,25/4,25/10,25/25,25/6,25/1,25/7,25/2,25/5,25/2,25/2,25/25,25/3,25/25,25/25,25/10,25/3,25/10,25/25,25/6,25/4,25/25,25/5,25/5,25/3,25/1,25/5,25/10,25/25,25/7,25/14,25/5,25/5,25/5,25/3,25/5,25/14,25/2,30/2,30/7,30/3,30/8,30/15,30/1,30/4,30/7,30/2,30/5,30/30,30/8,30/5,30/5,30/5,30/8,30/8,30/1,30/10,30/3,30/30,30/10,30/4,30/30,30/30,30/3,30/1,30/15,30/3,30/5,30/3,35/5,35/12,35/5,35/10,35/3,35/3,35/4,35/10,35/12,35/5,35/5,35/4,40/15,40/4,40/40,40/2,40/10,40/10,40/5,40/3,40/3,40/40,40/4,40/15,40/10,40/2,40/5,45/6,45/6,45/6,45/6,45/6,50/8,50/4,50/50,50/8,50/15,50/7,50/3,50/20,50/25,50/50,50/5,50/50,50/12,50/7,50/4,50/15,50/10,50/8,50/3,50/2,50/20,50/25,50/10,50/8,50/50,50/5,50/5,50/50,50/10,50/10,50/10,50/5,50/5,50/4,50/10,50/50,50/5,50/50,50/8,50/8,50/50,50/50,50/10,50/12,50/2,50/5,55/10,55/10,55/5,55/5,60/3,60/25,60/4,60/60,60/30,60/25,60/6,60/6,60/10,60/5,60/5,60/3,60/10,60/5,60/5,60/10,60/30,60/6,60/5,60/10,60/60,70/10,70/10,75/15,75/3,75/4,75/10,75/20,75/5,75/6,75/8,75/6,75/7,75/75,75/7,75/15,75/25,75/15,75/7,75/3,75/15,75/8,75/30,75/75,75/8,75/8,75/5,75/20,75/75,75/6,75/30,75/15,75/75,75/8,75/25,75/15,75/7,75/75,75/10,75/4,80/5,80/5,90/50,90/50,100/7,100/10,100/10,100/10,100/10,100/10,100/8,100/20,100/10,100/20,100/20,100/5,100/25,100/8,100/5,100/100,100/8,100/20,100/8,100/10,100/20,100/7,100/6,100/50,100/15,100/10,100/2,100/35,100/10,100/10,100/35,100/30,100/100,100/5,100/40,100/35,100/100,100/50,100/35,100/30,100/7,100/10,100/10,100/7,100/25,100/100,100/40,100/5,100/15,100/6,100/7,100/20,100/10,100/2,100/20,105/10,105/20,105/10,105/20,120/15,120/10,120/15,120/15,150/150,150/150,150/75,150/20,150/20,150/20,150/30,150/75,150/25,150/15,150/25,150/150,150/15,150/6,150/6,150/30,150/20,200/15,200/15,200/20,200/10,200/40,200/15,200/40,200/50,200/7,200/20,200/15,200/25,200/20,200/7,200/15,200/7,200/15,200/20,200/20,200/200,200/15,200/50,200/10,200/20,200/20,200/20,200/200,200/20,200/25,200/7,240/15,240/15,250/20,250/50,250/20,250/10,250/10,250/25,250/250,250/250,250/25,250/50,250/25,300/20,300/20,300/30,300/7,300/20,300/300,300/300,300/20,300/30,300/20,300/7,300/10,300/20,300/20,300/30,300/20,300/7,300/7,300/20,300/30,300/30,300/300,300/50,300/300,300/30,300/10,300/300,300/300,300/50,300/300,400/20,400/20,400/25,400/25,450/50,450/50,500/500,500/50,500/50,500/50,500/50,500/500,500/500,500/500,500/35,500/25,500/35,500/25,500/500,600/40,600/40,1000/20,1000/40,1000/20,1000/1000,1000/20,1000/35,1000/1000,1000/20,1000/50,1000/500,1000/50,1000/50,1000/1000,1000/500,1000/1000,1000/35,1000/40'

I also run the same in Oracle SQL where the same completes "immediately"
(with in a few ms).

There seems to be some issue in the implementation of regexp_replace or
regular expressions in general in PostgeSQL...

FYI: I have:

select version();
PostgreSQL 11.9 on aarch64-unknown-linux-gnu, compiled by
aarch64-unknown-linux-gnu-gcc (GCC) 7.4.0, 64-bit

Any comments?

======================================================================

Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio

Tom Lane

tgl@sss.pgh.pa.us

almost 5 years ago

In reply to: Markhof, Ingolf (#1)

Re: Regexp_replace bug / does not terminate on long strings

"Markhof, Ingolf" <ingolf.markhof@de.verizon.com> writes:

BRIEF:
regexp_replace(source,pattern,replacement,flags) needs very (!) long to
complete or does not complete at all (?!) for big input strings (a few k
characters). (Oracle SQL completes the same in a few ms)

Regexps containing backrefs are inherently hard --- every engine has
strengths and weaknesses. I doubt it'd be hard to find cases where
our engine is orders of magnitude faster than Oracle's; but you've
hit on a case where the opposite is true.

The core of the problem is that it's hard to tell how much of the
string could be matched by the (,\1)* subpattern. In principle, *all*
of the remaining string could be, if it were N repetitions of the
initial word. Or it could be N-1 repetitions followed by one other
word, and so on. The difficulty is that since our engine guarantees
to find the longest feasible match, it tries these options from
longest to shortest. Usually the actual match (if any) will be pretty
short, so that you have O(N) wasted work per word, making the runtime
at least O(N^2).

I think your best bet is to not try to eliminate multiple duplicates
at a time. Get rid of one dup at a time, say by
str := regexp_replace(str, '([^,]+)(,\1)?($|,)', '\1\3', 'g');
and repeat till the string doesn't get any shorter.

I did come across a performance bug [1]/messages/by-id/1808998.1629412269@sss.pgh.pa.us while poking at this, but
alas fixing it doesn't move the needle very much for this example.

regards, tom lane

[1]: /messages/by-id/1808998.1629412269@sss.pgh.pa.us

Michael Lewis

mlewis@entrata.com

almost 5 years ago

In reply to: Tom Lane (#2)

Re: Regexp_replace bug / does not terminate on long strings

If you need it ordered, this is a bit awkward but works and returns for me
in about 5ms on my dev machine.

select string_agg( value, ',' ) As final_result from(

select

value,

min( row_num ) as min_row_num

from(

select

sub.value,

row_number() over () as row_num

from

( select unnest( string_to_array(
'1/1,1/1,1/1,1/1,1/1,1/1,1/1,1/1,1/1,1/1,1/1,2/1,2/1,2/1,2/1,2/2,2/1,2/1,2/2,2/2,2/1,2/1,2/2,2/2,2/1,2/2,2/1,2/1,2/2,2/1,2/1,2/2,3/1,3/1,3/2,3/1,3/1,3/1,3/1,3/1,3/1,3/3,3/1,3/1,3/1,3/3,3/1,3/1,3/2,3/1,3/1,3/1,3/3,3/3,3/1,3/1,4/1,4/1,4/1,4/1,4/1,4/1,5/2,5/5,5/1,5/5,5/5,5/2,5/1,5/1,5/5,5/1,6/1,6/1,6/1,6/1,6/3,6/6,6/1,6/1,6/1,6/3,6/1,6/1,6/1,6/1,6/1,6/1,6/6,6/1,7/1,7/3,7/1,7/1,7/1,7/5,7/1,7/1,7/1,7/1,7/1,7/1,7/1,7/5,7/1,7/3,7/1,8/1,8/1,8/2,8/2,8/1,8/6,8/1,8/1,8/1,8/1,8/6,8/1,8/1,8/1,9/2,9/1,9/1,9/2,10/4,10/2,10/2,10/2,10/2,10/1,10/4,10/10,10/10,10/2,10/1,10/1,10/2,10/1,10/8,10/1,10/3,10/2,10/5,10/10,10/2,10/10,10/2,10/3,10/1,10/1,10/1,10/1,10/8,10/5,12/5,12/3,12/5,12/5,12/1,12/5,12/1,12/3,12/1,12/1,12/5,12/2,12/1,12/0.768,12/1,12/2,12/2,12/2,12/2,12/2,12/1,12/1,14/3,15/1,15/10,15/1,15/2,15/3,15/2,15/1,15/15,15/1,15/2,15/4,15/15,15/5,15/1,15/2,15/15,15/1,15/5,15/1,15/3,15/5,15/5,15/1,15/10,15/4,15/2,15/2,15/15,15/3,15/2,15/2,15/3,15/3,16/3,16/3,18/4,18/3,18/1,18/2,18/2,18/2,18/4,18/2,18/2,18/1,18/3,20/20,20/5,20/0.896,20/5,20/1,20/2,20/1,20/3,20/4,20/5,20/10,20/20,20/10,20/5,20/1,20/1,20/4,20/2,20/1,20/1,20/3,20/4,20/20,20/2,20/20,20/2,20/20,20/20,24/3,24/4,24/2,24/4,24/3,24/3,24/3,24/3,25/3,25/4,25/10,25/25,25/6,25/1,25/7,25/2,25/5,25/2,25/2,25/25,25/3,25/25,25/25,25/10,25/3,25/10,25/25,25/6,25/4,25/25,25/5,25/5,25/3,25/1,25/5,25/10,25/25,25/7,25/14,25/5,25/5,25/5,25/3,25/5,25/14,25/2,30/2,30/7,30/3,30/8,30/15,30/1,30/4,30/7,30/2,30/5,30/30,30/8,30/5,30/5,30/5,30/8,30/8,30/1,30/10,30/3,30/30,30/10,30/4,30/30,30/30,30/3,30/1,30/15,30/3,30/5,30/3,35/5,35/12,35/5,35/10,35/3,35/3,35/4,35/10,35/12,35/5,35/5,35/4,40/15,40/4,40/40,40/2,40/10,40/10,40/5,40/3,40/3,40/40,40/4,40/15,40/10,40/2,40/5,45/6,45/6,45/6,45/6,45/6,50/8,50/4,50/50,50/8,50/15,50/7,50/3,50/20,50/25,50/50,50/5,50/50,50/12,50/7,50/4,50/15,50/10,50/8,50/3,50/2,50/20,50/25,50/10,50/8,50/50,50/5,50/5,50/50,50/10,50/10,50/10,50/5,50/5,50/4,50/10,50/50,50/5,50/50,50/8,50/8,50/50,50/50,50/10,50/12,50/2,50/5,55/10,55/10,55/5,55/5,60/3,60/25,60/4,60/60,60/30,60/25,60/6,60/6,60/10,60/5,60/5,60/3,60/10,60/5,60/5,60/10,60/30,60/6,60/5,60/10,60/60,70/10,70/10,75/15,75/3,75/4,75/10,75/20,75/5,75/6,75/8,75/6,75/7,75/75,75/7,75/15,75/25,75/15,75/7,75/3,75/15,75/8,75/30,75/75,75/8,75/8,75/5,75/20,75/75,75/6,75/30,75/15,75/75,75/8,75/25,75/15,75/7,75/75,75/10,75/4,80/5,80/5,90/50,90/50,100/7,100/10,100/10,100/10,100/10,100/10,100/8,100/20,100/10,100/20,100/20,100/5,100/25,100/8,100/5,100/100,100/8,100/20,100/8,100/10,100/20,100/7,100/6,100/50,100/15,100/10,100/2,100/35,100/10,100/10,100/35,100/30,100/100,100/5,100/40,100/35,100/100,100/50,100/35,100/30,100/7,100/10,100/10,100/7,100/25,100/100,100/40,100/5,100/15,100/6,100/7,100/20,100/10,100/2,100/20,105/10,105/20,105/10,105/20,120/15,120/10,120/15,120/15,150/150,150/150,150/75,150/20,150/20,150/20,150/30,150/75,150/25,150/15,150/25,150/150,150/15,150/6,150/6,150/30,150/20,200/15,200/15,200/20,200/10,200/40,200/15,200/40,200/50,200/7,200/20,200/15,200/25,200/20,200/7,200/15,200/7,200/15,200/20,200/20,200/200,200/15,200/50,200/10,200/20,200/20,200/20,200/200,200/20,200/25,200/7,240/15,240/15,250/20,250/50,250/20,250/10,250/10,250/25,250/250,250/250,250/25,250/50,250/25,300/20,300/20,300/30,300/7,300/20,300/300,300/300,300/20,300/30,300/20,300/7,300/10,300/20,300/20,300/30,300/20,300/7,300/7,300/20,300/30,300/30,300/300,300/50,300/300,300/30,300/10,300/300,300/300,300/50,300/300,400/20,400/20,400/25,400/25,450/50,450/50,500/500,500/50,500/50,500/50,500/50,500/500,500/500,500/500,500/35,500/25,500/35,500/25,500/500,600/40,600/40,1000/20,1000/40,1000/20,1000/1000,1000/20,1000/35,1000/1000,1000/20,1000/50,1000/500,1000/50,1000/50,1000/1000,1000/500,1000/1000,1000/35,1000/40',
',' ) ) as value

) as sub

) sub_outer

group by value

order by min_row_num

) sub_outermost;

/* return value
1/1,2/1,2/2,3/1,3/2,3/3,4/1,5/2,5/5,5/1,6/1,6/3,6/6,7/1,7/3,7/5,8/1,8/2,8/6,9/2,9/1,10/4,10/2,10/1,10/10,10/8,10/3,10/5,12/5,12/3,12/1,12/2,12/0.768,14/3,15/1,15/10,15/2,15/3,15/15,15/4,15/5,16/3,18/4,18/3,18/1,18/2,20/20,20/5,20/0.896,20/1,20/2,20/3,20/4,20/10,24/3,24/4,24/2,25/3,25/4,25/10,25/25,25/6,25/1,25/7,25/2,25/5,25/14,30/2,30/7,30/3,30/8,30/15,30/1,30/4,30/5,30/30,30/10,35/5,35/12,35/10,35/3,35/4,40/15,40/4,40/40,40/2,40/10,40/5,40/3,45/6,50/8,50/4,50/50,50/15,50/7,50/3,50/20,50/25,50/5,50/12,50/10,50/2,55/10,55/5,60/3,60/25,60/4,60/60,60/30,60/6,60/10,60/5,70/10,75/15,75/3,75/4,75/10,75/20,75/5,75/6,75/8,75/7,75/75,75/25,75/30,80/5,90/50,100/7,100/10,100/8,100/20,100/5,100/25,100/100,100/6,100/50,100/15,100/2,100/35,100/30,100/40,105/10,105/20,120/15,120/10,150/150,150/75,150/20,150/30,150/25,150/15,150/6,200/15,200/20,200/10,200/40,200/50,200/7,200/25,200/200,240/15,250/20,250/50,250/10,250/25,250/250,300/20,300/30,300/7,300/300,300/10,300/50,400/20,400/25,450/50,500/500,500/50,500/35,500/25,600/40,1000/20,1000/40,1000/1000,1000/35,1000/50,1000/500

If you don't need the order maintained, it becomes a much simpler problem
and you can strip off some of this complexity.

*Michael Lewis | Database Engineer*
*Entrata*

On Thu, Aug 19, 2021 at 4:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

"Markhof, Ingolf" <ingolf.markhof@de.verizon.com> writes:

BRIEF:
regexp_replace(source,pattern,replacement,flags) needs very (!) long to
complete or does not complete at all (?!) for big input strings (a few k
characters). (Oracle SQL completes the same in a few ms)

Regexps containing backrefs are inherently hard --- every engine has
strengths and weaknesses. I doubt it'd be hard to find cases where
our engine is orders of magnitude faster than Oracle's; but you've
hit on a case where the opposite is true.

The core of the problem is that it's hard to tell how much of the
string could be matched by the (,\1)* subpattern. In principle, *all*
of the remaining string could be, if it were N repetitions of the
initial word. Or it could be N-1 repetitions followed by one other
word, and so on. The difficulty is that since our engine guarantees
to find the longest feasible match, it tries these options from
longest to shortest. Usually the actual match (if any) will be pretty
short, so that you have O(N) wasted work per word, making the runtime
at least O(N^2).

I think your best bet is to not try to eliminate multiple duplicates
at a time. Get rid of one dup at a time, say by
str := regexp_replace(str, '([^,]+)(,\1)?($|,)', '\1\3', 'g');
and repeat till the string doesn't get any shorter.

I did come across a performance bug [1] while poking at this, but
alas fixing it doesn't move the needle very much for this example.

regards, tom lane

[1]
/messages/by-id/1808998.1629412269@sss.pgh.pa.us

Michael Lewis

mlewis@entrata.com

almost 5 years ago

In reply to: Michael Lewis (#3)

Re: Regexp_replace bug / does not terminate on long strings

Btw- My apologies for top posting. I think my caffeine wore off.

Markhof, Ingolf

ingolf.markhof@de.verizon.com

almost 5 years ago

In reply to: Tom Lane (#2)

Re: [E] Re: Regexp_replace bug / does not terminate on long strings

Hi Tom,

thank you very much for your reply. Actually, I was assuming all these
regular expressions are based on the same core implementation.
Interestingly, this doesn't seem to be true...

I am also surprised that you say the (\1)+ subpattern is computationally
expensive. Regular expressions are greedy by default. I.e. in case of a*
matching against a string of 1000 a's, the system will not try a, aa, aaa,
... and so on, right? Instead, it will consume all the a's in one go.
Likewise, I was expecting that the system would eat all the repetitions of
a word in one go. Your proposal to use (\1)? instead at the first glance
seems to require more effort, because the same word and it's matching
successor will need to be matched again and again and again. Roughly, 2N
matches are to be done instead of just N.

However, you are perfectly right: When I use (\1)? instead of (\1)+, the
expression is evaluates quickly!

Thank you very much for looking into this and for proposing the alternative
approach which is working fine.

Regards
Ingolf

On Fri, Aug 20, 2021 at 12:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Markhof, Ingolf" <ingolf.markhof@de.verizon.com> writes:

BRIEF:
regexp_replace(source,pattern,replacement,flags) needs very (!) long to
complete or does not complete at all (?!) for big input strings (a few k
characters). (Oracle SQL completes the same in a few ms)

Regexps containing backrefs are inherently hard --- every engine has
strengths and weaknesses. I doubt it'd be hard to find cases where
our engine is orders of magnitude faster than Oracle's; but you've
hit on a case where the opposite is true.

The core of the problem is that it's hard to tell how much of the
string could be matched by the (,\1)* subpattern. In principle, *all*
of the remaining string could be, if it were N repetitions of the
initial word. Or it could be N-1 repetitions followed by one other
word, and so on. The difficulty is that since our engine guarantees
to find the longest feasible match, it tries these options from
longest to shortest. Usually the actual match (if any) will be pretty
short, so that you have O(N) wasted work per word, making the runtime
at least O(N^2).

I think your best bet is to not try to eliminate multiple duplicates
at a time. Get rid of one dup at a time, say by
str := regexp_replace(str, '([^,]+)(,\1)?($|,)', '\1\3', 'g');
and repeat till the string doesn't get any shorter.

I did come across a performance bug [1] while poking at this, but
alas fixing it doesn't move the needle very much for this example.

regards, tom lane

[1]
https://urldefense.proofpoint.com/v2/url?u=https-3A__www.postgresql.org_message-2Did_1808998.1629412269-2540sss.pgh.pa.us&d=DwIBAg&c=udBTRvFvXC5Dhqg7UHpJlPps3mZ3LRxpb6__0PomBTQ&r=ivZWA-ECVj3XrXBe0obDwKY7Ui7K5Nj9oD2KKWLm0Bw&m=q11bVTHCxVx8BQu2pjn6-3nOY8aN8hORXofVK38HqF8&s=hJxrzmTT6G7AUomoeFgh0IGDO3NcUP4gB9kvYHnt3m0&e=

======================================================================

Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio

Markhof, Ingolf

ingolf.markhof@de.verizon.com

almost 5 years ago

In reply to: Markhof, Ingolf (#5)

Re: [E] Re: Regexp_replace bug / does not terminate on long strings

Thank you very much for all your proposals!

Ingolf

======================================================================

Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio

Tom Lane

tgl@sss.pgh.pa.us

almost 5 years ago

In reply to: Markhof, Ingolf (#5)

Re: [E] Re: Regexp_replace bug / does not terminate on long strings

"Markhof, Ingolf" <ingolf.markhof@de.verizon.com> writes:

thank you very much for your reply. Actually, I was assuming all these
regular expressions are based on the same core implementation.

They are not. There are at least three fundamentally different
implementation technologies (DFA, NFA, hybrid). Friedl's "Mastering
Regular Expressions" cites multiple different programs using each
of those, every one of which behaves a bit differently when you start
poking at corner cases. And that's just in the open-source world;
I don't know what Oracle is using, but I bet it ain't open source.

I am also surprised that you say the (\1)+ subpattern is computationally
expensive. Regular expressions are greedy by default. I.e. in case of a*
matching against a string of 1000 a's, the system will not try a, aa, aaa,
... and so on, right? Instead, it will consume all the a's in one go.

"a*" is easy. "(a*)\1" is less easy --- if you let the a* consume the
whole string, you will not get a match, even though one is possible.
In general, backrefs create a mess in what would otherwise be a pretty
straightforward concept :-(.

regards, tom lane

Mark Dilger

mark.dilger@enterprisedb.com

almost 5 years ago

In reply to: Tom Lane (#7)

Re: [E] Regexp_replace bug / does not terminate on long strings

On Aug 20, 2021, at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"a*" is easy. "(a*)\1" is less easy --- if you let the a* consume the
whole string, you will not get a match, even though one is possible.
In general, backrefs create a mess in what would otherwise be a pretty
straightforward concept :-(.

The following queries take radically different time to run:

\timing
select regexp_replace(
repeat('someone,one,one,one,one,one,one,', 60),
'(?<=^|,)([^,]+)(?:,\1)+(?=$|,)',
'\1', -- replacement
'g' -- apply globally (all matches)
);

Time: 16476.529 ms (00:16.477)

select regexp_replace(
repeat('someone,one,one,one,one,one,one,', 60),
'(?<=^|,)([^,]+)(?:,\1){5}(?=$|,)',
'\1', -- replacement
'g' -- apply globally (all matches)
);

Time: 1.452 ms

The only difference in the patterns is the + vs. the {5}. It looks to me like the first pattern should greedily match five ",one" matches and be forced to stop since ",someone" doesn't match, and the second pattern should grab the five ",one" matches it was told to grab and not try to grab the ",someone", but other than that, they should be performing the same work. I don't see why the performance should be so different.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Miles Elam

miles.elam@productops.com

almost 5 years ago

In reply to: Mark Dilger (#8)

Re: [E] Regexp_replace bug / does not terminate on long strings

On Fri, Aug 20, 2021 at 12:32 PM Mark Dilger <mark.dilger@enterprisedb.com>
wrote:

The following queries take radically different time to run:

Unbounded ranges seem like a problem. Seems worth trying a range from 1 to
N where you play around with N to find your optimum
performance/functionality tradeoff. {1,20} is like '+' but clamps at 20.

select regexp_replace(
repeat('someone,one,one,one,one,one,one,', 60),
'(?<=^|,)([^,]+)(?:,\1){1,20}(?=$|,)',
'\1', -- replacement
'g' -- apply globally (all matches)
);
- Miles Elam

#10

Mark Dilger

mark.dilger@enterprisedb.com

almost 5 years ago

In reply to: Miles Elam (#9)

Re: [E] Regexp_replace bug / does not terminate on long strings

On Aug 20, 2021, at 12:51 PM, Miles Elam <miles.elam@productops.com> wrote:

Unbounded ranges seem like a problem.

Seems so. The problem appears to be in regcomp.c's repeat() function which handles {1,SOME} differently than {1,INF}

Seems worth trying a range from 1 to N where you play around with N to find your optimum performance/functionality tradeoff. {1,20} is like '+' but clamps at 20.

For any such value (5, 20, whatever) there can always be a string with more repeated words than the number you've chosen, and the call to regexp_replace won't do what you want. There is also an upper bound at work, because values above 255 will draw a regex compilation error. So it seems worth a bit of work to determine why the regex engine has bad performance in these cases.

It sounds like the OP will be working around this problem by refactoring to call regexp_replace multiple times until all repeats are eradicated, but I don't think such workarounds should be necessary.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11

Markhof, Ingolf

ingolf.markhof@de.verizon.com

almost 5 years ago

In reply to: Tom Lane (#7)

Re: [E] Re: Regexp_replace bug / does not terminate on long strings

Right. Considering a longer sequence of a's, "(a*)\1" allows a wide variety
of matches. But in fact, this is not what I was trying to use. I was more
looking at "(a)\1*" which shall match exactly what "a+" matches. As
matching is greedy, "(a)\1*" shall consume all a's in a sequence in one go,
just like "a+" does...?!

Regards,
Ingolf

On Fri, Aug 20, 2021 at 6:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Markhof, Ingolf" <ingolf.markhof@de.verizon.com> writes:

thank you very much for your reply. Actually, I was assuming all these
regular expressions are based on the same core implementation.

They are not. There are at least three fundamentally different
implementation technologies (DFA, NFA, hybrid). Friedl's "Mastering
Regular Expressions" cites multiple different programs using each
of those, every one of which behaves a bit differently when you start
poking at corner cases. And that's just in the open-source world;
I don't know what Oracle is using, but I bet it ain't open source.

I am also surprised that you say the (\1)+ subpattern is computationally
expensive. Regular expressions are greedy by default. I.e. in case of a*
matching against a string of 1000 a's, the system will not try a, aa,

aaa,

... and so on, right? Instead, it will consume all the a's in one go.

"a*" is easy. "(a*)\1" is less easy --- if you let the a* consume the
whole string, you will not get a match, even though one is possible.
In general, backrefs create a mess in what would otherwise be a pretty
straightforward concept :-(.

regards, tom lane

======================================================================

Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio

#12

Markhof, Ingolf

ingolf.markhof@de.verizon.com

almost 5 years ago

In reply to: Markhof, Ingolf (#5)

Re: [E] Re: Regexp_replace bug / does not terminate on long strings

Argh...

Yes, When I use (\1)? instead of (\1)+, the expression is evaluated
quickly, but it doesn't return what I want. Once a word is written, it is
not subject to matching again. i.e.

select regexp_replace( --> remove double entries
'one,one,one,two,two,three,three',
'([^,]+)(,\1)?($|,)',
'\1\3',
'g'
) as res;

returns 'one,one,two,three'. The first 'one' is found, followed by a ',one,
so the 'one,one,' is replaced by a 'one,'. It looks like the system is
'reading' from an input string and writing to output string. And the 'one,'
send to the output is not matched again. So the system instead finds just
another 'one,' in the input string and writes another 'one,' to the output
string.

Honestly, this behaviour seems to be incorrect for me. Once the system
replaces the first two 'one,one,' by a single 'one,', I'd expect to match
this replaced one 'one,' with the next 'one,' following, replacing these
two by another, single 'one,', again...

Regards,
Ingolf

Ingolf Markhof <ingolf.markhof@de.verizon.com>
International Network Product - International Access <
INP-IntlAccess@verizon.com>

Sebrathweg 20, 44149 Dortmund, Germany
Office: +49 231 972 1475 | Vnet: 317-1475

On Fri, Aug 20, 2021 at 6:11 PM Markhof, Ingolf <
ingolf.markhof@de.verizon.com> wrote:

Show quoted text

Hi Tom,

thank you very much for your reply. Actually, I was assuming all these
regular expressions are based on the same core implementation.
Interestingly, this doesn't seem to be true...

I am also surprised that you say the (\1)+ subpattern is computationally
expensive. Regular expressions are greedy by default. I.e. in case of a*
matching against a string of 1000 a's, the system will not try a, aa, aaa,
... and so on, right? Instead, it will consume all the a's in one go.
Likewise, I was expecting that the system would eat all the repetitions of
a word in one go. Your proposal to use (\1)? instead at the first glance
seems to require more effort, because the same word and it's matching
successor will need to be matched again and again and again. Roughly, 2N
matches are to be done instead of just N.

However, you are perfectly right: When I use (\1)? instead of (\1)+, the
expression is evaluates quickly!

Thank you very much for looking into this and for proposing the
alternative approach which is working fine.

Regards
Ingolf

On Fri, Aug 20, 2021 at 12:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Markhof, Ingolf" <ingolf.markhof@de.verizon.com> writes:

BRIEF:
regexp_replace(source,pattern,replacement,flags) needs very (!) long to
complete or does not complete at all (?!) for big input strings (a few k
characters). (Oracle SQL completes the same in a few ms)

Regexps containing backrefs are inherently hard --- every engine has
strengths and weaknesses. I doubt it'd be hard to find cases where
our engine is orders of magnitude faster than Oracle's; but you've
hit on a case where the opposite is true.

The core of the problem is that it's hard to tell how much of the
string could be matched by the (,\1)* subpattern. In principle, *all*
of the remaining string could be, if it were N repetitions of the
initial word. Or it could be N-1 repetitions followed by one other
word, and so on. The difficulty is that since our engine guarantees
to find the longest feasible match, it tries these options from
longest to shortest. Usually the actual match (if any) will be pretty
short, so that you have O(N) wasted work per word, making the runtime
at least O(N^2).

I think your best bet is to not try to eliminate multiple duplicates
at a time. Get rid of one dup at a time, say by
str := regexp_replace(str, '([^,]+)(,\1)?($|,)', '\1\3', 'g');
and repeat till the string doesn't get any shorter.

I did come across a performance bug [1] while poking at this, but
alas fixing it doesn't move the needle very much for this example.

regards, tom lane

[1]
https://urldefense.proofpoint.com/v2/url?u=https-3A__www.postgresql.org_message-2Did_1808998.1629412269-2540sss.pgh.pa.us&d=DwIBAg&c=udBTRvFvXC5Dhqg7UHpJlPps3mZ3LRxpb6__0PomBTQ&r=ivZWA-ECVj3XrXBe0obDwKY7Ui7K5Nj9oD2KKWLm0Bw&m=q11bVTHCxVx8BQu2pjn6-3nOY8aN8hORXofVK38HqF8&s=hJxrzmTT6G7AUomoeFgh0IGDO3NcUP4gB9kvYHnt3m0&e=

#13

Francisco Olarte

folarte@peoplecall.com

almost 5 years ago

In reply to: Markhof, Ingolf (#12)

Re: [E] Re: Regexp_replace bug / does not terminate on long strings

Ingolf:

On Mon, Aug 23, 2021 at 2:39 PM Markhof, Ingolf
<ingolf.markhof@de.verizon.com> wrote:

Yes, When I use (\1)? instead of (\1)+, the expression is evaluated quickly, but it doesn't return what I want. Once a word is written, it is not subject to matching again. i.e.
select regexp_replace( --> remove double entries
'one,one,one,two,two,three,three',
'([^,]+)(,\1)?($|,)',
'\1\3',
'g'
) as res;

...

Honestly, this behaviour seems to be incorrect for me. Once the system replaces the first two 'one,one,' by a single 'one,', I'd expect to match this replaced one 'one,' with the next 'one,' following, replacing these two by another, single 'one,', again...

I think your expectation is misguided. All the regexp engines I've
used do it this way, when asked to match "g"lobally they do
non-overlapping matches, they do not substitute and recurse with the
modified string.

Also, your way opens the door to run-away or infinite loops (
rr('a','a','aa','g') or rr('a','a','a','g'), not to speak of
r('x','','','g') ). Even a misguided r(str, '_+','_','g'), used
sometimes to normalize space runs and similar things, can go into a
loop.

Francisco Olarte.

#14

Markhof, Ingolf

ingolf.markhof@de.verizon.com

almost 5 years ago

In reply to: Francisco Olarte (#13)

Re: [E] Re: Regexp_replace bug / does not terminate on long strings

You are right, I also found the same behaviour when using e.g the UNIX sed
command.

Ingolf

On Mon, Aug 23, 2021 at 4:24 PM Francisco Olarte <folarte@peoplecall.com>
wrote:

Ingolf:

On Mon, Aug 23, 2021 at 2:39 PM Markhof, Ingolf
<ingolf.markhof@de.verizon.com> wrote:

Yes, When I use (\1)? instead of (\1)+, the expression is evaluated

quickly, but it doesn't return what I want. Once a word is written, it is
not subject to matching again. i.e.

select regexp_replace( --> remove double entries
'one,one,one,two,two,three,three',
'([^,]+)(,\1)?($|,)',
'\1\3',
'g'
) as res;

...

Honestly, this behaviour seems to be incorrect for me. Once the system

replaces the first two 'one,one,' by a single 'one,', I'd expect to match
this replaced one 'one,' with the next 'one,' following, replacing these
two by another, single 'one,', again...

I think your expectation is misguided. All the regexp engines I've
used do it this way, when asked to match "g"lobally they do
non-overlapping matches, they do not substitute and recurse with the
modified string.

Also, your way opens the door to run-away or infinite loops (
rr('a','a','aa','g') or rr('a','a','a','g'), not to speak of
r('x','','','g') ). Even a misguided r(str, '_+','_','g'), used
sometimes to normalize space runs and similar things, can go into a
loop.

Francisco Olarte.

======================================================================

Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio