Why this regexp matches?!

Started by hubert depesz lubaczewski, about 14 years ago | 11 messages | general

select 'depesz depeszx depesz' ~ E'^(.*)( \\1)+$';

what's worse:
$ select regexp_replace( 'depesz depeszx depesz', E'^(.*)( \\1)+$', E'\\1' );
regexp_replace
────────────────
depesz
(1 row)

I know that Pg regexps are limited, but even grep's regexps match this
correctly:

=$ printf 'depesz depesz depesz\ndepesz depeszx depesz\n' | grep -E '^(.*)( \1)+$';
depesz depesz depesz

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
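
For comparison, here is a minimal Python sketch of the behaviour the post expects from a backtracking engine. It is not part of the original thread; the helper name `repeated_word` is made up for illustration, and Python's `re` module stands in for grep.

```python
import re

# Same pattern as in the post. The raw string keeps \1 as a backreference.
PATTERN = re.compile(r'^(.*)( \1)+$')

def repeated_word(text):
    """Return the repeated unit if text looks like 'X X ... X', else None."""
    m = PATTERN.match(text)
    return m.group(1) if m else None

print(repeated_word('depesz depesz depesz'))   # depesz
print(repeated_word('depesz depeszx depesz'))  # None
```

The second string is rejected because no choice of the first group lets ` \1` repeat all the way to the end of the string.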

#2 Szymon Guz
mabewlun@gmail.com
In reply to: hubert depesz lubaczewski (#1)
Re: Why this regexp matches?!

Hi,
some time ago I hit the same problem; the solution, however, was a little bit
tricky. I didn't have time to investigate it, but this works:

postgres@postgres:5840=# select regexp_replace( 'depesz depeszx depesz',
E'^(.*)( \\\\1)+$', E'\\\\1' );
regexp_replace
-----------------------
depesz depeszx depesz
(1 row)

regards
Szymon
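
A note on the escaping in the query above, sketched in Python rather than SQL. Inside an `E''` string each `\\` becomes a single backslash, so `E'\\\\1'` reaches the regex engine as `\\1`: a literal backslash followed by the digit 1, not a backreference. The sketch assumes Python's `re` treats these two patterns the same way, which is true for this simple case.

```python
import re

text = 'depesz depeszx depesz'

# What the regex engine sees after string-literal processing:
#   E'^(.*)( \\1)+$'    ->  ^(.*)( \1)+$    (\1 is a backreference)
#   E'^(.*)( \\\\1)+$'  ->  ^(.*)( \\1)+$   (literal backslash, then "1")
backref = r'^(.*)( \1)+$'
literal = r'^(.*)( \\1)+$'

print(re.search(backref, text))   # None
print(re.search(literal, text))   # None: the text contains no backslash

# The "literal" pattern only matches text that really contains "\1".
print(re.search(literal, 'a \\1') is not None)   # True

# With no match, a substitution returns the input unchanged.
unchanged = re.sub(literal, r'\\1', text)
print(unchanged == text)   # True
```

So the unchanged output is what a non-matching pattern produces, rather than evidence that the backreference version has been fixed.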

#3 hubert depesz lubaczewski
In reply to: Szymon Guz (#2)
Re: Why this regexp matches?!

I'm not sure I understand your point.

This regexp was meant to find repeated substrings.

Like this one does in Perl:

/^(.*)( \1)+$/

We can see how it works with:
=$ perl -e 'if ( shift =~ m/^(.*)( \1)+$/ ) { print "is repeat of [$1]\n" } else {print "is not repeated\n"}' 'depesz depesz depesz'
is repeat of [depesz]

=$ perl -e 'if ( shift =~ m/^(.*)( \1)+$/ ) { print "is repeat of [$1]\n" } else {print "is not repeated\n"}' 'depesz depeszx depesz'
is not repeated

The reason why your regexp matches is also a mystery to me.

Best regards,

depesz


#4 David G. Johnston
david.g.johnston@gmail.com
In reply to: hubert depesz lubaczewski (#3)
Re: Why this regexp matches?!

I don't know the answer (if there is one other than 'it's a bug'), but as a workaround you can split the string on whitespace, group the resulting words, and check whether more than one group results...

David J.
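
The workaround above can be sketched as follows, in Python rather than SQL. The function name `is_repeat` is hypothetical, and the sketch assumes the repeated unit is a single whitespace-free word.

```python
def is_repeat(text):
    """True if text is one word repeated at least twice, separated by whitespace."""
    words = text.split()
    # Grouping the words: exactly one distinct group means every word is the same.
    return len(words) > 1 and len(set(words)) == 1

print(is_repeat('depesz depesz depesz'))   # True
print(is_repeat('depesz depeszx depesz'))  # False
```

Unlike the regexp, this does not recognise repeated units that themselves contain spaces, such as 'a b a b'.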

#5 Alban Hertroys
haramrae@gmail.com
In reply to: hubert depesz lubaczewski (#1)
Re: Why this regexp matches?!

On 4 Feb 2012, at 9:46, hubert depesz lubaczewski wrote:

> select 'depesz depeszx depesz' ~ E'^(.*)( \\1)+$';

Peculiar.

It's probably no use to you, but a version where the repetition is expanded (for that particular string) works:

select 'depesz depeszx depesz' ~ E'^(.*)( \\1)( \\1)$';

And this works too:

select 'depesz depeszx depesz' ~ E'^(depesz)( \\1)+$';

Apparently something odd is going on between the wildcard, the repetitive part and the back-reference. That could be just us not seeing what's wrong with the expression or be an actual bug.
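
As a cross-check outside PostgreSQL, the three variants discussed above can be run through Python's `re` module. This is only a sketch: the dictionary keys are labels invented here, and Python's engine is a stand-in for "what a backtracking engine does", not a statement about PostgreSQL internals.

```python
import re

text = 'depesz depeszx depesz'

variants = {
    'original': r'^(.*)( \1)+$',        # wildcard + repetition + backreference
    'expanded': r'^(.*)( \1)( \1)$',    # repetition written out by hand
    'literal':  r'^(depesz)( \1)+$',    # wildcard replaced by the literal word
}

# True means the pattern matched the text.
results = {name: re.search(p, text) is not None for name, p in variants.items()}
print(results)  # {'original': False, 'expanded': False, 'literal': False}
```

All three reject the string here, which is the result the expanded and literal variants also give in the report above.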

> I know that Pg regexps are limited, but even grep's regexps match this

Limited? They're really not. According to the docs they are beyond POSIX compliant, even including several extensions as they appear in, among others, Perl. That said, the docs do mention a known limitation with braces and forward-references - maybe this is related.

Alban Hertroys

--
The scale of a problem often equals the size of an ego.

#6 hubert depesz lubaczewski
In reply to: Alban Hertroys (#5)
Re: Why this regexp matches?!

On Sat, Feb 04, 2012 at 07:31:25PM +0100, Alban Hertroys wrote:

>> I know that Pg regexps are limited, but even grep's regexps match this
>
> Limited? They're really not. According to the docs they are beyond
> POSIX compliant, even including several extensions as they appear in,
> among others, Perl. That said, the docs do mention a known limitation
> with braces and forward-references - maybe this is related.

Limited, because (for example) Pg regexps are the only regexp flavour
I know of in which you can't have both greedy and non-greedy operators in
the same expression.

depesz


#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: hubert depesz lubaczewski (#6)
Re: Why this regexp matches?!

hubert depesz lubaczewski <depesz@depesz.com> writes:

> On Sat, Feb 04, 2012 at 07:31:25PM +0100, Alban Hertroys wrote:
>> Limited? They're really not.
>
> Limited - because (for example) Pg regexps, are the only regexp flavour
> that I know that you can't have both greedy and non-greedy operators in
> the same expression.

Huh? Sure you can.

The engine's rules for combining greedy and non-greedy behavior might be
a bit different from Perl's, but that doesn't make it "limited". It
just means it has different idiosyncrasies from Perl's engine. I do not
accept the proposition that Perl's regexps are perfect and everybody
else's are wrong to the extent that they act differently from Perl's.

As for the specific behavior at hand, it does look like a bug from here,
but I don't have time to poke at it right now.

regards, tom lane
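
To illustrate the kind of difference in greedy/non-greedy rules mentioned above, here is a small Python sketch. The pattern pair is, as far as I recall, the one used in the PostgreSQL manual's section on regular expression matching rules, where the results are given as 123 and 1; treat that recollection as an assumption and check the manual. Python's `re`, which follows Perl-style per-quantifier greediness, gives a different answer for the second pattern.

```python
import re

text = 'XY1234Z'

# All quantifiers greedy.
greedy = re.search(r'Y*([0-9]{1,3})', text).group(1)

# Non-greedy Y*? followed by a greedy {1,3}. In Perl-style engines each
# quantifier keeps its own greediness, so the digits are still matched greedily.
mixed = re.search(r'Y*?([0-9]{1,3})', text).group(1)

print(greedy)  # 123
print(mixed)   # 123
```

In PostgreSQL the greediness of the first quantified atom affects the expression as a whole, so the same pattern can capture fewer digits there. That is a different rule, not a missing feature.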

#8 hubert depesz lubaczewski
In reply to: Tom Lane (#7)
Re: Why this regexp matches?!

On Sat, Feb 04, 2012 at 03:27:53PM -0500, Tom Lane wrote:

> hubert depesz lubaczewski <depesz@depesz.com> writes:
>> that I know that you can't have both greedy and non-greedy operators in
>> the same expression.
>
> Huh? Sure you can.

I wrote about it a year ago:

http://archives.postgresql.org/pgsql-general/2010-01/msg00067.php

Just tested, and it behaves the same way in 9.2devel.

Best regards,

depesz


#9 Jasen Betts
jasen@xnet.co.nz
In reply to: hubert depesz lubaczewski (#1)
Re: Why this regexp matches?!

On 2012-02-04, hubert depesz lubaczewski <depesz@depesz.com> wrote:

> I know that Pg regexps are limited, but even grep's regexps match this
> correctly:

Whose grep?

Postgres is BSD-licensed, and that means it can't use the latest and
greatest GPL libraries.

--
⚂⚃ 100% natural

#10 hubert depesz lubaczewski
In reply to: Jasen Betts (#9)
Re: Why this regexp matches?!

On Mon, Feb 06, 2012 at 11:29:23AM +0000, Jasen Betts wrote:

> whose grep?
>
> Postgres is BSD licence and that means they can't use the latest and
> greatest GPL libraries.

Yes, I did use GNU grep, but it's hardly "latest and greatest". There
is nothing very special about this regexp, aside from the fact that,
according to the Pg docs (as I read them), it shouldn't match, but it
does.

depesz


#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alban Hertroys (#5)
Re: Why this regexp matches?!

Alban Hertroys <haramrae@gmail.com> writes:

> On 4 Feb 2012, at 9:46, hubert depesz lubaczewski wrote:
>> select 'depesz depeszx depesz' ~ E'^(.*)( \\1)+$';
>
> Apparently something odd is going on between the wildcard, the repetitive part and the back-reference. That could be just us not seeing what's wrong with the expression or be an actual bug.

FYI, I've made some progress on characterizing the cause of this bug,
as per comments at the upstream bug report:
https://sourceforge.net/tracker/index.php?func=detail&aid=1115587&group_id=10894&atid=110894
There are actually two distinct bugs involved, and I don't yet have a
patch for the case depesz illustrates.

regards, tom lane