regex help wanted

Started by Karsten Hilbert, almost 13 years ago, 8 messages, general
#1 Karsten Hilbert
Karsten.Hilbert@gmx.net

Hi,

I am in the process of converting some TEXT data, parts of which
I am trying to identify with a regular expression.

What I don't understand is: Why does the following return a
substring ?

select substring ('junk $<allergy::test::99>$ junk' from '\$<[^<]+?::[^:]+?>\$');

I would have thought the '::[^:]+?>' part should have meant

after two ":"s
match at least one character
except any further ":"s
until the next ">"

I can't find the flaw in my thinking. Can anyone help?

(Admittedly this is not PostgreSQL-specific, but I need to run it
in PostgreSQL during a data migration.)

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Karsten Hilbert (#1)
Re: regex help wanted

Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

What I don't understand is: Why does the following return a
substring ?

select substring ('junk $<allergy::test::99>$ junk' from '\$<[^<]+?::[^:]+?>\$');

There's a perfectly valid match in which [^<]+? matches allergy::test
and [^:]+? matches 99.
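
To see which part of the pattern ends up matching what, one can wrap the two quantified pieces in capture groups. This is only a sketch: the parentheses are added for illustration and were not in the original pattern, and it assumes standard_conforming_strings is on (the default in current releases) so that the backslashes reach the regex engine unchanged.

```sql
-- Capture groups added for illustration only.
-- Group 1 is what [^<]+? matched, group 2 is what [^:]+? matched.
select regexp_matches(
    'junk $<allergy::test::99>$ junk',
    '\$<([^<]+?)::([^:]+?)>\$'
);
-- expected: {allergy::test,99}
```

Nothing in [^<] forbids ":", so the first group is free to swallow "allergy::test" when that is the only way for the rest of the pattern to succeed.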

regards, tom lane


#3 Thom Brown
thom@linux.com
In reply to: Tom Lane (#2)
Re: regex help wanted

On 25 April 2013 15:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

What I don't understand is: Why does the following return a
substring ?

select substring ('junk $<allergy::test::99>$ junk' from '\$<[^<]+?::[^:]+?>\$');

There's a perfectly valid match in which [^<]+? matches allergy::test
and [^:]+? matches 99.

Yeah, I think the assumption here is that a lazy quantifier stops at
the first opportunity and, if the remainder then fails to match, the
whole match fails for good. It doesn't: the engine backtracks and
extends the lazy quantifier until the rest of the expression can match.

--
Thom


#4 Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Tom Lane (#2)
Re: regex help wanted

On Thu, Apr 25, 2013 at 10:32:26AM -0400, Tom Lane wrote:

Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

What I don't understand is: Why does the following return a
substring ?

select substring ('junk $<allergy::test::99>$ junk' from '\$<[^<]+?::[^:]+?>\$');

There's a perfectly valid match in which [^<]+? matches allergy::test
and [^:]+? matches 99.

Tom, thanks for helping !

I would have thought "<[^<]+?:" should mean:

match a "<"
followed by 1-n characters as long as they are not "<"
until the VERY NEXT ":"

The "?" should make the "+" after "[^<]" non-greedy and thus
stop at the first occurrence of ":", right? Or am I
misunderstanding that part?

At any rate,

select substring ('junk $<allergy::test::99>$ junk' from '\$<[^<:]+?::[^:]+?>\$');

(which follows from your hint) appears to do what I need.
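
A quick way to check the effect of excluding ":" from the first character class is to run the corrected pattern against both a three-part and a two-part value. This is a sketch only: the two-part input string is made up for illustration and does not come from the data being migrated, and standard_conforming_strings is assumed to be on.

```sql
-- ':' excluded from the first class: the three-part string no longer matches.
select substring('junk $<allergy::test::99>$ junk' from '\$<[^<:]+?::[^:]+?>\$');
-- expected: NULL

-- Hypothetical two-part input, not taken from the thread: this one still matches.
select substring('junk $<allergy::test>$ junk' from '\$<[^<:]+?::[^:]+?>\$');
-- expected: $<allergy::test>$
```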

Thanks,
Karsten


#5 Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Thom Brown (#3)
Re: regex help wanted

On Thu, Apr 25, 2013 at 03:40:51PM +0100, Thom Brown wrote:

On 25 April 2013 15:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

What I don't understand is: Why does the following return a
substring ?

select substring ('junk $<allergy::test::99>$ junk' from '\$<[^<]+?::[^:]+?>\$');

There's a perfectly valid match in which [^<]+? matches allergy::test
and [^:]+? matches 99.

Yeah, I think there may be an assumption that a lazy quantifier will
stop short and cause the remainder to fail to match permanently, but
it will backtrack, forcing the lazy quantifier to expand until it can
match the expression.

Yup, therein lies the rub :-)

Thanks,
Karsten


#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Karsten Hilbert (#4)
Re: regex help wanted

Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

I would have thought "<[^<]+?:" should mean:

match a "<"
followed by 1-n characters as long as they are not "<"
until the VERY NEXT ":"

The "?" should make the "+" after "[^<]" non-greedy and thus
stop at the first occurrence of ":", right ? Or am I
misunderstanding that part ?

No, non-greedy just means that if there are multiple ways to make the
pattern match the string, prefer the way that makes this sub-match the
shortest (whereas the default makes leftmost sub-matches longest).
If you don't want the char class to match : then you need to say that
explicitly.
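
A small sketch of the difference, using a made-up string ('a::b::c') and made-up patterns rather than anything from the original data. Because the patterns contain parentheses, substring() returns what the first parenthesized group matched.

```sql
-- Non-greedy: among all ways to match, prefer the shortest sub-match.
select substring('a::b::c' from '^(.+?)::');    -- expected: a

-- Greedy (the default): prefer the longest sub-match.
select substring('a::b::c' from '^(.+)::');     -- expected: a::b

-- Excluding ':' from the class removes the choice altogether.
select substring('a::b::c' from '^([^:]+)::');  -- expected: a
```

In the first two queries both answers are valid matches and the quantifier only selects between them. In the third, the character class itself rules out the longer alternative.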

BTW, I'm fairly sure that unless you are doing something that extracts
or replaces sub-matches, there is no value whatever in marking
quantifiers non-greedy; that just complicates life for the regex
compiler. A match is a match, if you're not paying attention to
where the subpattern boundaries are.
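
For the string from the original question this can be checked directly. The sketch below compares the posted pattern with a variant in which both "?" markers are dropped (the variant is written here for illustration). It assumes standard_conforming_strings is on.

```sql
-- Same input, once with lazy and once with greedy quantifiers.
select substring('junk $<allergy::test::99>$ junk' from '\$<[^<]+?::[^:]+?>\$') as lazy,
       substring('junk $<allergy::test::99>$ junk' from '\$<[^<]+::[^:]+>\$')   as greedy;
-- expected: both columns contain $<allergy::test::99>$
```

There is only one way for either pattern to match this input, so the greediness markers change nothing about the overall result.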

regards, tom lane


#7 Jasen Betts
jasen@xnet.co.nz
In reply to: Karsten Hilbert (#1)
Re: regex help wanted

On 2013-04-25, Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:

On Thu, Apr 25, 2013 at 10:32:26AM -0400, Tom Lane wrote:

Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

What I don't understand is: Why does the following return a
substring ?

select substring ('junk $<allergy::test::99>$ junk' from '\$<[^<]+?::[^:]+?>\$');

There's a perfectly valid match in which [^<]+? matches allergy::test
and [^:]+? matches 99.

Tom, thanks for helping !

I would have thought "<[^<]+?:" should mean:

match a "<"
followed by 1-n characters as long as they are not "<"
until the VERY NEXT ":"

If you want that, say: "<[^<:]+:"

The "?" should make the "+" after "[^<]" non-greedy and thus
stop at the first occurrence of ":", right ? Or am I
misunderstanding that part ?

From "the fine manual":

Non-greedy quantifiers (available in AREs only) match the same
possibilities as their corresponding normal (greedy) counterparts, but
prefer the smallest number rather than the largest number of matches.
See Section 9.7.3.5 for more detail.

--
⚂⚃ 100% natural


#8 Matthew Byrne
matt@byrney.com
In reply to: Jasen Betts (#7)
Re: regex help wanted

On 2013-04-25, Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:

On Thu, Apr 25, 2013 at 10:32:26AM -0400, Tom Lane wrote:

Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

What I don't understand is: Why does the following return a
substring ?

select substring ('junk $<allergy::test::99>$ junk' from '\$<[^<]+?::[^:]+?>\$');

There's a perfectly valid match in which [^<]+? matches allergy::test
and [^:]+? matches 99.

Tom, thanks for helping !

I would have thought "<[^<]+?:" should mean:

match a "<"
followed by 1-n characters as long as they are not "<"
until the VERY NEXT ":"

If you want that, say: "<[^<:]+:"

The "?" should make the "+" after "[^<]" non-greedy and thus
stop at the first occurrence of ":", right ? Or am I
misunderstanding that part ?

Greediness and non-greediness of operators are like hints - they are only
honoured if there is a choice in the matter. In your case, if the
<[^<]+?: stopped at the first ":", it would be impossible to match the
rest of the pattern.
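
To illustrate the "only if there is a choice" part, here is a sketch with a deliberately loosened pattern. Replacing the second class with ".+?" and adding capture groups are both changes made for this example only; neither was in the original pattern. With that change the rest of the pattern can still succeed after the first "::", so the lazy quantifier does get to stop there.

```sql
-- Second class loosened to '.' so that stopping at the first '::' is viable.
select regexp_matches(
    'junk $<allergy::test::99>$ junk',
    '\$<([^<]+?)::(.+?)>\$'
);
-- expected: {allergy,test::99}
```

With the original second class [^:]+? the same split is impossible, because "test::99" contains ":", which is why the first group had to grow to "allergy::test".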
