REGEXP_REPLACE woes

Started by Leif B. Kristensenalmost 18 years ago14 messagesgeneral
Jump to latest
#1Leif B. Kristensen
leif@solumslekt.org

I want to transform the text '[p=1242|John Smith]' to
<a href="./family.php?person=1242">John Smith</a>, but what I get is:

pgslekt=> select REGEXP_REPLACE('[p=1242|John Smith]',
pgslekt(> E'[p=(\d+)|(.+?)]',
pgslekt(> E'<a href="./family.php?person=\\1">\\2</a>');
regexp_replace
------------------------------------------------------
[<a href="./family.php?person="></a>=1242|John Smith]
(1 row)

What am I doing wrong?

PostgreSQL 8.2.7 on i686-pc-linux-gnu BTW.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/

#2Michael Fuhr
mike@fuhr.org
In reply to: Leif B. Kristensen (#1)
Re: REGEXP_REPLACE woes

On Tue, Jun 10, 2008 at 01:28:06PM +0200, Leif B. Kristensen wrote:

I want to transform the text '[p=1242|John Smith]' to
<a href="./family.php?person=1242">John Smith</a>, but what I get is:

pgslekt=> select REGEXP_REPLACE('[p=1242|John Smith]',
pgslekt(> E'[p=(\d+)|(.+?)]',
pgslekt(> E'<a href="./family.php?person=\\1">\\2</a>');
regexp_replace
------------------------------------------------------
[<a href="./family.php?person="></a>=1242|John Smith]
(1 row)

What am I doing wrong?

Parts of the regular expression need more escaping. Try this:

select regexp_replace(
'[p=1242|John Smith]',
e'\\[p=(\\d+)\\|(.+?)\\]',
e'<a href="./family.php?person=\\1">\\2</a>'
);

regexp_replace
---------------------------------------------------
<a href="./family.php?person=1242">John Smith</a>

Caution: this method doesn't do HTML entity escaping so if your
input isn't trustworthy then you could end up with HTML that's
different from what you intended.

--
Michael Fuhr

#3Leif B. Kristensen
leif@solumslekt.org
In reply to: Michael Fuhr (#2)
Re: REGEXP_REPLACE woes

On Tuesday 10. June 2008, Michael Fuhr wrote:

Parts of the regular expression need more escaping.  Try this:

select regexp_replace(
   '[p=1242|John Smith]',
  e'\\[p=(\\d+)\\|(.+?)\\]',
  e'<a href="./family.php?person=\\1">\\2</a>'
);

                  regexp_replace
---------------------------------------------------
 <a href="./family.php?person=1242">John Smith</a>

Thank you Michael, I figured it was something fishy with the escaping.
When I try your example, I get

pgslekt=> select regexp_replace(
pgslekt(>    '[p=1242|John Smith]',
pgslekt(>   e'\\[p=(\\d+)\\|(.+?)\\]',
pgslekt(>   e'<a href="./family.php?person=\\1">\\2</a>'
pgslekt(> );
ERROR: syntax error at or near " "
LINE 2:    '[p=1242|John Smith]',

But with my own doctored code, it works just fine:

pgslekt=> select REGEXP_REPLACE(E'[p=1242|John Smith]',
E'\\[p=(\\d+)\\|(.+?)\\]',
E'<a href="./family.php?person=\\1">\\2</a>');
regexp_replace
---------------------------------------------------
<a href="./family.php?person=1242">John Smith</a>
(1 row)

Caution: this method doesn't do HTML entity escaping so if your
input isn't trustworthy then you could end up with HTML that's
different from what you intended.

The input is all my own from 127.0.0.1, so it's of course totally
trustworthy :-)
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/

#4Leif B. Kristensen
leif@solumslekt.org
In reply to: Leif B. Kristensen (#3)
Re: REGEXP_REPLACE woes

I put the code into a function, link_expand():

CREATE OR REPLACE FUNCTION link_expand(TEXT) RETURNS TEXT AS $$
SELECT REGEXP_REPLACE($1,
E'\\[p=(\\d+)\\|(.+?)\\]',
E'<a href="./family.php?person=\\1">\\2</a>', 'g');
$$ LANGUAGE sql STABLE;

pgslekt=> select link_expand('[p=123|John Smith]');
link_expand
--------------------------------------------------
<a href="./family.php?person=123">John Smith</a>
(1 row)

So far, so good. But look here:

pgslekt=> select link_expand('[p=123|John Smith] and [p=456|Jane Doe]');
link_expand
-----------------------------------------------------------------------
<a href="./family.php?person=123">John Smith] and [p=456|Jane Doe</a>
(1 row)

Hey, I told it not to be greedy, didn't I?
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/

#5Michael Fuhr
mike@fuhr.org
In reply to: Leif B. Kristensen (#3)
Re: REGEXP_REPLACE woes

On Tue, Jun 10, 2008 at 02:25:44PM +0200, Leif B. Kristensen wrote:

Thank you Michael, I figured it was something fishy with the escaping.
When I try your example, I get

pgslekt=> select regexp_replace(
pgslekt(> � �'[p=1242|John Smith]',
pgslekt(> � e'\\[p=(\\d+)\\|(.+?)\\]',
pgslekt(> � e'<a href="./family.php?person=\\1">\\2</a>'
pgslekt(> );
ERROR: syntax error at or near "�"
LINE 2: � �'[p=1242|John Smith]',

Something between my message and your shell appears to have converted
a few spaces to no-break spaces. A hex dump of your query shows the
following:

00000000 73 65 6c 65 63 74 20 72 65 67 65 78 70 5f 72 65 |select regexp_re|
00000010 70 6c 61 63 65 28 0a c2 a0 20 c2 a0 27 5b 70 3d |place(.� �'[p=|

Notice the byte sequences "c2 a0", which is the UTF-8 encoding of
<U+00A0 NO-BREAK SPACE>. Apparently psql doesn't like that. I don't
see that sequence in my original message:

00000000 73 65 6c 65 63 74 20 72 65 67 65 78 70 5f 72 65 |select regexp_re|
00000010 70 6c 61 63 65 28 0a 20 20 20 27 5b 70 3d 31 32 |place(. '[p=12|

--
Michael Fuhr

#6Michael Fuhr
mike@fuhr.org
In reply to: Leif B. Kristensen (#4)
Re: REGEXP_REPLACE woes

On Tue, Jun 10, 2008 at 02:59:53PM +0200, Leif B. Kristensen wrote:

So far, so good. But look here:

pgslekt=> select link_expand('[p=123|John Smith] and [p=456|Jane Doe]');
link_expand
-----------------------------------------------------------------------
<a href="./family.php?person=123">John Smith] and [p=456|Jane Doe</a>
(1 row)

Hey, I told it not to be greedy, didn't I?

Yes, but regexp_replace only replaces that part of the original
string that matches the regular expression -- the rest it leaves
alone.

--
Michael Fuhr

#7Leif B. Kristensen
leif@solumslekt.org
In reply to: Leif B. Kristensen (#4)
Re: REGEXP_REPLACE woes

On Tuesday 10. June 2008, Leif B. Kristensen wrote:

Hey, I told it not to be greedy, didn't I?

Found it. I must make *both* atoms non-greedy:

pgslekt=> CREATE OR REPLACE FUNCTION link_expand(TEXT) RETURNS TEXT AS
$$
SELECT REGEXP_REPLACE($1,
E'\\[p=(\\d+?)\\|(.+?)\\]',
E'<a href="./family.php?person=\\1">\\2</a>', 'g');
$$ LANGUAGE sql STABLE;
CREATE FUNCTION
pgslekt=> select link_expand('[p=123|John Smith] and [p=456|Jane Doe]');
link_expand
-----------------------------------------------------------------------------------------------------
<a href="./family.php?person=123">John Smith</a> and <a
href="./family.php?person=456">Jane Doe</a>
(1 row)

It's not totally intuitive, but at least now it works.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/

#8CaT
cat@zip.com.au
In reply to: Leif B. Kristensen (#7)
Re: REGEXP_REPLACE woes

On Tue, Jun 10, 2008 at 03:43:02PM +0200, Leif B. Kristensen wrote:

On Tuesday 10. June 2008, Leif B. Kristensen wrote:

Hey, I told it not to be greedy, didn't I?

Found it. I must make *both* atoms non-greedy:

That makes no sense. Take this bit of perl, which works as expected:

$str = '[p=123|John Smith] and [p=456|Jane Doe]';

print "before: '$str'\n";

$str =~ s#\[p=(\d+)\|(.+?)\]#1 = $1, 2 = $2 rest of str = #;

print "after: '$str'\n";

There's a bug in your version of pcre I think as postgres would have
little to do with the regex itself (I be guessing).

--
"Police noticed some rustling sounds from Linn's bottom area
and on closer inspection a roll of cash was found protruding
from Linn's anus, the full amount of cash taken in the robbery."
- http://www.smh.com.au/news/world/robber-hides-loot-up-his-booty/2008/05/09/1210131248617.html

#9Michael Fuhr
mike@fuhr.org
In reply to: Michael Fuhr (#6)
Re: REGEXP_REPLACE woes

On Tue, Jun 10, 2008 at 07:41:53AM -0600, Michael Fuhr wrote:

On Tue, Jun 10, 2008 at 02:59:53PM +0200, Leif B. Kristensen wrote:

So far, so good. But look here:

pgslekt=> select link_expand('[p=123|John Smith] and [p=456|Jane Doe]');
link_expand
-----------------------------------------------------------------------
<a href="./family.php?person=123">John Smith] and [p=456|Jane Doe</a>
(1 row)

Hey, I told it not to be greedy, didn't I?

Yes, but regexp_replace only replaces that part of the original
string that matches the regular expression -- the rest it leaves
alone.

Sorry, this isn't quite right. As you already discovered, the
pattern was being more greedy than you wanted. That's one reason
why I often use an inverted class instead of assuming that a
non-greedy quantifier will grab only what I want.

select regexp_replace(
'[p=123|John Smith] and [p=456|Jane Doe]',
E'\\[p=(\\d+)\\|([^]]+)\\]',
E'<a href="./family.php?person=\\1">\\2</a>',
'g'
);

regexp_replace
-----------------------------------------------------------------------------------------------------
<a href="./family.php?person=123">John Smith</a> and <a href="./family.php?person=456">Jane Doe</a>

--
Michael Fuhr

#10Leif B. Kristensen
leif@solumslekt.org
In reply to: CaT (#8)
Re: REGEXP_REPLACE woes

On Tuesday 10. June 2008, CaT wrote:

On Tue, Jun 10, 2008 at 03:43:02PM +0200, Leif B. Kristensen wrote:

On Tuesday 10. June 2008, Leif B. Kristensen wrote:

Hey, I told it not to be greedy, didn't I?

Found it. I must make *both* atoms non-greedy:

That makes no sense. Take this bit of perl, which works as expected:

$str = '[p=123|John Smith] and [p=456|Jane Doe]';

print "before: '$str'\n";

$str =~ s#\[p=(\d+)\|(.+?)\]#1 = $1, 2 = $2 rest of str = #;

print "after: '$str'\n";

There's a bug in your version of pcre I think as postgres would have
little to do with the regex itself (I be guessing).

I found this obscure reference in the Pg docs
<http://www.postgresql.org/docs/8.2/interactive/functions-matching.html&gt;

"In short, when an RE contains both greedy and non-greedy
subexpressions, the total match length is either as long as possible or
as short as possible, according to the attribute assigned to the whole
RE. The attributes assigned to the subexpressions only affect how much
of that match they are allowed to "eat" relative to each other."

And it was what made me try the above approach. I agree that it doesn't
make much sense.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: CaT (#8)
Re: REGEXP_REPLACE woes

CaT <cat@zip.com.au> writes:

There's a bug in your version of pcre I think as postgres would have
little to do with the regex itself (I be guessing).

You be wrong ... PG uses Tcl's regex engine, not pcre, and this behavior
is as documented. No, I don't know why Henry Spencer chose to do it
that way, but he's certainly forgotten more about REs than the rest of
us will ever know.

regards, tom lane

#12Leif B. Kristensen
leif@solumslekt.org
In reply to: Michael Fuhr (#5)
Re: REGEXP_REPLACE woes

On Tuesday 10. June 2008, Michael Fuhr wrote:

Something between my message and your shell appears to have converted
a few spaces to no-break spaces. A hex dump of your query shows the
following:

00000000 73 65 6c 65 63 74 20 72 65 67 65 78 70 5f 72 65 |select
regexp_re| 00000010 70 6c 61 63 65 28 0a c2 a0 20 c2 a0 27 5b 70 3d
|place(.   '[p=|

Notice the byte sequences "c2 a0", which is the UTF-8 encoding of
<U+00A0 NO-BREAK SPACE>. Apparently psql doesn't like that. I don't
see that sequence in my original message:

It's probably a KMail bug, or more likely a Qt or KDE library bug. IIRC
there was a similar bug in KNode some years ago when i tried to copy
and paste some Python code from a news discussion.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/

#13Leif B. Kristensen
leif@solumslekt.org
In reply to: Leif B. Kristensen (#7)
Re: REGEXP_REPLACE woes

For the record: I've got two different flavors of those "shortlinks".
The first one, [p=123|John Smith] is the one that I started this thread
with. The second one is just a person number like [p=123] and should be
expanded to a similar link, with the default person name (fetched by
get_person_name(123)) inserted.

Here's my full function that expands both kinds of shortlinks:

CREATE OR REPLACE FUNCTION link_expand(TEXT) RETURNS TEXT AS $$
DECLARE
str TEXT;
tmp TEXT;
name TEXT;
p INTEGER;
BEGIN
-- the easy part: replace [p=xxx|yyy] with full link
str := REGEXP_REPLACE($1,
E'\\[p=(\\d+?)\\|(.+?)\\]',
E'<a href="./family.php?person=\\1">\\2</a>', 'g');
-- the hard part: replace [p=xxx] with full link
WHILE str SIMILAR TO E'%\\[p=\\d+\\]%' LOOP
str := REGEXP_REPLACE(str,
E'\\[p=(\\d+?)\\]',
E'<a href="./family.php?person=\\1">#\\1#</a>');
tmp := SUBSTRING(str, E'#\\d+?#');
p := BTRIM(tmp, '#')::INTEGER;
name := get_person_name(p);
str := REPLACE(str, tmp, name);
END LOOP;
RETURN str;
END
$$ LANGUAGE plpgsql STABLE;

I still think that "the hard part" is a bit ugly, though.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/

#14Leif B. Kristensen
leif@solumslekt.org
In reply to: Leif B. Kristensen (#13)
Re: REGEXP_REPLACE woes

On Wednesday 11. June 2008, Leif B. Kristensen wrote:

p := BTRIM(tmp, '#')::INTEGER;
name := get_person_name(p);
str := REPLACE(str, tmp, name);

I did some "folding" and replaced the above with

str := REPLACE(str, tmp, get_person_name(BTRIM(tmp, '#')::INTEGER));

and got rid of two variables.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/