regexp_replace

Started by Andy Colsonabout 10 years ago8 messagesgeneral
Jump to latest
#1Andy Colson
andy@squeakycode.net

Hi all.

This is not doing as I'd expected:

select regexp_replace('71.09.6.01.3', '(\d)[.-](\d)', '\1\2', 'g');

regexp_replace
----------------
71096.013
(1 row)

It acts the same with dashes:
select regexp_replace('71-09-6-01-3', '(\d)[.-](\d)', '\1\2', 'g');

regexp_replace
----------------
71096-013
(1 row)

I cannot use translate because there is other text in the field. I'm
trying to strip masking characters from a parcel number in a larger text
field (for example: "the parcel 12-34-56 has caught on fire")

I seem to be missing something, any hints?

I'm on PG 9.3.9 on Slackware64.

Thanks for your time,

-Andy

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andy Colson (#1)
Re: regexp_replace

Andy Colson <andy@squeakycode.net> writes:

This is not doing as I'd expected:

select regexp_replace('71.09.6.01.3', '(\d)[.-](\d)', '\1\2', 'g');

regexp_replace
----------------
71096.013
(1 row)

I think regexp_replace considers only non-overlapping substrings,
eg, once it's replaced 1.0 with 10, it then picks up searching at
the 9 rather than starting over. The dot after 6 doesn't get
removed because the 6 can't belong to two replaceable substrings, and
it already got consumed in the process of removing the dot before 6.

I might be wrong, but I think two passes of regexp_replace would
do what you want in this example.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3John McKown
john.archie.mckown@gmail.com
In reply to: Andy Colson (#1)
Re: regexp_replace

How about:

select regexp_replace('71.09.6.01.3', '(\d)[.-]', '\1', 'g');

?

In your example, the (\d)[.-](\d) says find a digit followed by a period or
dash followed by another digit. The first time through 1.0 is matched and
replaced with 10 (710) with the "current location" pointing before the 9.
Go again and 9.6 is replaced by 96 for (71096) with the "current location"
pointing to the period! So ".0" doesn't match. (71096.0) next match is 1.3
and result is 13 ( 71096.013). If you don't want to eliminate the period or
dash unless it is _between_ two digits, try:

select regexp_replace('71.09.6.01.3', '(\d)[.-](?=\d)', '\1', 'g');

(?=\d) is a "look ahead") match which says that the period or dash must be
followed by a digit, but the expression _does not_ "consume" the digit
matched.

On Thu, Jan 14, 2016 at 1:43 PM, Andy Colson <andy@squeakycode.net> wrote:

Hi all.

This is not doing as I'd expected:

select regexp_replace('71.09.6.01.3', '(\d)[.-](\d)', '\1\2', 'g');

regexp_replace
----------------
71096.013
(1 row)

It acts the same with dashes:
select regexp_replace('71-09-6-01-3', '(\d)[.-](\d)', '\1\2', 'g');

regexp_replace
----------------
71096-013
(1 row)

I cannot use translate because there is other text in the field. I'm
trying to strip masking characters from a parcel number in a larger text
field (for example: "the parcel 12-34-56 has caught on fire")

I seem to be missing something, any hints?

I'm on PG 9.3.9 on Slackware64.

Thanks for your time,

-Andy

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

--
Werner Heisenberg is driving down the autobahn. A police officer pulls
him over. The officer says, "Excuse me, sir, do you know how fast you
were going?"
"No," replies Dr. Heisenberg, "but I know where I am."

Computer Science is the only discipline in which we view adding a new wing
to a building as being maintenance -- Jim Horning

Schrodinger's backup: The condition of any backup is unknown until a
restore is attempted.

He's about as useful as a wax frying pan.

Maranatha! <><
John McKown

#4Andy Colson
andy@squeakycode.net
In reply to: Tom Lane (#2)
Re: regexp_replace

On 1/14/2016 1:59 PM, Tom Lane wrote:

Andy Colson <andy@squeakycode.net> writes:

This is not doing as I'd expected:

select regexp_replace('71.09.6.01.3', '(\d)[.-](\d)', '\1\2', 'g');

regexp_replace
----------------
71096.013
(1 row)

I think regexp_replace considers only non-overlapping substrings,
eg, once it's replaced 1.0 with 10, it then picks up searching at
the 9 rather than starting over. The dot after 6 doesn't get
removed because the 6 can't belong to two replaceable substrings, and
it already got consumed in the process of removing the dot before 6.

I might be wrong, but I think two passes of regexp_replace would
do what you want in this example.

regards, tom lane

Ah, that would make sense, and seems to explain:

select regexp_replace('7-9-6-1-3', '(\d)[.-](\d)', '\1\2', 'g');

regexp_replace
----------------
79-61-3
(1 row)

select regexp_replace('71-09-56-01-53', '(\d)[.-](\d)', '\1\2', 'g');

regexp_replace
----------------
7109560153
(1 row)

I can work two passes in. Thanks Tom!

-Andy

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5David G. Johnston
david.g.johnston@gmail.com
In reply to: Andy Colson (#1)
Re: regexp_replace

On Thu, Jan 14, 2016 at 12:43 PM, Andy Colson <andy@squeakycode.net> wrote:

Hi all.

This is not doing as I'd expected:

select regexp_replace('71.09.6.01.3', '(\d)[.-](\d)', '\1\2', 'g');

regexp_replace
----------------
71096.013
(1 row)

​Solution: select regexp_replace('71.09.6.01.3', '(\d)[.-](?=\d)', '\1\2',
'g');

Reason: in the original the trailing "(\d)" eats ​the digit following the
symbol and then that digit is no longer available for matching the
preceding digit in the expression. IOW the same character cannot be used
to match both the first \d and the second \d so once the first \d captures
the 6 there is no \d to match before trailing period.

By using the construct (?:\d) you are zero-width (non-capturing) asserting
the the next character is a digit but you are not consuming it and so the
continuation of the global matching still has that character to match the
first \d.

David J.

#6Andy Colson
andy@squeakycode.net
In reply to: John McKown (#3)
Re: regexp_replace

On 1/14/2016 2:02 PM, John McKown wrote:

How about:

select regexp_replace('71.09.6.01.3', '(\d)[.-]', '\1', 'g');

match is 1.3 and result is 13 ( 71096.013). If you don't want to
eliminate the period or dash unless it is _between_ two digits, try:

select regexp_replace('71.09.6.01.3', '(\d)[.-](?=\d)', '\1', 'g');

(?=\d) is a "look ahead") match which says that the period or dash must
be followed by a digit, but the expression _does not_ "consume" the
digit matched.

Maranatha! <><
John McKown

Yes, excellent, both seem to work. I'll run a bunch of data through
them both and see what happens.

Thanks much for the help!

-Andy

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Andy Colson
andy@squeakycode.net
In reply to: David G. Johnston (#5)
Re: regexp_replace

On 1/14/2016 2:06 PM, David G. Johnston wrote:

select regexp_replace('71.09.6.01.3', '(\d)[.-](?=\d)', '\1\2', 'g');

Thanks David!

-Andy

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8David G. Johnston
david.g.johnston@gmail.com
In reply to: Andy Colson (#7)
Re: regexp_replace

On Thu, Jan 14, 2016 at 1:27 PM, Andy Colson <andy@squeakycode.net> wrote:

On 1/14/2016 2:06 PM, David G. Johnston wrote:

select regexp_replace('71.09.6.01.3', '(\d)[.-](?=\d)', '\1\2', 'g');

​John already picked up on the fact that the "\2" in the replacement is
pointless (neither helping nor hurting) since nothing was captured at that
position.

David J.