FInding "corrupt" values in UTF-8 tables (regexp question, I think)

Started by Phoenix Kiulaover 18 years ago6 messagesgeneral
Jump to latest
#1Phoenix Kiula
phoenix.kiula@gmail.com

I'm noticing that some of my data has been imported as junk text:

For instance:

klciã«"

What would be the SQL to find data of this nature? My column can only
have alphanumeric data, and the only symbols allowed are "-" and "_",
so I tried this regexp query:

select id, t_code
from traders
where t_code ~ '[^A-Za-z1-9\-]'
limit 100;

But this starts to return values such as "181xn-807199" which is valid
as per the above regexp? Also, when I try to include the underscore,
as follows...

select id, t_code
from traders
where t_code ~ '[^A-Za-z1-9\-\_]'
limit 100;

This gives me an error: "ERROR: invalid regular expression: invalid
character range".

What am I missing? Does this have something to do with erroneous
encodings? I want my data to be utf-8 but I do want to find it with
latin1 queries when the text in columns is supposed to be only latin1
characters! Or is "a-z" in utf-8 considered different from "a-z" in
latin1?

#2Manuel Sugawara
masm@fciencias.unam.mx
In reply to: Phoenix Kiula (#1)
Re: FInding "corrupt" values in UTF-8 tables (regexp question, I think)

"Phoenix Kiula" <phoenix.kiula@gmail.com> writes:

select id, t_code
from traders
where t_code ~ '[^A-Za-z1-9\-\_]'
limit 100;

This gives me an error: "ERROR: invalid regular expression: invalid
character range".

Put the dash at the start of the character class: [^-A-Za-z1-9_]

What am I missing?

In a character class expression the dash has an special meaning. If
you need to match a dash it has to be the first character.

Regards,
Manuel.

#3Michael Glaesemann
grzm@seespotcode.net
In reply to: Phoenix Kiula (#1)
Re: FInding "corrupt" values in UTF-8 tables (regexp question, I think)

On Aug 17, 2007, at 10:58 , Phoenix Kiula wrote:

What would be the SQL to find data of this nature? My column can only
have alphanumeric data, and the only symbols allowed are "-" and "_",
so I tried this regexp query:

select id, t_code
from traders
where t_code ~ '[^A-Za-z1-9\-]'

If you're including - in a range as a character, doesn't it have to
go first?
Try this:

WHERE t_code ~ $re$[^-A-Za-z1-9_]$re$

Michael Glaesemann
grzm seespotcode net

#4Belinda M. Giardine
giardine@bx.psu.edu
In reply to: Michael Glaesemann (#3)
Re: FInding "corrupt" values in UTF-8 tables (regexp question, I think)

On Fri, 17 Aug 2007, Michael Glaesemann wrote:

On Aug 17, 2007, at 10:58 , Phoenix Kiula wrote:

What would be the SQL to find data of this nature? My column can only
have alphanumeric data, and the only symbols allowed are "-" and "_",
so I tried this regexp query:

select id, t_code
from traders
where t_code ~ '[^A-Za-z1-9\-]'

If you're including - in a range as a character, doesn't it have to go first?
Try this:

WHERE t_code ~ $re$[^-A-Za-z1-9_]$re$

Michael Glaesemann
grzm seespotcode net

How about
WHERE t_code ~ $re$[^-A-Za-z0-9_]$re$

So that zeros are allowed?

Belinda

#5Michael Glaesemann
grzm@seespotcode.net
In reply to: Phoenix Kiula (#1)
Re: FInding "corrupt" values in UTF-8 tables (regexp question, I think)

[Please reply to the list so that others may benefit from and
participate in the discussion.]

On Aug 17, 2007, at 12:50 , Phoenix Kiula wrote:

On 18/08/07, Michael Glaesemann <grzm@seespotcode.net> wrote:

On Aug 17, 2007, at 10:58 , Phoenix Kiula wrote:

What would be the SQL to find data of this nature? My column can
only
have alphanumeric data, and the only symbols allowed are "-" and
"_",
so I tried this regexp query:

select id, t_code
from traders
where t_code ~ '[^A-Za-z1-9\-]'

If you're including - in a range as a character, doesn't it have to
go first?
Try this:

WHERE t_code ~ $re$[^-A-Za-z1-9_]$re$

Thanks, yes, this is sweet!

If I include this into a check constraint on the table, would that be
very resource intensive for INSERTs and UPDATEs?

Maybe. I don't know. What's very? Measure, change, and measure again.
Premature optimization and all that. :) First I make it correct.

Michael Glaesemann
grzm seespotcode net

#6Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Michael Glaesemann (#5)
Re: FInding "corrupt" values in UTF-8 tables (regexp question, I think)

On 18/08/07, Michael Glaesemann <grzm@seespotcode.net> wrote:

[Please reply to the list so that others may benefit from and
participate in the discussion.]

If you're including - in a range as a character, doesn't it have to
go first?
Try this:

WHERE t_code ~ $re$[^-A-Za-z1-9_]$re$

Thanks, yes, this is sweet!

If I include this into a check constraint on the table, would that be
very resource intensive for INSERTs and UPDATEs?

Maybe. I don't know. What's very? Measure, change, and measure again.
Premature optimization and all that. :) First I make it correct.

Yes, I meant to reply all. Thanks much for your help. I'll play around
with the check constraint.