FInding "corrupt" values in UTF-8 tables (regexp question, I think)
I'm noticing that some of my data has been imported as junk text:
For instance:
klciã«"
What would be the SQL to find data of this nature? My column can only
have alphanumeric data, and the only symbols allowed are "-" and "_",
so I tried this regexp query:
select id, t_code
from traders
where t_code ~ '[^A-Za-z1-9\-]'
limit 100;
But this starts to return values such as "181xn-807199" which is valid
as per the above regexp? Also, when I try to include the underscore,
as follows...
select id, t_code
from traders
where t_code ~ '[^A-Za-z1-9\-\_]'
limit 100;
This gives me an error: "ERROR: invalid regular expression: invalid
character range".
What am I missing? Does this have something to do with erroneous
encodings? I want my data to be utf-8 but I do want to find it with
latin1 queries when the text in columns is supposed to be only latin1
characters! Or is "a-z" in utf-8 considered different from "a-z" in
latin1?
"Phoenix Kiula" <phoenix.kiula@gmail.com> writes:
select id, t_code
from traders
where t_code ~ '[^A-Za-z1-9\-\_]'
limit 100;This gives me an error: "ERROR: invalid regular expression: invalid
character range".
Put the dash at the start of the character class: [^-A-Za-z1-9_]
What am I missing?
In a character class expression the dash has an special meaning. If
you need to match a dash it has to be the first character.
Regards,
Manuel.
On Aug 17, 2007, at 10:58 , Phoenix Kiula wrote:
What would be the SQL to find data of this nature? My column can only
have alphanumeric data, and the only symbols allowed are "-" and "_",
so I tried this regexp query:select id, t_code
from traders
where t_code ~ '[^A-Za-z1-9\-]'
If you're including - in a range as a character, doesn't it have to
go first?
Try this:
WHERE t_code ~ $re$[^-A-Za-z1-9_]$re$
Michael Glaesemann
grzm seespotcode net
On Fri, 17 Aug 2007, Michael Glaesemann wrote:
On Aug 17, 2007, at 10:58 , Phoenix Kiula wrote:
What would be the SQL to find data of this nature? My column can only
have alphanumeric data, and the only symbols allowed are "-" and "_",
so I tried this regexp query:select id, t_code
from traders
where t_code ~ '[^A-Za-z1-9\-]'If you're including - in a range as a character, doesn't it have to go first?
Try this:WHERE t_code ~ $re$[^-A-Za-z1-9_]$re$
Michael Glaesemann
grzm seespotcode net
How about
WHERE t_code ~ $re$[^-A-Za-z0-9_]$re$
So that zeros are allowed?
Belinda
[Please reply to the list so that others may benefit from and
participate in the discussion.]
On Aug 17, 2007, at 12:50 , Phoenix Kiula wrote:
On 18/08/07, Michael Glaesemann <grzm@seespotcode.net> wrote:
On Aug 17, 2007, at 10:58 , Phoenix Kiula wrote:
What would be the SQL to find data of this nature? My column can
only
have alphanumeric data, and the only symbols allowed are "-" and
"_",
so I tried this regexp query:select id, t_code
from traders
where t_code ~ '[^A-Za-z1-9\-]'If you're including - in a range as a character, doesn't it have to
go first?
Try this:WHERE t_code ~ $re$[^-A-Za-z1-9_]$re$
Thanks, yes, this is sweet!
If I include this into a check constraint on the table, would that be
very resource intensive for INSERTs and UPDATEs?
Maybe. I don't know. What's very? Measure, change, and measure again.
Premature optimization and all that. :) First I make it correct.
Michael Glaesemann
grzm seespotcode net
Import Notes
Reply to msg id not found: e373d31e0708171050i7c5d3e2l9d46403b4d0da94e@mail.gmail.com
On 18/08/07, Michael Glaesemann <grzm@seespotcode.net> wrote:
[Please reply to the list so that others may benefit from and
participate in the discussion.]If you're including - in a range as a character, doesn't it have to
go first?
Try this:WHERE t_code ~ $re$[^-A-Za-z1-9_]$re$
Thanks, yes, this is sweet!
If I include this into a check constraint on the table, would that be
very resource intensive for INSERTs and UPDATEs?Maybe. I don't know. What's very? Measure, change, and measure again.
Premature optimization and all that. :) First I make it correct.
Yes, I meant to reply all. Thanks much for your help. I'll play around
with the check constraint.