finding bogus UTF-8

Started by Scott Ribeabout 15 years ago10 messagesgeneral

scott_ribe@elevated-dev.com

about 15 years ago

I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such values?

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice

dennis jenkins

dennis.jenkins.75@gmail.com

about 15 years ago

In reply to: Scott Ribe (#1)

Re: finding bogus UTF-8

I'm working on a project to convert a large database form SQL_ASCII to
UTF-8. I am using this procedure:

1) pg_dump the SQL_ASCII database to an SQL text file.
2) Run through a small (efficient) C program that logs each line that
contains ANY "unclean" ASCII text.
3) Parse that log with a small perl program (hashes are easier in perl
than C) to produce a report, and emit some SQL.
4) Construct SQL update statements to "repair" the original data.
5) Repeat at step #1 until the database is clean.
6) pg_dump (SQL_ASCII) -> pg_restore -EUTF8 new database.
7) Profit!

If you are interested, I can email to you the C and Perl source.

It runs like this:

# time pg_restore /db-dumps/some_ascii_pgdump.bin | ./ascii-tester |
./bad-ascii-report.pl > unclean-ascii.rpt

real 11m11.804s
user 18m2.579s
sys 2m25.803s

# grep "^--" unclean-ascii.rpt
-- some_table 4051021
-- other_table 16

^^^ Numbers are count of rows that need cleaning.
Entire "rpt" file contains SQL comments "--" and SQL select statements
of the form:
select * from table where primary_key in (1, 2, 3, 4, ....);

The perl script contains a hash that maps table names to primary key
column IDs (to pick up when parsing the raw SQL restore "COPY"
script). I will need to purge my secret schema stuff from it before
sharing it with anyone.

My solution is probably not perfect, and probably not optimal, but it
is working great so far. I'm almost done cleaning up my database and
hope to attempt a real UTF8 restore in the near future.

Show quoted text

On Thu, Feb 10, 2011 at 1:02 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:

I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such values?

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

dennis jenkins

dennis.jenkins.75@gmail.com

about 15 years ago

In reply to: Scott Ribe (#1)

Re: finding bogus UTF-8

On Thu, Feb 10, 2011 at 1:02 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:

I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such values?

If you know which table and column the data is in, you can also do
something like this:

(I typed this up without checking the syntax of it. The basic idea is
to cast the column as bytea, encode with the 'escape' method, then
grep for back-slashes).

select * from bad_table where regexp_match (encode (bad_column::bytea,
'escape'), '\\\\'));

dennis jenkins

dennis.jenkins.75@gmail.com

about 15 years ago

In reply to: dennis jenkins (#2)

Re: finding bogus UTF-8

If you are interested, I can email to you the C and Perl source.

It runs like this:

# time pg_restore /db-dumps/some_ascii_pgdump.bin | ./ascii-tester |
./bad-ascii-report.pl > unclean-ascii.rpt

http://www.ecoligames.com/~djenkins/pgsql/

Disclaimer: I offer NO warranty. Use at your own risk. Code does
minimal error checking (its a hack / tool for manual use, not reliable
production use).

C code compiles cleanly with gcc. Perl code uses no libraries (just a
STDIN -> STDOUT processor). This code should run damn near anywhere.

The code will stay on my web server until I forget about it and re-org
stuff in a few weeks, so grab it while you can.

Glenn Maynard

glenn@zewt.org

about 15 years ago

In reply to: Scott Ribe (#1)

Re: finding bogus UTF-8

On Thu, Feb 10, 2011 at 2:02 PM, Scott Ribe <scott_ribe@elevated-dev.com>wrote:

I know that I have at least one instance of a varchar that is not valid
UTF-8, imported from a source with errors (AMA CPT files, actually) before
PG's checking was as stringent as it is today. Can anybody suggest a query
to find such values?

I hit this problem too, if I remember correctly when trying to upgrade a
database from 8.3 to 8.4. I ended up aborting the upgrade, since the
upgrade documentation made no mention of this and I didn't have time to dig
into it at the time. A tool to find all instances of this would be very
helpful.

--
Glenn Maynard

Geoffrey

lists@serioustechnology.com

about 15 years ago

In reply to: Glenn Maynard (#5)

Re: finding bogus UTF-8

Glenn Maynard wrote:

On Thu, Feb 10, 2011 at 2:02 PM, Scott Ribe <scott_ribe@elevated-dev.com
<mailto:scott_ribe@elevated-dev.com>> wrote:

I know that I have at least one instance of a varchar that is not
valid UTF-8, imported from a source with errors (AMA CPT files,
actually) before PG's checking was as stringent as it is today. Can
anybody suggest a query to find such values?

I hit this problem too, if I remember correctly when trying to upgrade a
database from 8.3 to 8.4. I ended up aborting the upgrade, since the
upgrade documentation made no mention of this and I didn't have time to
dig into it at the time. A tool to find all instances of this would be
very helpful.

I'm about to pipe the ascii output of a database dump through a perl
script that removes any unwanted characters. To help define what
'unwanted characters' are, compare the ord() values to decimal values at
http://www.asciitable.com/

while (<>)
{
$_ =~ s/(.)/((ord($1) >= 0) && (ord($1) <= 8))
|| ((ord($1) >= 11) && (ord($1) <= 31))
|| ((ord($1) >= 127)) ?"": $1/egs;
print;
}

comments would be appreciated.

--
Until later, Geoffrey

"I predict future happiness for America if they can prevent
the government from wasting the labors of the people under
the pretense of taking care of them."
- Thomas Jefferson

Vick Khera

vivek@khera.org

about 15 years ago

In reply to: Geoffrey (#6)

Re: finding bogus UTF-8

On Tue, Feb 15, 2011 at 11:09 AM, Geoffrey Myers
<lists@serioustechnology.com> wrote:

comments would be appreciated.

If all you're doing is filtering stdin to stdout and deleting a range
of characters, it seems that tr would be a faster tool:

cat foo.txt | tr -d '\000-\008\013-\037\177-\377' > foo-cleaned.txt

Marko Kreen

markokr@gmail.com

about 15 years ago

In reply to: Scott Ribe (#1)

Re: finding bogus UTF-8

On Thu, Feb 10, 2011 at 9:02 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:

I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such values?

CREATE OR REPLACE FUNCTION is_utf8(text)
RETURNS bool AS $$
try:
args[0].decode('utf8')
return True
except UnicodeDecodeError:
return False
$$ LANGUAGE plpythonu STRICT;

--
marko

Geoffrey

lists@serioustechnology.com

about 15 years ago

In reply to: Vick Khera (#7)

Re: finding bogus UTF-8

Vick Khera wrote:

On Tue, Feb 15, 2011 at 11:09 AM, Geoffrey Myers
<lists@serioustechnology.com> wrote:

comments would be appreciated.

If all you're doing is filtering stdin to stdout and deleting a range
of characters, it seems that tr would be a faster tool:

cat foo.txt | tr -d '\000-\008\013-\037\177-\377' > foo-cleaned.txt

I toyed with tr for a bit, but could not get it to work. The above did
not work for me either. Not exactly sure what it's doing, but here's a
couple of diff lines:

1619c1619
< days integer DEFAULT 28,
---

days integer DEFAULT 2,

So it appears 'tr' is deleting the '8' character, rather then the octal
value for 008.

--
Until later, Geoffrey

"I predict future happiness for America if they can prevent
the government from wasting the labors of the people under
the pretense of taking care of them."
- Thomas Jefferson

#10

Vick Khera

vivek@khera.org

about 15 years ago

In reply to: Geoffrey (#9)

Re: finding bogus UTF-8

On Tue, Feb 15, 2011 at 5:06 PM, Geoffrey Myers
<lists@serioustechnology.com> wrote:

I toyed with tr for a bit, but could not get it to work. The above did not
work for me either. Not exactly sure what it's doing, but here's a couple
of diff lines:

check your shell escaping. You may need \\ to protect the \