Eliminating bad characters from a database for upgrading from 7.4 to 8.1

Started by Gregory S. Williamsonover 19 years ago10 messagesgeneral
Jump to latest
#1Gregory S. Williamson
gsw@globexplorer.com

Dear list,

I have been banging my head against a problem for a few days now, and although I am making progress it is painfully slow, and I am hoping that some one out there can steer me in a better way.

I've got a medium size database (7.4), about 64 gigs of data, about 1/3rd of which is in one table, which has ~32 million rows (22 gigs when dumped). This largish table has about 20 different columns of varchar or text.

There are some records that have illegal characters in them, according to postgres 8.1.5, which imposes stricter standards on UTF encoding.

I've been using copy to dump the big table to disk, then try to load it into my new table. When it fails, I use split to break the file into managable chunks and then use ""vi" to find the offending line, then figure out the column. Then I use something like:

create table bad_char_gids as select gid from parcels where position('Ñ' in s_street) > 0;

And so create a table with the ids of the bad records; and then use replace to either replace or eliminate the offending characters from that column. This example got 5001 records, but often it is one record in the whole DB will have some other offending character. I fix the problem in the loaddata as well, and continue.

The problem is that there are errors in quite a few of the columns (but only a few tens of thousands of records), and the offending characters are all quite different (wierd diacritics and characters, upper and lower case). And so this is a very slow process.

Is there any way to get a list of records, even if done repeatedly for each column, that would let me find the offending records in 7.4 which have any invalid UTF chars? I am feeling stupid for not seeing one ... I can find any individual bad character, but I want to find them all at once, if possible.

TIA,

Greg Williamson
DBA
GlobeXplorer LLC

#2Russell Smith
mr-russ@pws.com.au
In reply to: Gregory S. Williamson (#1)
Re: Eliminating bad characters from a database for upgrading

Gregory S. Williamson wrote:

Dear list,

I have been banging my head against a problem for a few days now, and although I am making progress it is painfully slow, and I am hoping that some one out there can steer me in a better way.

I've got a medium size database (7.4), about 64 gigs of data, about 1/3rd of which is in one table, which has ~32 million rows (22 gigs when dumped). This largish table has about 20 different columns of varchar or text.

There are some records that have illegal characters in them, according to postgres 8.1.5, which imposes stricter standards on UTF encoding.

I've been using copy to dump the big table to disk, then try to load it into my new table. When it fails, I use split to break the file into managable chunks and then use ""vi" to find the offending line, then figure out the column. Then I use something like:

create table bad_char_gids as select gid from parcels where position('�' in s_street) > 0;

And so create a table with the ids of the bad records; and then use replace to either replace or eliminate the offending characters from that column. This example got 5001 records, but often it is one record in the whole DB will have some other offending character. I fix the problem in the loaddata as well, and continue.

The problem is that there are errors in quite a few of the columns (but only a few tens of thousands of records), and the offending characters are all quite different (wierd diacritics and characters, upper and lower case). And so this is a very slow process.

Is there any way to get a list of records, even if done repeatedly for each column, that would let me find the offending records in 7.4 which have any invalid UTF chars? I am feeling stupid for not seeing one ... I can find any individual bad character, but I want to find them all at once, if possible.

Try converting the dump files encoding to UTF-8. before 8.1 you could
insert invalid characters into the DB because it accepted other
encodings. It will also dump other encoding. For example, converting
something with windows characters in it.

iconv -f "WINDOWS-1251" -t "UTF-8" dump_file > converted_dump_file

And import the converted file. you may need to try a couple of
different input encodings if you aren't sure what encoding was used when
inserting data into the DB.

Russell.

Show quoted text

TIA,

Greg Williamson
DBA
GlobeXplorer LLC

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

#3Gregory S. Williamson
gsw@globexplorer.com
In reply to: Gregory S. Williamson (#1)
Re: Eliminating bad characters from a database for upgrading from 7.4 to 8.1

Thanks for the suggestion ... since the data involved came from different source, I suspect there may be more than one encoding, but this has great promise.

Greg

-----Original Message-----
From: Russell Smith [mailto:mr-russ@pws.com.au]
Sent: Thu 11/16/2006 7:27 PM
To: Gregory S. Williamson
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Eliminating bad characters from a database for upgrading from 7.4 to 8.1

Gregory S. Williamson wrote:

Dear list,

I have been banging my head against a problem for a few days now, and although I am making progress it is painfully slow, and I am hoping that some one out there can steer me in a better way.

I've got a medium size database (7.4), about 64 gigs of data, about 1/3rd of which is in one table, which has ~32 million rows (22 gigs when dumped). This largish table has about 20 different columns of varchar or text.

There are some records that have illegal characters in them, according to postgres 8.1.5, which imposes stricter standards on UTF encoding.

I've been using copy to dump the big table to disk, then try to load it into my new table. When it fails, I use split to break the file into managable chunks and then use ""vi" to find the offending line, then figure out the column. Then I use something like:

create table bad_char_gids as select gid from parcels where position('Ñ' in s_street) > 0;

And so create a table with the ids of the bad records; and then use replace to either replace or eliminate the offending characters from that column. This example got 5001 records, but often it is one record in the whole DB will have some other offending character. I fix the problem in the loaddata as well, and continue.

The problem is that there are errors in quite a few of the columns (but only a few tens of thousands of records), and the offending characters are all quite different (wierd diacritics and characters, upper and lower case). And so this is a very slow process.

Is there any way to get a list of records, even if done repeatedly for each column, that would let me find the offending records in 7.4 which have any invalid UTF chars? I am feeling stupid for not seeing one ... I can find any individual bad character, but I want to find them all at once, if possible.

Try converting the dump files encoding to UTF-8. before 8.1 you could
insert invalid characters into the DB because it accepted other
encodings. It will also dump other encoding. For example, converting
something with windows characters in it.

iconv -f "WINDOWS-1251" -t "UTF-8" dump_file > converted_dump_file

And import the converted file. you may need to try a couple of
different input encodings if you aren't sure what encoding was used when
inserting data into the DB.

Russell.

TIA,

Greg Williamson
DBA
GlobeXplorer LLC

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

-------------------------------------------------------
Click link below if it is SPAM gsw@globexplorer.com
"https://mailscanner.globexplorer.com/dspam/dspam.cgi?signatureID=455d2cee144961034217237&user=gsw@globexplorer.com&retrain=spam&template=history&history_page=1"
!DSPAM:455d2cee144961034217237!
-------------------------------------------------------

#4mike g
mike@thegodshalls.com
In reply to: Gregory S. Williamson (#3)
Re: Eliminating bad characters from a database for

The manual suggests: iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql
dumpfile.sql. The -c option removes invalid character sequences. A diff
of the two files will show the sequences that are invalid. iconv reads
the entire input file into memory so it might be necessary to use split
to break up the dump into multiple smaller files for processing.

Show quoted text

On Thu, 2006-11-16 at 19:38 -0800, Gregory S. Williamson wrote:

Thanks for the suggestion ... since the data involved came from different source, I suspect there may be more than one encoding, but this has great promise.

Greg

-----Original Message-----
From: Russell Smith [mailto:mr-russ@pws.com.au]
Sent: Thu 11/16/2006 7:27 PM
To: Gregory S. Williamson
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Eliminating bad characters from a database for upgrading from 7.4 to 8.1

Gregory S. Williamson wrote:

Dear list,

I have been banging my head against a problem for a few days now, and although I am making progress it is painfully slow, and I am hoping that some one out there can steer me in a better way.

I've got a medium size database (7.4), about 64 gigs of data, about 1/3rd of which is in one table, which has ~32 million rows (22 gigs when dumped). This largish table has about 20 different columns of varchar or text.

There are some records that have illegal characters in them, according to postgres 8.1.5, which imposes stricter standards on UTF encoding.

I've been using copy to dump the big table to disk, then try to load it into my new table. When it fails, I use split to break the file into managable chunks and then use ""vi" to find the offending line, then figure out the column. Then I use something like:

create table bad_char_gids as select gid from parcels where position('Ñ' in s_street) > 0;

And so create a table with the ids of the bad records; and then use replace to either replace or eliminate the offending characters from that column. This example got 5001 records, but often it is one record in the whole DB will have some other offending character. I fix the problem in the loaddata as well, and continue.

The problem is that there are errors in quite a few of the columns (but only a few tens of thousands of records), and the offending characters are all quite different (wierd diacritics and characters, upper and lower case). And so this is a very slow process.

Is there any way to get a list of records, even if done repeatedly for each column, that would let me find the offending records in 7.4 which have any invalid UTF chars? I am feeling stupid for not seeing one ... I can find any individual bad character, but I want to find them all at once, if possible.

Try converting the dump files encoding to UTF-8. before 8.1 you could
insert invalid characters into the DB because it accepted other
encodings. It will also dump other encoding. For example, converting
something with windows characters in it.

iconv -f "WINDOWS-1251" -t "UTF-8" dump_file > converted_dump_file

And import the converted file. you may need to try a couple of
different input encodings if you aren't sure what encoding was used when
inserting data into the DB.

Russell.

TIA,

Greg Williamson
DBA
GlobeXplorer LLC

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

-------------------------------------------------------
Click link below if it is SPAM gsw@globexplorer.com
"https://mailscanner.globexplorer.com/dspam/dspam.cgi?signatureID=455d2cee144961034217237&user=gsw@globexplorer.com&retrain=spam&template=history&history_page=1"
!DSPAM:455d2cee144961034217237!
-------------------------------------------------------

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#5Gregory S. Williamson
gsw@globexplorer.com
In reply to: Gregory S. Williamson (#1)
Re: Eliminating bad characters from a database for upgrading from 7.4 to 8.1

Thanks to you and Russell -- the iconv trick had eluded me for some reason. Serious pain on such a large file, but at least it works (test of the small 22 gig sample)! A little splitting, a little converting, some diff-ing, reassmbly and load. piece o' cake!

Thanks again ... sorry for wasting bandwidth for what seems to have an RFTM question!

G

-----Original Message-----
From: mike [mailto:mike@thegodshalls.com]
Sent: Thu 11/16/2006 7:49 PM
To: Gregory S. Williamson
Cc: Russell Smith; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Eliminating bad characters from a database for upgrading from 7.4 to 8.1

The manual suggests: iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql
dumpfile.sql. The -c option removes invalid character sequences. A diff
of the two files will show the sequences that are invalid. iconv reads
the entire input file into memory so it might be necessary to use split
to break up the dump into multiple smaller files for processing.

On Thu, 2006-11-16 at 19:38 -0800, Gregory S. Williamson wrote:

Thanks for the suggestion ... since the data involved came from different source, I suspect there may be more than one encoding, but this has great promise.

Greg

-----Original Message-----
From: Russell Smith [mailto:mr-russ@pws.com.au]
Sent: Thu 11/16/2006 7:27 PM
To: Gregory S. Williamson
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Eliminating bad characters from a database for upgrading from 7.4 to 8.1

Gregory S. Williamson wrote:

Dear list,

I have been banging my head against a problem for a few days now, and although I am making progress it is painfully slow, and I am hoping that some one out there can steer me in a better way.

I've got a medium size database (7.4), about 64 gigs of data, about 1/3rd of which is in one table, which has ~32 million rows (22 gigs when dumped). This largish table has about 20 different columns of varchar or text.

There are some records that have illegal characters in them, according to postgres 8.1.5, which imposes stricter standards on UTF encoding.

I've been using copy to dump the big table to disk, then try to load it into my new table. When it fails, I use split to break the file into managable chunks and then use ""vi" to find the offending line, then figure out the column. Then I use something like:

create table bad_char_gids as select gid from parcels where position('Ñ' in s_street) > 0;

And so create a table with the ids of the bad records; and then use replace to either replace or eliminate the offending characters from that column. This example got 5001 records, but often it is one record in the whole DB will have some other offending character. I fix the problem in the loaddata as well, and continue.

The problem is that there are errors in quite a few of the columns (but only a few tens of thousands of records), and the offending characters are all quite different (wierd diacritics and characters, upper and lower case). And so this is a very slow process.

Is there any way to get a list of records, even if done repeatedly for each column, that would let me find the offending records in 7.4 which have any invalid UTF chars? I am feeling stupid for not seeing one ... I can find any individual bad character, but I want to find them all at once, if possible.

Try converting the dump files encoding to UTF-8. before 8.1 you could
insert invalid characters into the DB because it accepted other
encodings. It will also dump other encoding. For example, converting
something with windows characters in it.

iconv -f "WINDOWS-1251" -t "UTF-8" dump_file > converted_dump_file

And import the converted file. you may need to try a couple of
different input encodings if you aren't sure what encoding was used when
inserting data into the DB.

Russell.

TIA,

Greg Williamson
DBA
GlobeXplorer LLC

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

-------------------------------------------------------
Click link below if it is SPAM gsw@globexplorer.com
"https://mailscanner.globexplorer.com/dspam/dspam.cgi?signatureID=455d2cee144961034217237&user=gsw@globexplorer.com&retrain=spam&template=history&history_page=1"

-------------------------------------------------------

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

-------------------------------------------------------
Click link below if it is SPAM gsw@globexplorer.com
"https://mailscanner.globexplorer.com/dspam/dspam.cgi?signatureID=455d31d7146531336712104&user=gsw@globexplorer.com&retrain=spam&template=history&history_page=1"
!DSPAM:455d31d7146531336712104!
-------------------------------------------------------

#6Vick Khera
vivek@khera.org
In reply to: Gregory S. Williamson (#3)
Re: Eliminating bad characters from a database for upgrading from 7.4 to 8.1

On Nov 16, 2006, at 10:38 PM, Gregory S. Williamson wrote:

Try converting the dump files encoding to UTF-8. before 8.1 you could
insert invalid characters into the DB because it accepted other
encodings. It will also dump other encoding. For example, converting
something with windows characters in it.

iconv -f "WINDOWS-1251" -t "UTF-8" dump_file > converted_dump_file

And import the converted file. you may need to try a couple of
different input encodings if you aren't sure what encoding was used
when
inserting data into the DB.

I had a similar situation a few months back. I just ended up doing a
translation from UTF8 to UTF8 and dropping bad characters. The db
itself had multiple different encodings for different records: some
were windows encoding, some were other. It was just a mess. I'm
glad 8.1 is more strict about this!

Attachments:

smime.p7sapplication/pkcs7-signature; name=smime.p7sDownload
#7Gurjeet Singh
singh.gurjeet@gmail.com
In reply to: Gregory S. Williamson (#5)
Re: Eliminating bad characters from a database for upgrading from 7.4 to 8.1

On 11/17/06, Gregory S. Williamson <gsw@globexplorer.com> wrote:

Thanks again ... sorry for wasting bandwidth for what seems to have an
RFTM question!

I am sure this is not wasted bandwidth; it sure was interesting to follow
and will be of great help to others.

BTW... what does RFTM stand for? I couldn't find it using `define:RFTM` in
google.

Best regards,

--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com

#8Mikko Partio
dun@paprikalaatikko.be
In reply to: Gurjeet Singh (#7)
Re: Eliminating bad characters from a database for upgrading

Gurjeet Singh wrote:

BTW... what does RFTM stand for? I couldn't find it using `define:RFTM`
in google.

Read F*ing The Manual? :)

MP

#9Richard Broersma Jr
rabroersma@yahoo.com
In reply to: Gurjeet Singh (#7)
Re: Eliminating bad characters from a database for upgrading from 7.4 to 8.1

BTW... what does RFTM stand for? I couldn't find it using `define:RFTM` in
google.

I think that RFTM is a typo and should have been RTFM:

http://www.acronymfinder.com/af-query.asp?Acronym=RTFM&amp;string=exact

Regards,

Richard Broersma Jr.

#10Gurjeet Singh
singh.gurjeet@gmail.com
In reply to: Mikko Partio (#8)
Re: Eliminating bad characters from a database for upgrading from 7.4 to 8.1

On 11/20/06, Mikko Partio <dun@paprikalaatikko.be> wrote:

Gurjeet Singh wrote:

BTW... what does RFTM stand for

Read F*ing The Manual? :)

I have to say that this definitely wasn't an RTFM question.

All, RTFM.

--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com