Best practices for moving UTF8 databases

Started by Phoenix Kiulaover 16 years ago27 messagesgeneral
Jump to latest
#1Phoenix Kiula
phoenix.kiula@gmail.com

Hi. I *always* get an error moving my current fully utf-8 database
data into a new DB.

My server has the version 8.3 with a five year old DB. Everything, all
collation, LC_LOCALE etc are all utf8.

When I install a new Postgresql 8.4 on my home Mac OSX machine (after
losing some hair) I set everything about a new database to be utf8. At
least anything I could see in PgAdmin.

But when I pull in the data dump from the server I always see that
error from the utf8 mismatch and such.

So, my question. What is a good way to make sure that error does NOT
occur? I simply wish to replicate the server database on another PG
installation. What should one do?

#2Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Phoenix Kiula (#1)
Re: Best practices for moving UTF8 databases

Phoenix Kiula wrote:

Hi. I *always* get an error moving my current fully utf-8 database
data into a new DB.

My server has the version 8.3 with a five year old DB. Everything, all
collation, LC_LOCALE etc are all utf8.

When I install a new Postgresql 8.4 on my home Mac OSX machine (after
losing some hair) I set everything about a new database to be utf8. At
least anything I could see in PgAdmin.

But when I pull in the data dump from the server I always see that
error from the utf8 mismatch and such.

So, my question. What is a good way to make sure that error does NOT
occur? I simply wish to replicate the server database on another PG
installation. What should one do?

Improve PostgreSQL so that there are no more loopholes how you can
enter incorrect data :^(

8.3 is about one and a half years old, so your database must have
oribinated with an older release. Correctness checks have improved
considerably over the years.

What you will have to do it track down the rows containing incorrect data
and fix them.

Yours,
Laurenz Albe

#3Michael Gould
mgould@intermodalsoftwaresolutions.net
In reply to: Laurenz Albe (#2)
Request for features

I would like to know what the formal method of requesting new features are
for Postgres and how are decisions made which features are included in a
future release or not.

Best Regards
--
Michael Gould, Managing Partner
Intermodal Software Solutions, LLC
904.226.0978
904.592.5250 fax

#4Bruce Momjian
bruce@momjian.us
In reply to: Michael Gould (#3)
Re: Request for features

Michael Gould wrote:

I would like to know what the formal method of requesting new features are
for Postgres and how are decisions made which features are included in a
future release or not.

Uh, you usually ask for the feature on the bugs or hackers email lists,
and if we consider it useful it will be added to the TODO list. Then,
someday, someone might implement your feature. Not very rigorous, but
that is the only method we have.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#5Andres Freund
andres@anarazel.de
In reply to: Phoenix Kiula (#1)
Re: Best practices for moving UTF8 databases

On Sunday 12 July 2009 13:19:50 Phoenix Kiula wrote:

Hi. I *always* get an error moving my current fully utf-8 database
data into a new DB.

My server has the version 8.3 with a five year old DB. Everything, all
collation, LC_LOCALE etc are all utf8.

When I install a new Postgresql 8.4 on my home Mac OSX machine (after
losing some hair) I set everything about a new database to be utf8. At
least anything I could see in PgAdmin.

But when I pull in the data dump from the server I always see that
error from the utf8 mismatch and such.

So, my question. What is a good way to make sure that error does NOT
occur? I simply wish to replicate the server database on another PG
installation. What should one do?

What is the _exact_ error you get? During which statement?

Andres

#6Jasen Betts
jasen@xnet.co.nz
In reply to: Phoenix Kiula (#1)
Re: Best practices for moving UTF8 databases

On 2009-07-13, Andres Freund <andres@anarazel.de> wrote:

On Sunday 12 July 2009 13:19:50 Phoenix Kiula wrote:

Hi. I *always* get an error moving my current fully utf-8 database
data into a new DB.

My server has the version 8.3 with a five year old DB. Everything, all
collation, LC_LOCALE etc are all utf8.

When I install a new Postgresql 8.4 on my home Mac OSX machine (after
losing some hair) I set everything about a new database to be utf8. At
least anything I could see in PgAdmin.

But when I pull in the data dump from the server I always see that
error from the utf8 mismatch and such.

So, my question. What is a good way to make sure that error does NOT
occur? I simply wish to replicate the server database on another PG
installation. What should one do?

What is the _exact_ error you get? During which statement?

Andres

if you do an ascii dump and the dump starts out "SET CLIENT ENCODING 'UTF8'"
or similar but you still get errors.

run it through run it through "iconv -f UFT8 -t UTF8//IGNORE"
that'll drop any illegal symbols. In theory that's a reduction in data
integrity.

iconv doesn't seem to have an option to replace them with U+FFFD :(
I had hoped that //TRASNSLIT would do that, but no.

#7Andres Freund
andres@anarazel.de
In reply to: Jasen Betts (#6)
Re: Best practices for moving UTF8 databases

On Tuesday 14 July 2009 11:36:57 Jasen Betts wrote:

On 2009-07-13, Andres Freund <andres@anarazel.de> wrote:

On Sunday 12 July 2009 13:19:50 Phoenix Kiula wrote:

Hi. I *always* get an error moving my current fully utf-8 database
data into a new DB.

My server has the version 8.3 with a five year old DB. Everything, all
collation, LC_LOCALE etc are all utf8.

When I install a new Postgresql 8.4 on my home Mac OSX machine (after
losing some hair) I set everything about a new database to be utf8. At
least anything I could see in PgAdmin.

But when I pull in the data dump from the server I always see that
error from the utf8 mismatch and such.

So, my question. What is a good way to make sure that error does NOT
occur? I simply wish to replicate the server database on another PG
installation. What should one do?

What is the _exact_ error you get? During which statement?

if you do an ascii dump and the dump starts out "SET CLIENT ENCODING
'UTF8'" or similar but you still get errors.

Do you mean that a dump from SQL_ASCII can yield non-utf8 data? right. But
According to the OP his 8.3 database is UTF8...
So there should not be invalid data in there.

Andres

#8Jasen Betts
jasen@xnet.co.nz
In reply to: Laurenz Albe (#2)
Re: Request for features

On 2009-07-13, Michael Gould <mgould@intermodalsoftwaresolutions.net> wrote:

I would like to know what the formal method of requesting new features are
for Postgres and how are decisions made which features are included in a
future release or not.

Formal?

for open soure software in general, changes are are requested using
the bug reporting system.

Change requests accompanied with working source patches have an advantage.

Grants of cash to the developers tend to help too.

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#7)
Re: Best practices for moving UTF8 databases

Andres Freund wrote:

On Tuesday 14 July 2009 11:36:57 Jasen Betts wrote:

if you do an ascii dump and the dump starts out "SET CLIENT ENCODING
'UTF8'" or similar but you still get errors.

Do you mean that a dump from SQL_ASCII can yield non-utf8 data? right. But
According to the OP his 8.3 database is UTF8...
So there should not be invalid data in there.

I haven't followed this thread, but older PG versions had less strict
checks on UTF8 data, which meant that some invalid data could creep in.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#10Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#9)
Re: Best practices for moving UTF8 databases

On Tuesday 14 July 2009 15:52:29 Alvaro Herrera wrote:

Andres Freund wrote:

On Tuesday 14 July 2009 11:36:57 Jasen Betts wrote:

if you do an ascii dump and the dump starts out "SET CLIENT ENCODING
'UTF8'" or similar but you still get errors.

Do you mean that a dump from SQL_ASCII can yield non-utf8 data? right.
But According to the OP his 8.3 database is UTF8...
So there should not be invalid data in there.

I haven't followed this thread, but older PG versions had less strict
checks on UTF8 data, which meant that some invalid data could creep in.

But that was from 8.2 -> 8.3 and not 8.3->8.4 I think?

Andres

#11David Fetter
david@fetter.org
In reply to: Jasen Betts (#8)
Re: Request for features

On Tue, Jul 14, 2009 at 10:00:04AM +0000, Jasen Betts wrote:

On 2009-07-13, Michael Gould <mgould@intermodalsoftwaresolutions.net> wrote:

I would like to know what the formal method of requesting new features are
for Postgres and how are decisions made which features are included in a
future release or not.

Formal?

for open soure software in general, changes are are requested using
the bug reporting system.

Not in PostgreSQL, generally.

Change requests accompanied with working source patches have an
advantage.

Not always. It's much better to discuss the design of the feature on
-hackers, come to some rough consensus, *then* write some code than to
come up with some huge wonk of code which no one is interested in
understanding and will languish.

Grants of cash to the developers tend to help too.

There are several outfits that can help you organize your donations.
Software in the Public Interest, an non-profit company in the US, has
a way to earmark donations for the PostgreSQL project.

Other ways to contribute resources include hardware and/or people
available to the project.

Being a pleasant, helpful part of the community will incline everyone
else favorably toward setting your ideas at a higher priority, even if
some of them don't fly :)

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#12Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Alvaro Herrera (#9)
Re: Best practices for moving UTF8 databases

On Tue, Jul 14, 2009 at 9:52 PM, Alvaro
Herrera<alvherre@commandprompt.com> wrote:

Andres Freund wrote:

On Tuesday 14 July 2009 11:36:57 Jasen Betts wrote:

if you do an ascii dump and the dump starts out "SET CLIENT ENCODING
'UTF8'" or similar but you still get errors.

Do you mean that a dump from SQL_ASCII can yield non-utf8 data? right. But
According to the OP his 8.3 database is UTF8...
So there should not be invalid data in there.

I haven't followed this thread, but older PG versions had less strict
checks on UTF8 data, which meant that some invalid data could creep in.

If so, how can I check for them in my old database, which is 8.2.9?
I'm now moving first to 8.3 (then to the 84).

Really, PG absolutely needs a way to upgrade the database without so
much data related downtime and all these silly woes. Several competing
database systems are a cinch to upgrade.

Anyway this is the annoying error I see as always:

ERROR: invalid byte sequence for encoding "UTF8": 0x80

I think my old DB is all utf8. If there are a few characters that are
not, how can I work with this? I've done everything I can to take care
of the encoding and such. This code was used to initdb:

initdb --locale=en_US.UTF-8 --encoding=UTF8

Locale environment variables are all "en_US.UTF-8" too.

Thanks for any pointers!

#13Martijn van Oosterhout
kleptog@svana.org
In reply to: Phoenix Kiula (#12)
Re: Best practices for moving UTF8 databases

On Sun, Jul 19, 2009 at 10:16:17AM +0800, Phoenix Kiula wrote:

If so, how can I check for them in my old database, which is 8.2.9?
I'm now moving first to 8.3 (then to the 84).

Really, PG absolutely needs a way to upgrade the database without so
much data related downtime and all these silly woes. Several competing
database systems are a cinch to upgrade.

Look through the archives, there are scripts that will scan all your
text fields for UTF-8 problems. If you run them once you can clear out
all the problems prior to upgrading.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.

#14Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Martijn van Oosterhout (#13)
Re: Best practices for moving UTF8 databases

On Sun, Jul 19, 2009 at 7:08 PM, Martijn van
Oosterhout<kleptog@svana.org> wrote:

On Sun, Jul 19, 2009 at 10:16:17AM +0800, Phoenix Kiula wrote:

Look through the archives, there are scripts that will scan all your
text fields for UTF-8 problems. If you run them once you can clear out
all the problems prior to upgrading.

Thanks Martin. I tried searching through the archives and could only
come with something like this:

http://docs.moodle.org/en/UTF-8_PostgreSQL

But this only has the usual iconv stuff suggested.

Could you pls suggest some specific scripts or URLs? Or let me know
the keywords I should look for?

Thanks!

#15Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Phoenix Kiula (#12)
Re: Best practices for moving UTF8 databases

Phoenix Kiula wrote:

Really, PG absolutely needs a way to upgrade the database without so
much data related downtime and all these silly woes. Several competing
database systems are a cinch to upgrade.

I'd call it data corruption, not a silly woe.

I know that Oracle for example would not make that much fuss about
your data: they would be imported without even a warning, and
depending on your encoding settings the bad bytes would either be
imported as-is or tacitly changed to inverted (or normal) question
marks.

It's basically a design choice that PostgreSQL made: we think that
an error is preferrable to clandestinely modifying the user's data
or accepting input that cannot possibly make any sense when it is
retrieved at a future time.

Anyway this is the annoying error I see as always:

ERROR: invalid byte sequence for encoding "UTF8": 0x80

I think my old DB is all utf8. If there are a few characters that are
not, how can I work with this? I've done everything I can to take care
of the encoding and such. This code was used to initdb:

initdb --locale=en_US.UTF-8 --encoding=UTF8

Locale environment variables are all "en_US.UTF-8" too.

"0x80" makes me think of the following:
The data originate from a Windows system, where 0x80 is a Euro
sign. Somehow these were imported into PostgreSQL without the
appropriate translation into UTF-8 (how I do not know).

I wonder: why do you spend so much time complaining instead of
simply locating the buggy data and fixing them?

This does not incur any downtime (you can fix the data in the old
database before migrating), and it will definitely enhance the fun
your users have with your database (if they actually see Euros where
they should be).

Yours,
Laurenz Albe

#16Martijn van Oosterhout
kleptog@svana.org
In reply to: Phoenix Kiula (#14)
Re: Best practices for moving UTF8 databases

On Mon, Jul 20, 2009 at 10:32:15AM +0800, Phoenix Kiula wrote:

Thanks Martin. I tried searching through the archives and could only
come with something like this:

http://docs.moodle.org/en/UTF-8_PostgreSQL

But this only has the usual iconv stuff suggested.

Could you pls suggest some specific scripts or URLs? Or let me know
the keywords I should look for?

I was thinking of this one:

http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg68541.html

You can use it to find the bogus strings in your existing database and
fix them. It's not really fast, but maybe it'll help.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.

#17Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Laurenz Albe (#15)
Re: Best practices for moving UTF8 databases

"0x80" makes me think of the following:
The data originate from a Windows system, where 0x80 is a Euro
sign. Somehow these were imported into PostgreSQL without the
appropriate translation into UTF-8 (how I do not know).

I wonder: why do you spend so much time complaining instead of
simply locating the buggy data and fixing them?

I'd love to fix them. But if I do a search for

SELECT * FROM xyz WHERE col like '%0x80%'

it doesn't work. How should I search for these characters?

Thanks much

#18Daniel Verite
daniel@manitou-mail.org
In reply to: Phoenix Kiula (#17)
Re: Best practices for moving UTF8 databases

Phoenix Kiula wrote:

I'd love to fix them. But if I do a search for

SELECT * FROM xyz WHERE col like '%0x80%'

it doesn't work. How should I search for these characters?

In 8.2, try: WHERE strpos(col, E'\x80') > 0

Note that this may find valid data as well, because the error you get
is when 0x80 is the first byte of a character in UTF8; when it's at
another position, you don't want to change it.

Best regards,
--
Daniel
PostgreSQL-powered mail user agent and storage:
http://www.manitou-mail.org

#19Sam Mason
sam@samason.me.uk
In reply to: Daniel Verite (#18)
Re: Best practices for moving UTF8 databases

On Tue, Jul 21, 2009 at 09:37:04AM +0200, Daniel Verite wrote:

I'd love to fix them. But if I do a search for
SELECT * FROM xyz WHERE col like '%0x80%'

it doesn't work. How should I search for these characters?

In 8.2, try: WHERE strpos(col, E'\x80') > 0

Note that this may find valid data as well, because the error you get
is when 0x80 is the first byte of a character in UTF8; when it's at
another position, you don't want to change it.

There are various regexs around to check for valid UTF-8 encoding; one
appears to be:

http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex

One translation into PG would be:

WHERE NOT col ~ ( '^('||
$$[\09\0A\0D\x20-\x7E]|$$|| -- ASCII
$$[\xC2-\xDF][\x80-\xBF]|$$|| -- non-overlong 2-byte
$$\xE0[\xA0-\xBF][\x80-\xBF]|$$|| -- excluding overlongs
$$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$|| -- straight 3-byte
$$\xED[\x80-\x9F][\x80-\xBF]|$$|| -- excluding surrogates
$$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$|| -- planes 1-3
$$[\xF1-\xF3][\x80-\xBF]{3}|$$|| -- planes 4-15
$$\xF4[\x80-\x8F][\x80-\xBF]{2}$$|| -- plane 16
'*)$' );

This seems to do the right thing for me in an SQL_ASCII database.

--
Sam http://samason.me.uk/

#20Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Phoenix Kiula (#17)
Re: Best practices for moving UTF8 databases

Phoenix Kiula wrote:

I wonder: why do you spend so much time complaining instead of
simply locating the buggy data and fixing them?

I'd love to fix them. But if I do a search for

SELECT * FROM xyz WHERE col like '%0x80%'

it doesn't work. How should I search for these characters?

I would get GNU libiconv:
http://www.gnu.org/software/libiconv/

Then take a plain text pg_dump of the database in UTF-8.
Let's call it "db.orig.dmp".

Strip all problem characters:
iconv -f UTF-8 -t UTF-8 -c <db.orig.dmp >db.stripped.sql

Compare both files:
diff -u db.orig.dmp db.stripped.sql

That output will show all lines containing a problem character.

Now the tedious part:
Use "db.orig.dmp" to find out which tables they belong to,
locate the records in the database by primary key and fix them.

Alternatively, you can use iconv's auto-repair if you know which
bytes give you a problem.
For example, if you know that the trouble stems only from 0x80 bytes
that should be Euro symbols, you could:
iconv -f UTF-8 -t UTF-8 --byte-subst="<0x%x>" <db.orig.dmp | sed -e 's/<0x80>/EUR/g' >db.fixed.sql

The resulting "db.fixed.sql" could then be loaded into the new database.

Yours,
Laurenz Albe

#21Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Sam Mason (#19)
#22Justin Pasher
justinp@newmediagateway.com
In reply to: Phoenix Kiula (#21)
#23Sam Mason
sam@samason.me.uk
In reply to: Phoenix Kiula (#21)
#24Jonathan Vanasco
postgres@2xlp.com
In reply to: Phoenix Kiula (#21)
#25Vick Khera
vivek@khera.org
In reply to: Jonathan Vanasco (#24)
#26Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Vick Khera (#25)
#27Jonathan Vanasco
postgres@2xlp.com
In reply to: Vick Khera (#25)