Best practices for moving UTF8 databases
Hi. I *always* get an error moving my current fully utf-8 database
data into a new DB.
My server has the version 8.3 with a five year old DB. Everything, all
collation, LC_LOCALE etc are all utf8.
When I install a new Postgresql 8.4 on my home Mac OSX machine (after
losing some hair) I set everything about a new database to be utf8. At
least anything I could see in PgAdmin.
But when I pull in the data dump from the server I always see that
error from the utf8 mismatch and such.
So, my question. What is a good way to make sure that error does NOT
occur? I simply wish to replicate the server database on another PG
installation. What should one do?
Phoenix Kiula wrote:
Hi. I *always* get an error moving my current fully utf-8 database
data into a new DB.My server has the version 8.3 with a five year old DB. Everything, all
collation, LC_LOCALE etc are all utf8.When I install a new Postgresql 8.4 on my home Mac OSX machine (after
losing some hair) I set everything about a new database to be utf8. At
least anything I could see in PgAdmin.But when I pull in the data dump from the server I always see that
error from the utf8 mismatch and such.So, my question. What is a good way to make sure that error does NOT
occur? I simply wish to replicate the server database on another PG
installation. What should one do?
Improve PostgreSQL so that there are no more loopholes how you can
enter incorrect data :^(
8.3 is about one and a half years old, so your database must have
oribinated with an older release. Correctness checks have improved
considerably over the years.
What you will have to do it track down the rows containing incorrect data
and fix them.
Yours,
Laurenz Albe
I would like to know what the formal method of requesting new features are
for Postgres and how are decisions made which features are included in a
future release or not.
Best Regards
--
Michael Gould, Managing Partner
Intermodal Software Solutions, LLC
904.226.0978
904.592.5250 fax
Michael Gould wrote:
I would like to know what the formal method of requesting new features are
for Postgres and how are decisions made which features are included in a
future release or not.
Uh, you usually ask for the feature on the bugs or hackers email lists,
and if we consider it useful it will be added to the TODO list. Then,
someday, someone might implement your feature. Not very rigorous, but
that is the only method we have.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Sunday 12 July 2009 13:19:50 Phoenix Kiula wrote:
Hi. I *always* get an error moving my current fully utf-8 database
data into a new DB.My server has the version 8.3 with a five year old DB. Everything, all
collation, LC_LOCALE etc are all utf8.When I install a new Postgresql 8.4 on my home Mac OSX machine (after
losing some hair) I set everything about a new database to be utf8. At
least anything I could see in PgAdmin.But when I pull in the data dump from the server I always see that
error from the utf8 mismatch and such.So, my question. What is a good way to make sure that error does NOT
occur? I simply wish to replicate the server database on another PG
installation. What should one do?
What is the _exact_ error you get? During which statement?
Andres
On 2009-07-13, Andres Freund <andres@anarazel.de> wrote:
On Sunday 12 July 2009 13:19:50 Phoenix Kiula wrote:
Hi. I *always* get an error moving my current fully utf-8 database
data into a new DB.My server has the version 8.3 with a five year old DB. Everything, all
collation, LC_LOCALE etc are all utf8.When I install a new Postgresql 8.4 on my home Mac OSX machine (after
losing some hair) I set everything about a new database to be utf8. At
least anything I could see in PgAdmin.But when I pull in the data dump from the server I always see that
error from the utf8 mismatch and such.So, my question. What is a good way to make sure that error does NOT
occur? I simply wish to replicate the server database on another PG
installation. What should one do?What is the _exact_ error you get? During which statement?
Andres
if you do an ascii dump and the dump starts out "SET CLIENT ENCODING 'UTF8'"
or similar but you still get errors.
run it through run it through "iconv -f UFT8 -t UTF8//IGNORE"
that'll drop any illegal symbols. In theory that's a reduction in data
integrity.
iconv doesn't seem to have an option to replace them with U+FFFD :(
I had hoped that //TRASNSLIT would do that, but no.
On Tuesday 14 July 2009 11:36:57 Jasen Betts wrote:
On 2009-07-13, Andres Freund <andres@anarazel.de> wrote:
On Sunday 12 July 2009 13:19:50 Phoenix Kiula wrote:
Hi. I *always* get an error moving my current fully utf-8 database
data into a new DB.My server has the version 8.3 with a five year old DB. Everything, all
collation, LC_LOCALE etc are all utf8.When I install a new Postgresql 8.4 on my home Mac OSX machine (after
losing some hair) I set everything about a new database to be utf8. At
least anything I could see in PgAdmin.But when I pull in the data dump from the server I always see that
error from the utf8 mismatch and such.So, my question. What is a good way to make sure that error does NOT
occur? I simply wish to replicate the server database on another PG
installation. What should one do?What is the _exact_ error you get? During which statement?
if you do an ascii dump and the dump starts out "SET CLIENT ENCODING
'UTF8'" or similar but you still get errors.
Do you mean that a dump from SQL_ASCII can yield non-utf8 data? right. But
According to the OP his 8.3 database is UTF8...
So there should not be invalid data in there.
Andres
On 2009-07-13, Michael Gould <mgould@intermodalsoftwaresolutions.net> wrote:
I would like to know what the formal method of requesting new features are
for Postgres and how are decisions made which features are included in a
future release or not.
Formal?
for open soure software in general, changes are are requested using
the bug reporting system.
Change requests accompanied with working source patches have an advantage.
Grants of cash to the developers tend to help too.
Andres Freund wrote:
On Tuesday 14 July 2009 11:36:57 Jasen Betts wrote:
if you do an ascii dump and the dump starts out "SET CLIENT ENCODING
'UTF8'" or similar but you still get errors.Do you mean that a dump from SQL_ASCII can yield non-utf8 data? right. But
According to the OP his 8.3 database is UTF8...
So there should not be invalid data in there.
I haven't followed this thread, but older PG versions had less strict
checks on UTF8 data, which meant that some invalid data could creep in.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tuesday 14 July 2009 15:52:29 Alvaro Herrera wrote:
Andres Freund wrote:
On Tuesday 14 July 2009 11:36:57 Jasen Betts wrote:
if you do an ascii dump and the dump starts out "SET CLIENT ENCODING
'UTF8'" or similar but you still get errors.Do you mean that a dump from SQL_ASCII can yield non-utf8 data? right.
But According to the OP his 8.3 database is UTF8...
So there should not be invalid data in there.I haven't followed this thread, but older PG versions had less strict
checks on UTF8 data, which meant that some invalid data could creep in.
But that was from 8.2 -> 8.3 and not 8.3->8.4 I think?
Andres
On Tue, Jul 14, 2009 at 10:00:04AM +0000, Jasen Betts wrote:
On 2009-07-13, Michael Gould <mgould@intermodalsoftwaresolutions.net> wrote:
I would like to know what the formal method of requesting new features are
for Postgres and how are decisions made which features are included in a
future release or not.Formal?
for open soure software in general, changes are are requested using
the bug reporting system.
Not in PostgreSQL, generally.
Change requests accompanied with working source patches have an
advantage.
Not always. It's much better to discuss the design of the feature on
-hackers, come to some rough consensus, *then* write some code than to
come up with some huge wonk of code which no one is interested in
understanding and will languish.
Grants of cash to the developers tend to help too.
There are several outfits that can help you organize your donations.
Software in the Public Interest, an non-profit company in the US, has
a way to earmark donations for the PostgreSQL project.
Other ways to contribute resources include hardware and/or people
available to the project.
Being a pleasant, helpful part of the community will incline everyone
else favorably toward setting your ideas at a higher priority, even if
some of them don't fly :)
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Tue, Jul 14, 2009 at 9:52 PM, Alvaro
Herrera<alvherre@commandprompt.com> wrote:
Andres Freund wrote:
On Tuesday 14 July 2009 11:36:57 Jasen Betts wrote:
if you do an ascii dump and the dump starts out "SET CLIENT ENCODING
'UTF8'" or similar but you still get errors.Do you mean that a dump from SQL_ASCII can yield non-utf8 data? right. But
According to the OP his 8.3 database is UTF8...
So there should not be invalid data in there.I haven't followed this thread, but older PG versions had less strict
checks on UTF8 data, which meant that some invalid data could creep in.
If so, how can I check for them in my old database, which is 8.2.9?
I'm now moving first to 8.3 (then to the 84).
Really, PG absolutely needs a way to upgrade the database without so
much data related downtime and all these silly woes. Several competing
database systems are a cinch to upgrade.
Anyway this is the annoying error I see as always:
ERROR: invalid byte sequence for encoding "UTF8": 0x80
I think my old DB is all utf8. If there are a few characters that are
not, how can I work with this? I've done everything I can to take care
of the encoding and such. This code was used to initdb:
initdb --locale=en_US.UTF-8 --encoding=UTF8
Locale environment variables are all "en_US.UTF-8" too.
Thanks for any pointers!
On Sun, Jul 19, 2009 at 10:16:17AM +0800, Phoenix Kiula wrote:
If so, how can I check for them in my old database, which is 8.2.9?
I'm now moving first to 8.3 (then to the 84).Really, PG absolutely needs a way to upgrade the database without so
much data related downtime and all these silly woes. Several competing
database systems are a cinch to upgrade.
Look through the archives, there are scripts that will scan all your
text fields for UTF-8 problems. If you run them once you can clear out
all the problems prior to upgrading.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.
On Sun, Jul 19, 2009 at 7:08 PM, Martijn van
Oosterhout<kleptog@svana.org> wrote:
On Sun, Jul 19, 2009 at 10:16:17AM +0800, Phoenix Kiula wrote:
Look through the archives, there are scripts that will scan all your
text fields for UTF-8 problems. If you run them once you can clear out
all the problems prior to upgrading.
Thanks Martin. I tried searching through the archives and could only
come with something like this:
http://docs.moodle.org/en/UTF-8_PostgreSQL
But this only has the usual iconv stuff suggested.
Could you pls suggest some specific scripts or URLs? Or let me know
the keywords I should look for?
Thanks!
Phoenix Kiula wrote:
Really, PG absolutely needs a way to upgrade the database without so
much data related downtime and all these silly woes. Several competing
database systems are a cinch to upgrade.
I'd call it data corruption, not a silly woe.
I know that Oracle for example would not make that much fuss about
your data: they would be imported without even a warning, and
depending on your encoding settings the bad bytes would either be
imported as-is or tacitly changed to inverted (or normal) question
marks.
It's basically a design choice that PostgreSQL made: we think that
an error is preferrable to clandestinely modifying the user's data
or accepting input that cannot possibly make any sense when it is
retrieved at a future time.
Anyway this is the annoying error I see as always:
ERROR: invalid byte sequence for encoding "UTF8": 0x80
I think my old DB is all utf8. If there are a few characters that are
not, how can I work with this? I've done everything I can to take care
of the encoding and such. This code was used to initdb:initdb --locale=en_US.UTF-8 --encoding=UTF8
Locale environment variables are all "en_US.UTF-8" too.
"0x80" makes me think of the following:
The data originate from a Windows system, where 0x80 is a Euro
sign. Somehow these were imported into PostgreSQL without the
appropriate translation into UTF-8 (how I do not know).
I wonder: why do you spend so much time complaining instead of
simply locating the buggy data and fixing them?
This does not incur any downtime (you can fix the data in the old
database before migrating), and it will definitely enhance the fun
your users have with your database (if they actually see Euros where
they should be).
Yours,
Laurenz Albe
On Mon, Jul 20, 2009 at 10:32:15AM +0800, Phoenix Kiula wrote:
Thanks Martin. I tried searching through the archives and could only
come with something like this:http://docs.moodle.org/en/UTF-8_PostgreSQL
But this only has the usual iconv stuff suggested.
Could you pls suggest some specific scripts or URLs? Or let me know
the keywords I should look for?
I was thinking of this one:
http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg68541.html
You can use it to find the bogus strings in your existing database and
fix them. It's not really fast, but maybe it'll help.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.
"0x80" makes me think of the following:
The data originate from a Windows system, where 0x80 is a Euro
sign. Somehow these were imported into PostgreSQL without the
appropriate translation into UTF-8 (how I do not know).I wonder: why do you spend so much time complaining instead of
simply locating the buggy data and fixing them?
I'd love to fix them. But if I do a search for
SELECT * FROM xyz WHERE col like '%0x80%'
it doesn't work. How should I search for these characters?
Thanks much
Phoenix Kiula wrote:
I'd love to fix them. But if I do a search for
SELECT * FROM xyz WHERE col like '%0x80%'
it doesn't work. How should I search for these characters?
In 8.2, try: WHERE strpos(col, E'\x80') > 0
Note that this may find valid data as well, because the error you get
is when 0x80 is the first byte of a character in UTF8; when it's at
another position, you don't want to change it.
Best regards,
--
Daniel
PostgreSQL-powered mail user agent and storage:
http://www.manitou-mail.org
On Tue, Jul 21, 2009 at 09:37:04AM +0200, Daniel Verite wrote:
I'd love to fix them. But if I do a search for
SELECT * FROM xyz WHERE col like '%0x80%'it doesn't work. How should I search for these characters?
In 8.2, try: WHERE strpos(col, E'\x80') > 0
Note that this may find valid data as well, because the error you get
is when 0x80 is the first byte of a character in UTF8; when it's at
another position, you don't want to change it.
There are various regexs around to check for valid UTF-8 encoding; one
appears to be:
http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex
One translation into PG would be:
WHERE NOT col ~ ( '^('||
$$[\09\0A\0D\x20-\x7E]|$$|| -- ASCII
$$[\xC2-\xDF][\x80-\xBF]|$$|| -- non-overlong 2-byte
$$\xE0[\xA0-\xBF][\x80-\xBF]|$$|| -- excluding overlongs
$$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$|| -- straight 3-byte
$$\xED[\x80-\x9F][\x80-\xBF]|$$|| -- excluding surrogates
$$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$|| -- planes 1-3
$$[\xF1-\xF3][\x80-\xBF]{3}|$$|| -- planes 4-15
$$\xF4[\x80-\x8F][\x80-\xBF]{2}$$|| -- plane 16
'*)$' );
This seems to do the right thing for me in an SQL_ASCII database.
--
Sam http://samason.me.uk/
Phoenix Kiula wrote:
I wonder: why do you spend so much time complaining instead of
simply locating the buggy data and fixing them?I'd love to fix them. But if I do a search for
SELECT * FROM xyz WHERE col like '%0x80%'
it doesn't work. How should I search for these characters?
I would get GNU libiconv:
http://www.gnu.org/software/libiconv/
Then take a plain text pg_dump of the database in UTF-8.
Let's call it "db.orig.dmp".
Strip all problem characters:
iconv -f UTF-8 -t UTF-8 -c <db.orig.dmp >db.stripped.sql
Compare both files:
diff -u db.orig.dmp db.stripped.sql
That output will show all lines containing a problem character.
Now the tedious part:
Use "db.orig.dmp" to find out which tables they belong to,
locate the records in the database by primary key and fix them.
Alternatively, you can use iconv's auto-repair if you know which
bytes give you a problem.
For example, if you know that the trouble stems only from 0x80 bytes
that should be Euro symbols, you could:
iconv -f UTF-8 -t UTF-8 --byte-subst="<0x%x>" <db.orig.dmp | sed -e 's/<0x80>/EUR/g' >db.fixed.sql
The resulting "db.fixed.sql" could then be loaded into the new database.
Yours,
Laurenz Albe