Dumping in LATIN1 and restoring in UTF-8
Hi all.
Here is my use case: I've an application which uses PostgreSQL as
backend. Up to now, the database was encoded in SQL_ASCII or LATIN1.
Now, we need to migrate to UTF-8.
What we tried, was to:
1) dump the database using pg_dump, in tar format (we had blob);
2) modifying the result, using some conversion tool (like recode)
3) destroying the old database
4) recreating the database with UNICODE setting
5) restoring the database using pg_restore
The result was not what I expected. The pg_restore was using the
LATIN1 encoding to encode the strings, resulting in a LATIN1 encoded
in UTF-8...
The problem lied in the toc.dat file, which stated that the client
encoding was LATIN1, instead of UTF-8.
The solution in the end has been to manually modifying the toc.dat
file, substituting the LATIN1 string with UTF-8 (plus a space, since
the toc.dat is a binary file).
Even though it worked for us, I wonder if there is any other way to
accomplish the same result, at least to specify the encoding for the
restore.
Regards
Marco
--
Marco Bizzarri
http://notenotturne.blogspot.com/
Marco Bizzarri schrieb:
Hi all.
Here is my use case: I've an application which uses PostgreSQL as
backend. Up to now, the database was encoded in SQL_ASCII or LATIN1.
Now, we need to migrate to UTF-8.What we tried, was to:
1) dump the database using pg_dump, in tar format (we had blob);
2) modifying the result, using some conversion tool (like recode)3) destroying the old database
4) recreating the database with UNICODE setting
5) restoring the database using pg_restoreThe result was not what I expected. The pg_restore was using the
LATIN1 encoding to encode the strings, resulting in a LATIN1 encoded
in UTF-8...The problem lied in the toc.dat file, which stated that the client
encoding was LATIN1, instead of UTF-8.The solution in the end has been to manually modifying the toc.dat
file, substituting the LATIN1 string with UTF-8 (plus a space, since
the toc.dat is a binary file).Even though it worked for us, I wonder if there is any other way to
accomplish the same result, at least to specify the encoding for the
restore.
Yes, its actually quite esay: you dump as you feel apropriate,
then create the database with the encoding you want,
restore w/o creating database and you are done.
Restore sets the client encoding to what it actually was
in the dump data (in your case latin-1) and the database
would be utf-8 - postgres automatically recodes. No need
for iconv and friends.
Regards
Tino
On 7/6/06, Tino Wildenhain <tino@wildenhain.de> wrote:
Marco Bizzarri schrieb:
Hi all.
Here is my use case: I've an application which uses PostgreSQL as
backend. Up to now, the database was encoded in SQL_ASCII or LATIN1.
Now, we need to migrate to UTF-8.What we tried, was to:
1) dump the database using pg_dump, in tar format (we had blob);
2) modifying the result, using some conversion tool (like recode)3) destroying the old database
4) recreating the database with UNICODE setting
5) restoring the database using pg_restoreThe result was not what I expected. The pg_restore was using the
LATIN1 encoding to encode the strings, resulting in a LATIN1 encoded
in UTF-8...The problem lied in the toc.dat file, which stated that the client
encoding was LATIN1, instead of UTF-8.The solution in the end has been to manually modifying the toc.dat
file, substituting the LATIN1 string with UTF-8 (plus a space, since
the toc.dat is a binary file).Even though it worked for us, I wonder if there is any other way to
accomplish the same result, at least to specify the encoding for the
restore.Yes, its actually quite esay: you dump as you feel apropriate,
then create the database with the encoding you want,
restore w/o creating database and you are done.
Restore sets the client encoding to what it actually was
in the dump data (in your case latin-1) and the database
would be utf-8 - postgres automatically recodes. No need
for iconv and friends.Regards
Tino
First of all, thank you for your answer. However, I suspect I did not
understand your answer, since the commands I used were:
1) pg_dump -Ft -b -f dump.sql.tar database
2) dropdb database
3) createdb -E UNICODE database
4) pg_restore -d database dump.sql.tar
According to my experience, this produces a "double encoding". As you
can see, I hand-created the database, with the proper encoding.
However, when I reimported the database, the result was a latin1
encoded in utf-8, rather than a pure utf-8.
How my procedure was different with respect to yours?
I will make some test with a sample database, and enabling the
logging, so that I can understand the commands which are issued.
Regards
Marco
--
Marco Bizzarri
http://notenotturne.blogspot.com/
...
Yes, its actually quite esay: you dump as you feel apropriate,
then create the database with the encoding you want,
restore w/o creating database and you are done.
Restore sets the client encoding to what it actually was
in the dump data (in your case latin-1) and the database
would be utf-8 - postgres automatically recodes. No need
for iconv and friends.Regards
TinoFirst of all, thank you for your answer. However, I suspect I did not
understand your answer, since the commands I used were:1) pg_dump -Ft -b -f dump.sql.tar database
2) dropdb database
3) createdb -E UNICODE database
4) pg_restore -d database dump.sql.tarAccording to my experience, this produces a "double encoding". As you
can see, I hand-created the database, with the proper encoding.
However, when I reimported the database, the result was a latin1
encoded in utf-8, rather than a pure utf-8.How my procedure was different with respect to yours?
That was the correct way. I wonder if you have recoding support
enabled? Did you build postgres yourself?
Latin-1 double encoded into utf-8 seems not like possible...
utf-8 barfs on most latin-1 characters, current 8.1 is very
picky about it. So maybe you can work with a small
test table to find out what's going wrong here.
(The changing of the client_enccoding setting in the backup is only
needed in the case when you had data in the wrong encoding
- like SQLAscii filled with latin-1 or something)
Regards
Tino
On 7/6/06, Tino Wildenhain <tino@wildenhain.de> wrote:
...
Yes, its actually quite esay: you dump as you feel apropriate,
then create the database with the encoding you want,
restore w/o creating database and you are done.
Restore sets the client encoding to what it actually was
in the dump data (in your case latin-1) and the database
would be utf-8 - postgres automatically recodes. No need
for iconv and friends.Regards
TinoFirst of all, thank you for your answer. However, I suspect I did not
understand your answer, since the commands I used were:1) pg_dump -Ft -b -f dump.sql.tar database
2) dropdb database
3) createdb -E UNICODE database
4) pg_restore -d database dump.sql.tarAccording to my experience, this produces a "double encoding". As you
can see, I hand-created the database, with the proper encoding.
However, when I reimported the database, the result was a latin1
encoded in utf-8, rather than a pure utf-8.How my procedure was different with respect to yours?
That was the correct way. I wonder if you have recoding support
enabled? Did you build postgres yourself?
Support for recoding? I don't know... I compiled myself postgres,
which is, BTW, 7.4.8. How can I check if auto recoding is enabled?
Latin-1 double encoded into utf-8 seems not like possible...
utf-8 barfs on most latin-1 characters, current 8.1 is very
picky about it. So maybe you can work with a small
test table to find out what's going wrong here.
Yes, I will do... I understand postgresql in later release became much
more "picky" about encoding.
(The changing of the client_enccoding setting in the backup is only
needed in the case when you had data in the wrong encoding
- like SQLAscii filled with latin-1 or something)
Ok, thanks!
Regards
Marco
--
Marco Bizzarri
http://notenotturne.blogspot.com/