Migration error " invalid byte sequence for encoding "UTF8": 0xff " from mysql 5.5 to postgresql 9.1
When I do migration from Mysql to PostgreSQL:
firstly dump data from mysql in cmd(encoding is GBK) is WIN8:
mysqldump -v -nt --complete-insert=TRUE --compatible=postgresql
--default-character-set=utf8 --skip-add-locks --compact --no-create-info
--skip-quote-names -uroot -p test >dbdata.sql
then load data to postgresql in cmd(encoding is GBK) is WIN8:
psql -h localhost -d test -U postgres < dbdata.sql
I got the error:
ERROR: invalid byte sequence for encoding "UTF8": 0xff
I checked in dbdata.sql using UltraEdit, data "0xff" really exists as
followings:
蛾4?4抿\0xfF???±??x¤?
and I use UltraEdit menu-> view -> encoding page to get: 936 GBK.
DDL in Mysql 5.5 is:
CREATE TABLE `personpicture` (
`ID` char(32) NOT NULL,
`Picture` mediumblob,
...
KEY `personId` (`PersonID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
DDL in PostgreSQL 9.1 is:
create database test encoding 'utf8';
CREATE TABLE personpicture (
ID char(32) NOT NULL,
Picture BYTEA,
....
PRIMARY KEY (ID)
);
The error is related to Picture mediumblob and BYTEA, which is stored pic
binary data.
My Operationg system is WIN8 64bit.
I've set postgresql.conf before starting everything, but error still occurs:
backslash_quote = on
escape_string_warning = off
standard_conforming_strings = off
How to resolve it?
Thanks!
peng
sunpeng <bluevaley@gmail.com> wrote:
load data to postgresql in cmd(encoding is GBK) is WIN8:
psql -h localhost -d test -U postgres < dbdata.sql
I got the error:
ERROR: invalid byte sequence for encoding "UTF8": 0xff
If the encoding is GBK then you will get errors (or incorrect
characters) if it is read as UTF8. Try setting the environment
variable PGCLIENTENCODING.
http://www.postgresql.org/docs/9.1/static/app-psql.html
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
I‘v changed cmd (in win8) to encoding utf8 through chcp 65001, but error
still occurs.
And i use the following cmd to dump mysql data:
mysql> select Picture from personpicture where id =
'F2931306D1EE44ca82394CD3BC2404D4' into outfile "d:\\1.txt" ;
I got the ansi file, and use Ultraedit to see first 16 bytes:
FF D8 FF E0 5C 30 10 4A 46 49 46 5C 30 01 01 5C
It's different from mysql workbench to see:
FF D8 FF E0 00 10 4a 46 49 46 00 01 01 00 00 01
peng
On Tue, Jul 1, 2014 at 9:18 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Show quoted text
sunpeng <bluevaley@gmail.com> wrote:
load data to postgresql in cmd(encoding is GBK) is WIN8:
psql -h localhost -d test -U postgres < dbdata.sql
I got the error:
ERROR: invalid byte sequence for encoding "UTF8": 0xffIf the encoding is GBK then you will get errors (or incorrect
characters) if it is read as UTF8. Try setting the environment
variable PGCLIENTENCODING.http://www.postgresql.org/docs/9.1/static/app-psql.html
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
sunpeng wrote:
load data to postgresql in cmd(encoding is GBK) is WIN8:
psql -h localhost -d test -U postgres < dbdata.sql
I got the error:
ERROR: invalid byte sequence for encoding "UTF8": 0xff
If the encoding is GBK then you will get errors (or incorrect
characters) if it is read as UTF8. Try setting the environment
variable PGCLIENTENCODING.
I‘v changed cmd (in win8) to encoding utf8 through chcp 65001, but error still occurs.
And i use the following cmd to dump mysql data:
mysql> select Picture from personpicture where id = 'F2931306D1EE44ca82394CD3BC2404D4' into outfile
"d:\\1.txt" ;
I got the ansi file, and use Ultraedit to see first 16 bytes:
FF D8 FF E0 5C 30 10 4A 46 49 46 5C 30 01 01 5CIt's different from mysql workbench to see:
FF D8 FF E0 00 10 4a 46 49 46 00 01 01 00 00 01
Changing the terminal code page won't do anything, it's probably the data
that are in a different encoding.
I don't know enough about MySQL to know which encoding it uses when dumping data,
but the man page of "mysqldump" tells me:
--set-charset
Add SET NAMES default_character_set to the output. This option is enabled by default.
So is there a SET NAMES command in the dump? If yes, what is the argument?
You will have to tell PostgreSQL the encoding of the data.
As Kevin pointed out, you can do that by setting the environment variable
PGCLIENT ENCODING to the correct value. Then PostgreSQL will convert the
data automatically.
Yours,
Laurenz Albe
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Thank you, friend, I use --hex-blob :
mysqldump -v -nt --complete-insert=TRUE --compatible=postgresql
--default-character-set=utf8 --skip-add-locks --compact --no-create-info
--skip-quote-names --hex-blob -uroot -p test videorecresult >dbdata.sql
to dump mysql data.
And replace blob data "0x...." into "E'\\xx....'" to load data into
postgresql.
On Fri, Jul 4, 2014 at 3:27 PM, Albe Laurenz <laurenz.albe@wien.gv.at>
wrote:
Show quoted text
sunpeng wrote:
load data to postgresql in cmd(encoding is GBK) is WIN8:
psql -h localhost -d test -U postgres < dbdata.sql
I got the error:
ERROR: invalid byte sequence for encoding "UTF8": 0xffIf the encoding is GBK then you will get errors (or incorrect
characters) if it is read as UTF8. Try setting the environment
variable PGCLIENTENCODING.I‘v changed cmd (in win8) to encoding utf8 through chcp 65001, but error
still occurs.
And i use the following cmd to dump mysql data:
mysql> select Picture from personpicture where id ='F2931306D1EE44ca82394CD3BC2404D4' into outfile
"d:\\1.txt" ;
I got the ansi file, and use Ultraedit to see first 16 bytes:
FF D8 FF E0 5C 30 10 4A 46 49 46 5C 30 01 01 5CIt's different from mysql workbench to see:
FF D8 FF E0 00 10 4a 46 49 46 00 01 01 00 00 01Changing the terminal code page won't do anything, it's probably the data
that are in a different encoding.I don't know enough about MySQL to know which encoding it uses when
dumping data,
but the man page of "mysqldump" tells me:--set-charset
Add SET NAMES default_character_set to the output. This option is
enabled by default.So is there a SET NAMES command in the dump? If yes, what is the argument?
You will have to tell PostgreSQL the encoding of the data.
As Kevin pointed out, you can do that by setting the environment variable
PGCLIENT ENCODING to the correct value. Then PostgreSQL will convert the
data automatically.Yours,
Laurenz Albe
On 7/4/2014 2:12 AM, sunpeng wrote:
Thank you, friend, I use --hex-blob :
mysqldump -v -nt --complete-insert=TRUE --compatible=postgresql
--default-character-set=utf8 --skip-add-locks --compact
--no-create-info --skip-quote-names --hex-blob -uroot -p test
videorecresult >dbdata.sql
to dump mysql data.
And replace blob data "0x...." into "E'\\xx....'" to load data into
postgresql.
regardless of all that, 0xFF is not a valid UTF8 character code. perhaps
you should store the data in a postgres BYTEA, or at least use character
encoding SQLASCII (which is to say, no encoding, bytes in == bytes out.)
--
john r pierce 37N 122W
somewhere on the middle of the left coast
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
John R Pierce wrote:
On 7/4/2014 2:12 AM, sunpeng wrote:
Thank you, friend, I use --hex-blob :
mysqldump -v -nt --complete-insert=TRUE --compatible=postgresql
--default-character-set=utf8 --skip-add-locks --compact
--no-create-info --skip-quote-names --hex-blob -uroot -p test
videorecresult >dbdata.sql
to dump mysql data.
And replace blob data "0x...." into "E'\\xx....'" to load data into
postgresql.
regardless of all that, 0xFF is not a valid UTF8 character code. perhaps
you should store the data in a postgres BYTEA, or at least use character
encoding SQLASCII (which is to say, no encoding, bytes in == bytes out.)
Exactly.
According to mysqldump's man page, the affected field must be
BINARY, VARBINARY, the BLOB types or BIT.
For these PostgreSQL's "bytea" would definitely be the correct data type,
and there wouldn't be any encoding problems with that.
Yours,
Laurenz Albe
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general