Error while loading sql file

Started by Adarsh Sharmaover 14 years ago6 messagesgeneral

adarsh.sharma@orkash.com

over 14 years ago

Dear all,

I am facing a unique issue when I try to load an sql into a postgresql
database :-

ERROR: invalid byte sequence for encoding "UTF8": 0xe32720
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
ERROR: invalid byte sequence for encoding "UTF8": 0xe32720
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
ERROR: current transaction is aborted, commands ignored until end of
transaction block
ERROR: current transaction is aborted, commands ignored until end of
transaction block

I faced an issue some days ago & I solved the issue by the below command :

( http://blog.e-shell.org/134 )

ERROR: invalid byte sequence for encoding "UTF8": 0xf36e2020
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

*cat backup.sql | recode iso-8859-1..u8 > backup.sql*

But this time the byte sequence changes to Japanese , & I fail to solve
the issue. Please let me know how to solve the issue as typing the error
in Google shows only one link:

Thanks

Alban Hertroys

haramrae@gmail.com

over 14 years ago

In reply to: Adarsh Sharma (#1)

Re: Error while loading sql file

On 26 Dec 2011, at 8:22, Adarsh Sharma wrote:

Dear all,

I am facing a unique issue when I try to load an sql into a postgresql database :-

Actually, your issue isn't unique at all. You'll find it reoccurs on this list regularly, although perhaps less frequent these days.

I faced an issue some days ago & I solved the issue by the below command :
cat backup.sql | recode iso-8859-1..u8 > backup.sql

That command assumes that every string in the sql file is encoded as iso-8859-1 (unless it already is unicode).

But this time the byte sequence changes to Japanese , & I fail to solve the issue. Please let me know how to solve the issue as typing the error in Google shows only one link:
( http://blog.e-shell.org/134 )

The above recode command works for the guys in the blog post you linked, as they were converting a database with Spanish data to UTF-8. They knew what encoding they were coming from.

In your case, you have a mixed bag of encodings, going all the way from latin encodings to japanese.

I'm not sure what recode would do to data that's in a different encoding than the specified source encoding - I expect that it will just assume it's in the specified source encoding (it cannot know that this isn't the case for a particular string) and attempt to convert it to UTF-8 _using that encoding_.

Chances are you just converted valid data in a different encoding (than the source encoding you specified) into garbage in UTF-8... I seem to recall that if recode runs into problems recoding a string to UTF-8 it will leave it untouched, but that will NOT happen in all cases. Sometimes it will succeed, even though the result has no meaning to a human.

That's a nasty problem you ran into, I hope the archives provide the wisdom you need.

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.

Adarsh Sharma

adarsh.sharma@orkash.com

over 14 years ago

In reply to: Alban Hertroys (#2)

Re: Error while loading sql file

Thanks for the Explaination,
I find it hard to determine the way to store data in different encodings
to store in postgresql, below is the demo of some data :-

INSERT INTO conceptnet_frame
VALUES(3884,'ja','{1}?{2}???????',16,3,2140,NULL,NULL,NULL);
INSERT INTO conceptnet_frame
VALUES(3885,'ja','{1}?{2}??????????',31,3,2140,NULL,NULL,NULL);
INSERT INTO conceptnet_frame
VALUES(3886,'ja','{1}?{2}??????????????',31,3,2140,NULL,NULL,NULL);
INSERT INTO conceptnet_frame
VALUES(3887,'ja','{1}?{2}??????????',31,3,2140,NULL,NULL,NULL);
INSERT INTO conceptnet_frame
VALUES(3888,'ja','{1}????????{2}???????',29,3,2140,NULL,NULL,NULL);
INSERT INTO conceptnet_frame
VALUES(3889,'ja','{1}??????{2}??????',28,3,2140,NULL,NULL,NULL);
INSERT INTO conceptnet_frame
VALUES(3890,'ja','{1}??{2}???????????',14,3,2140,NULL,NULL,NULL);
INSERT INTO conceptnet_frame
VALUES(3891,'ja','{2}????????{1}??????',9,3,2140,NULL,NULL,NULL);

Below link explains all the things :-
http://www.depesz.com/index.php/2010/03/07/error-invalid-byte-sequence-for-encoding/

Above link shows the above encoding schemes is in utf16 format but
postgresql-8.4 doesn't support it.
Is there any way to store data in different encoding in a utf-8 database.

Happy Holidays!

Alban Hertroys wrote:

Show quoted text

On 26 Dec 2011, at 8:22, Adarsh Sharma wrote:

Dear all,

I am facing a unique issue when I try to load an sql into a postgresql database :-

Actually, your issue isn't unique at all. You'll find it reoccurs on this list regularly, although perhaps less frequent these days.

I faced an issue some days ago & I solved the issue by the below command :
cat backup.sql | recode iso-8859-1..u8 > backup.sql

That command assumes that every string in the sql file is encoded as iso-8859-1 (unless it already is unicode).

But this time the byte sequence changes to Japanese , & I fail to solve the issue. Please let me know how to solve the issue as typing the error in Google shows only one link:
( http://blog.e-shell.org/134 )

The above recode command works for the guys in the blog post you linked, as they were converting a database with Spanish data to UTF-8. They knew what encoding they were coming from.

In your case, you have a mixed bag of encodings, going all the way from latin encodings to japanese.

I'm not sure what recode would do to data that's in a different encoding than the specified source encoding - I expect that it will just assume it's in the specified source encoding (it cannot know that this isn't the case for a particular string) and attempt to convert it to UTF-8 _using that encoding_.

Chances are you just converted valid data in a different encoding (than the source encoding you specified) into garbage in UTF-8... I seem to recall that if recode runs into problems recoding a string to UTF-8 it will leave it untouched, but that will NOT happen in all cases. Sometimes it will succeed, even though the result has no meaning to a human.

That's a nasty problem you ran into, I hope the archives provide the wisdom you need.

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.

John R Pierce

pierce@hogranch.com

over 14 years ago

In reply to: Adarsh Sharma (#3)

Re: Error while loading sql file

On 12/26/11 9:32 PM, Adarsh Sharma wrote:

Above link shows the above encoding schemes is in utf16 format but
postgresql-8.4 doesn't support it.
Is there any way to store data in different encoding in a utf-8 database.

afaik, anything in UTF16 can be represented by UTF8, but the inverse is
not true. you'd want to convert the data to the CLIENT_ENCODING,
whatever thats set to, before you send it to postgres.

--
john r pierce N 37, W 122
santa cruz ca mid-left coast

Bèrto ëd Sèra

berto.d.sera@gmail.com

over 14 years ago

In reply to: Adarsh Sharma (#3)

Re: Error while loading sql file

HI Ardash!

INSERT INTO conceptnet_frame
VALUES(3884,'ja','{1}は{2}を持っている。',16,3,2140,NULL,NULL,NULL);
....

Can you still access the database that produced the dump? If so, you may
want to produce a number of dumps for distinct language values. Japanese,
in particular, is a very complicated issue, as Japanese change alphabet to
produce the same effect Latin writers get by using bold and italic. So text
searches are quite a nightmare, since DOG is written in different alphabets
(Latin included), depending on what the author had in mind, in terms of
underlining this specific word/expression.

If you could manage to insulate single languages things would definitely
get less complex, as you can track down specific problems related to a
single language (or maybe to an external source that inserted weird
encodings). If the original DB is not available you could still filter the
dump file for such values (although this is a very boring experience).

One of the good things with this approach (in case you need to do heavy
text searches on your data) is that (as of PG 9.1) you can model structures
like:

CREATE COLLATION "de_DE.utf8" (
LC_COLLATE = "de_DE.utf8",
LC_CTYPE = "de_DE.utf8"
);
CREATE COLLATION "en_GB.utf8" (
LC_COLLATE = "en_GB.utf8",
LC_CTYPE = "en_GB.utf8"
);
# NOTE!! all these locales must pre-exist on your box, you are simply
importing them into PG with this CREATE phase.

CREATE TABLE conceptnet_frame_root (
id BIGINT NOT NULL PRIMARY KEY,
ISO693_1_code CHAR(2) NOT NULL
);

CREATE TABLE conceptnet_frame_en (
linguistic_content TEXT COLLATE "en_GB.utf8" NOT NULL
) INHERITS (conceptnet_frame_root);

CREATE TABLE conceptnet_frame_de (
linguistic_content TEXT COLLATE "de_DE.utf8" NOT NULL
) INHERITS (conceptnet_frame_root);

Then have a view pick up the entire linguistic bouquet, if needed. Sadly
you cannot leave 'linguistic_content' in the root table, because you won't
be able to use an alter table on it later, to alter the collation at single
inherited table level, so you are responsible of ensuring type consistency
on your own. It does add complication to the model, however, if you are
into serious conflicts among different languages this is the best solution
I could come up with, thus far.

You may want to have a look at:
http://www.postgresql.org/docs/9.1/static/collation.html

BTW, ISO 639-1 is a dangerous standard to use, if you are to make extensive
language coverage, see
http://en.wikipedia.org/wiki/ISO_639

Hope this helps
Bèrto

--
==============================
If Pac-Man had affected us as kids, we'd all be running around in a
darkened room munching pills and listening to repetitive music.

Adrian Klaver

adrian.klaver@aklaver.com

over 14 years ago

In reply to: Adarsh Sharma (#3)

Re: Error while loading sql file

On Monday, December 26, 2011 9:32:41 pm Adarsh Sharma wrote:

Thanks for the Explaination,
I find it hard to determine the way to store data in different encodings
to store in postgresql, below is the demo of some data :-

INSERT INTO conceptnet_frame
VALUES(3884,'ja','{1}?{2}???????',16,3,2140,NULL,NULL,NULL);
INSERT INTO conceptnet_frame
VALUES(3885,'ja','{1}?{2}??????????',31,3,2140,NULL,NULL,NULL);

Below link explains all the things :-
http://www.depesz.com/index.php/2010/03/07/error-invalid-byte-sequence-for-
encoding/

Above link shows the above encoding schemes is in utf16 format but
postgresql-8.4 doesn't support it.
Is there any way to store data in different encoding in a utf-8 database.

Use iconv or recode to convert the UTF16 to UTF8 first.

Happy Holidays!

--
Adrian Klaver
adrian.klaver@gmail.com