invalid byte sequence for encoding error

Started by Oliver A. Rojoalmost 20 years ago8 messagesgeneral
Jump to latest
#1Oliver A. Rojo
oliverrojo@anticogroup.com

Hi!

I've just recently upgraded my database from 7.4.8 to 8.0.1. Im dumping
data i got from my old db to my new db but eventually an error occured

ERROR: invalid byte sequence for encoding "UNICODE": 0xd141

I tried setting the client encoding to UNICODE but to no avail.

Please help! Thanks!

--

Oliver A. Rojo

______________________________________________________________

This email and any files transmitted with it are confidential
and intended solely for the use of the individual or entity to
whom they are addressed. If you have received this email in error
please notify the system manager. Please note that any views or
opinions presented in this email are solely those of the author
and do not necessarily represent those of the company. Finally,
the recipient should check this email and any attachments for the
presence of viruses. The company accepts no liability for any
damage caused by any virus transmitted by this email.

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Oliver A. Rojo (#1)
Re: invalid byte sequence for encoding error

"Oliver A. Rojo" <oliverrojo@anticogroup.com> writes:

I've just recently upgraded my database from 7.4.8 to 8.0.1. Im dumping
data i got from my old db to my new db but eventually an error occured

ERROR: invalid byte sequence for encoding "UNICODE": 0xd141

That's definitely not valid UNICODE (UTF-8) data. You need to fix up
your data. My guess is that what you have stored isn't really UTF-8
but something else (maybe one of the LATINn family). If so, adjusting
the "SET client_encoding" line in the dump file and trying again should
help.

PG is certainly moving in the direction of more rather than less
encoding checking, so this is something you gotta deal with.

regards, tom lane

#3Oliver A. Rojo
oliverrojo@anticogroup.com
In reply to: Tom Lane (#2)
Re: invalid byte sequence for encoding error

Tom Lane wrote:

"Oliver A. Rojo" <oliverrojo@anticogroup.com> writes:

I've just recently upgraded my database from 7.4.8 to 8.0.1. Im dumping
data i got from my old db to my new db but eventually an error occured

ERROR: invalid byte sequence for encoding "UNICODE": 0xd141

That's definitely not valid UNICODE (UTF-8) data. You need to fix up
your data. My guess is that what you have stored isn't really UTF-8
but something else (maybe one of the LATINn family). If so, adjusting
the "SET client_encoding" line in the dump file and trying again should
help.

PG is certainly moving in the direction of more rather than less
encoding checking, so this is something you gotta deal with.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

I already set encoding to unicode on my dump data but still not working.
same error encountered.

--

Oliver A. Rojo

______________________________________________________________

This email and any files transmitted with it are confidential
and intended solely for the use of the individual or entity to
whom they are addressed. If you have received this email in error
please notify the system manager. Please note that any views or
opinions presented in this email are solely those of the author
and do not necessarily represent those of the company. Finally,
the recipient should check this email and any attachments for the
presence of viruses. The company accepts no liability for any
damage caused by any virus transmitted by this email.

#4Nis Jørgensen
nis@superlativ.dk
In reply to: Oliver A. Rojo (#1)
Re: invalid byte sequence for encoding error

Oliver A. Rojo wrote:

Hi!

I've just recently upgraded my database from 7.4.8 to 8.0.1. Im dumping
data i got from my old db to my new db but eventually an error occured

ERROR: invalid byte sequence for encoding "UNICODE": 0xd141

I tried setting the client encoding to UNICODE but to no avail.

I had exactly the same problem (7.4.7 -> 8.1.3). Apparently older pg
versions allowed for storing invalid utf-8 in text fields (ie the engine
did not check the bytes going in).

I fixed it by fixing the original db and dumping again. If this is not
desirable you can of course restore to a clean 7.4.8, fix, dump.

/Nis

#5Oliver A. Rojo
oliverrojo@anticogroup.com
In reply to: Nis Jørgensen (#4)
Re: invalid byte sequence for encoding error

Nis Jorgensen wrote:

Oliver A. Rojo wrote:

Hi!

I've just recently upgraded my database from 7.4.8 to 8.0.1. Im
dumping data i got from my old db to my new db but eventually an
error occured

ERROR: invalid byte sequence for encoding "UNICODE": 0xd141

I tried setting the client encoding to UNICODE but to no avail.

I had exactly the same problem (7.4.7 -> 8.1.3). Apparently older pg
versions allowed for storing invalid utf-8 in text fields (ie the
engine did not check the bytes going in).

I fixed it by fixing the original db and dumping again. If this is not
desirable you can of course restore to a clean 7.4.8, fix, dump.

/Nis

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

hi!
how do you fix your original db?

--

Oliver A. Rojo

______________________________________________________________

This email and any files transmitted with it are confidential
and intended solely for the use of the individual or entity to
whom they are addressed. If you have received this email in error
please notify the system manager. Please note that any views or
opinions presented in this email are solely those of the author
and do not necessarily represent those of the company. Finally,
the recipient should check this email and any attachments for the
presence of viruses. The company accepts no liability for any
damage caused by any virus transmitted by this email.

#6Nis Jørgensen
nis@superlativ.dk
In reply to: Oliver A. Rojo (#5)
Re: invalid byte sequence for encoding error

Oliver A. Rojo wrote:

Nis Jorgensen wrote:

Oliver A. Rojo wrote:

Hi!

I've just recently upgraded my database from 7.4.8 to 8.0.1. Im
dumping data i got from my old db to my new db but eventually an
error occured

I fixed it by fixing the original db and dumping again. If this is not
desirable you can of course restore to a clean 7.4.8, fix, dump.

hi!
how do you fix your original db?

Since I had only 3 occurrences of the error, I used hand-crafted update
statements. The fact that the replacement for the invalid characters was
constant and plain ascii made this very easy.

If you have many occurrences of the error, or if you need to do
replacement based on the invalid bytes, things become trickier. You
might even be better of working on the dump file directly using
perl/<favourite scripting language>

/Nis

#7Markus Wollny
Markus.Wollny@computec.de
In reply to: Nis Jørgensen (#6)
Re: invalid byte sequence for encoding error

Nis Jorgensen wrote:

Oliver A. Rojo wrote:

how do you fix your original db?

Since I had only 3 occurrences of the error, I used
hand-crafted update statements. The fact that the replacement
for the invalid characters was constant and plain ascii made
this very easy.

If you have many occurrences of the error, or if you need to
do replacement based on the invalid bytes, things become
trickier. You might even be better of working on the dump
file directly using perl/<favourite scripting language>

I had the exact same problem with my upgrade - and a lot more than just
a couple of occurences. The solution is quite easy however, so if you're
prepared to simply eliminate the offending bytes, you'll find that iconv
will be a very fast solution. However at least on my systems (Debian
Sarge) iconv didn't like my >5GB sized dump files. So in order to
successfully reimport the dumps, I had to "split --line-bytese0m" the
SQL-file, pass the parts through iconv -c -f UTF8 -t UTF8 and
concatenate them back into one file again. There were no more errors on
feeding the dump back into psql and I didn't come across any missing
data during my tests, so this has definitely done the trick for me.

You should be aware that this will simply omit the illegal byte
sequences from the dump. So if you've got some string
"foo[non-UTF-8-bytes]bar", it will be converted to a simple "foobar" in
the result. So if you really need to keep things 100% accurate, you'll
have to actually identify each of these byte-sequences, then find the
corresponding UTF-8 representation and use some search&replace-scripting
on the dump before reloading it.

Kind regards

Markus

#8Oliver A. Rojo
oliverrojo@anticogroup.com
In reply to: Markus Wollny (#7)
Re: invalid byte sequence for encoding error

Markus Wollny wrote:

Nis Jorgensen wrote:

Oliver A. Rojo wrote:

how do you fix your original db?

Since I had only 3 occurrences of the error, I used
hand-crafted update statements. The fact that the replacement
for the invalid characters was constant and plain ascii made
this very easy.

If you have many occurrences of the error, or if you need to
do replacement based on the invalid bytes, things become
trickier. You might even be better of working on the dump
file directly using perl/<favourite scripting language>

I had the exact same problem with my upgrade - and a lot more than just
a couple of occurences. The solution is quite easy however, so if you're
prepared to simply eliminate the offending bytes, you'll find that iconv
will be a very fast solution. However at least on my systems (Debian
Sarge) iconv didn't like my >5GB sized dump files. So in order to
successfully reimport the dumps, I had to "split --line-bytese0m" the
SQL-file, pass the parts through iconv -c -f UTF8 -t UTF8 and
concatenate them back into one file again. There were no more errors on
feeding the dump back into psql and I didn't come across any missing
data during my tests, so this has definitely done the trick for me.

You should be aware that this will simply omit the illegal byte
sequences from the dump. So if you've got some string
"foo[non-UTF-8-bytes]bar", it will be converted to a simple "foobar" in
the result. So if you really need to keep things 100% accurate, you'll
have to actually identify each of these byte-sequences, then find the
corresponding UTF-8 representation and use some search&replace-scripting
on the dump before reloading it.

Kind regards

Markus

ahh! the real problem is this: It usually says "invalid byte sequence
for encoding" error when it encounters � character. What should I
supposed to do here?

--

Oliver A. Rojo

______________________________________________________________

This email and any files transmitted with it are confidential
and intended solely for the use of the individual or entity to
whom they are addressed. If you have received this email in error
please notify the system manager. Please note that any views or
opinions presented in this email are solely those of the author
and do not necessarily represent those of the company. Finally,
the recipient should check this email and any attachments for the
presence of viruses. The company accepts no liability for any
damage caused by any virus transmitted by this email.