invalid UTF-8 byte sequence detected

Started by Markus Wollnyover 20 years ago3 messagesgeneral
Jump to latest
#1Markus Wollny
Markus.Wollny@computec.de

Hello!

I am currently testdriving migration of our PostgreSQL 8.0 databases to 8.1; in this process I have stumbled a couple of times over certain errors in text-fields that lead to error-messages during import of the dump like these:

<2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>ERROR: invalid UTF-8 byte sequence detected near byte 0xb4
<2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>CONTEXT: COPY board_message, line 1125662, column text: "HI

Besteht ein gewisser Nachteil, wenn ich nur eins von den beiden kaufe, da in beiden Addon▒s viel..."
<2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>STATEMENT: COPY board_message (board_id, thread_id, father_id, message_id, user_id, title, signat
ure, follow_up, count_reply, last_reply, created, article_id, logged_ip, state_id, user_login, user_status, user_rank, user_rank_description, user_rank_picture, user_rights, text, deleted_user_id, deleted_date, deleted_login, user_created, poll_id, idxfti) FROM stdin;

<2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>ERROR: invalid UTF-8 byte sequence detected near byte 0x98
<2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>CONTEXT: COPY kidszone_tournament2005_user, line 427, column phone: "02302▒74"
<2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>STATEMENT: COPY kidszone_tournament2005_user (id, first_name, last_name, adress, birthday, phone,
email, place, permission, ude, ude_box, invited) FROM stdin;

There are not too many occurrences of the same type - five altogether in a 1.8GB compressed dumpfile, but still it has me worried and leaves me with some questions:

1.) How could I have prevented insertion of these invalid byte-sequences in the first place? We're using UTF-8 encoded databases, data is mostly inserted by users via browser applications, our websites are UTF-8 encoded, too, but still we cannot really make 100% sure that all clients behave as expected; on the other hand, it would be extremely inconvenient if we had to check each and every text input for character set conformance in the application, so is there a way to ascertain "sane" data via some database-setting? pg_restore does throw this error and indeed terminates after that (I used custom dump format for pg_dump), psql on the other hand just continues with the import (using a pgdumpall-output that generates a standard SQL-script), although it too throws the error.

2.) How does this really affect the value of the database-dumps? psql continues with import after the error, but the table where this error occurred remains empty, as the affected COPY-statement has failed altogether due to this error. So a plain no-worries import in my case would present me a result with five tables empty - one of them quite large... Is there some kind of magic, maybe involving some perl or whatever, that could help to clean up the dump before the import, so I can accomplish a full restore?

Kind regards,

Markus

#2Bruce Momjian
bruce@momjian.us
In reply to: Markus Wollny (#1)
Re: invalid UTF-8 byte sequence detected

I am also confused how invalid UTF8 sequences got into your database.
It shouldn't have been possible.

---------------------------------------------------------------------------

Markus Wollny wrote:

Hello!

I am currently testdriving migration of our PostgreSQL 8.0 databases to 8.1; in this process I have stumbled a couple of times over certain errors in text-fields that lead to error-messages during import of the dump like these:

<2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>ERROR: invalid UTF-8 byte sequence detected near byte 0xb4
<2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>CONTEXT: COPY board_message, line 1125662, column text: "HI

Besteht ein gewisser Nachteil, wenn ich nur eins von den beiden kaufe, da in beiden Addon?s viel..."
<2005-11-09 14:57:34 CET - 9354: [local]@community_unicode>STATEMENT: COPY board_message (board_id, thread_id, father_id, message_id, user_id, title, signat
ure, follow_up, count_reply, last_reply, created, article_id, logged_ip, state_id, user_login, user_status, user_rank, user_rank_description, user_rank_picture, user_rights, text, deleted_user_id, deleted_date, deleted_login, user_created, poll_id, idxfti) FROM stdin;

<2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>ERROR: invalid UTF-8 byte sequence detected near byte 0x98
<2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>CONTEXT: COPY kidszone_tournament2005_user, line 427, column phone: "02302?74"
<2005-11-09 14:57:49 CET - 9354: [local]@community_unicode>STATEMENT: COPY kidszone_tournament2005_user (id, first_name, last_name, adress, birthday, phone,
email, place, permission, ude, ude_box, invited) FROM stdin;

There are not too many occurrences of the same type - five altogether in a 1.8GB compressed dumpfile, but still it has me worried and leaves me with some questions:

1.) How could I have prevented insertion of these invalid byte-sequences in the first place? We're using UTF-8 encoded databases, data is mostly inserted by users via browser applications, our websites are UTF-8 encoded, too, but still we cannot really make 100% sure that all clients behave as expected; on the other hand, it would be extremely inconvenient if we had to check each and every text input for character set conformance in the application, so is there a way to ascertain "sane" data via some database-setting? pg_restore does throw this error and indeed terminates after that (I used custom dump format for pg_dump), psql on the other hand just continues with the import (using a pgdumpall-output that generates a standard SQL-script), although it too throws the error.

2.) How does this really affect the value of the database-dumps? psql continues with import after the error, but the table where this error occurred remains empty, as the affected COPY-statement has failed altogether due to this error. So a plain no-worries import in my case would present me a result with five tables empty - one of them quite large... Is there some kind of magic, maybe involving some perl or whatever, that could help to clean up the dump before the import, so I can accomplish a full restore?

Kind regards,

Markus

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#3Markus Wollny
Markus.Wollny@computec.de
In reply to: Bruce Momjian (#2)
Re: invalid UTF-8 byte sequence detected

Hi!

-----Ursprüngliche Nachricht-----
Von: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
Gesendet: Dienstag, 15. November 2005 19:46
An: Markus Wollny
Cc: pgsql-general@postgresql.org
Betreff: Re: [GENERAL] invalid UTF-8 byte sequence detected

I am also confused how invalid UTF8 sequences got into your database.
It shouldn't have been possible.

Our databases were originally encoded in ASCII, though they should have been LATIN9 - or UNICODE; this has been remedied a long time ago using iconv on the dumps; our PostgreSQL-version then was 7.4 and we converted databases and dumps to UTF-8. Maybe the invalid byte sequences have been entered sometimes during our migration from ODBC to JDBC while our encoding was still a mess - though I would have thought that this should have been resolved by dump&iconv&restore then. However, I do suspect that the cause of the issue was really more or less a bug in PostgreSQL <8.1, which accepted certain illegal byte sequences. I our case, I found that the re-import of the dump errored out on ISO-8859-1 encoded backticks (´) - certain mournfully misled individuals do use this "degu"-character instead of the apostrophe even tough it's more difficult to type on a german keyboard layout. And quite wrong, too.

Anyway, I found some reference in the hackers-list that encoding-consistency for Unicode has been tightened down (see http://archives.postgresql.org/pgsql-hackers/2005-10/msg00972.php ). Both a solution and a suggestion have been posted in this thread; Christopher Kings-Lynne has suggested to include a reference to this issue in the 'upgrading/back compatibiliy' section for these release notes - I do strongly second his suggestion :)

The suggested solution was to feed the plain dump again through iconv; however at least on my systems (Debian Sarge) iconv didn't like my >5GB sized dump files. So in order to successfully reimport the dumps, I had to "split --line-bytes=650m" the SQL-file, pass the parts through iconv -c -f UTF8 -t UTF8 and concatenate them back into one file again. There were no more errors on feeding the dump back into psql and I didn't come across any missing data during my tests, so this has definitely done the trick for me.

As 8.1 has tightened down encoding-consistency for Unicode, I believe that the databases should be safe from any illegal byte-sequences in text-fields from now on.

Kind regards

Markus