UTF8 frustrations

Started by Nonameover 18 years ago6 messagesgeneral
Jump to latest
#1Noname
jesse.waters@gmail.com

Trying to do pg_restore from one UTF8 encoded db to another UTF8 encoded db
DB_source:
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
psql 8.2.4

DB Destination:
Debian GNU/Linux 4.0
psql 8.1.9

I've tried:
pg_dump from the source box & from destination box

from destination server
pg_dump -i -h source_server db > db.8.1.9.sql

No matter which way I issue pg_dump command it always fails on the same record.

ERROR:
pg_restore: ERROR: invalid byte sequence for encoding "UTF8": 0xdf69
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT: COPY logs, line 69238382
pg_restore: [archiver (db)] error returned by PQendcopy: ERROR:
invalid byte sequence for encoding "UTF8": 0xdf69
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT: COPY logs, line 69238382

UTF8 to UTF8 and everything in between is UTF8, and it still fails,
I'm tempted to delete this record and keep going but I'm determined
to find a solution.

TIA for any help,

Jesse Waters

#2Michael Glaesemann
grzm@seespotcode.net
In reply to: Noname (#1)
Re: UTF8 frustrations

On Sep 4, 2007, at 14:48 , jesse.waters@gmail.com wrote:

Trying to do pg_restore from one UTF8 encoded db to another UTF8
encoded db
DB_source:
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
psql 8.2.4

DB Destination:
Debian GNU/Linux 4.0
psql 8.1.9

UTF8 to UTF8 and everything in between is UTF8, and it still fails,
I'm tempted to delete this record and keep going but I'm determined
to find a solution.

(1) PostgreSQL does not support dump and restore from a new version
of PostgreSQL to an older version.

(2) There were changes to UTF8 handling between 8.1 and 8.2. I'm a
bit surprised you get an error like this as 8.1 is more lax than 8.2:
it accepts some invalid sequences as valid.

If you're really determined to do this, I recommend editing the dump
file by hand to remove or edit this record so it does not contain the
invalid sequence (as you've suggested).

Michael Glaesemann
grzm seespotcode net

#3Scott Marlowe
scott.marlowe@gmail.com
In reply to: Noname (#1)
Re: UTF8 frustrations

On 9/4/07, jesse.waters@gmail.com <jesse.waters@gmail.com> wrote:

Trying to do pg_restore from one UTF8 encoded db to another UTF8 encoded db
DB_source:
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
psql 8.2.4

DB Destination:
Debian GNU/Linux 4.0
psql 8.1.9

I've tried:
pg_dump from the source box & from destination box

from destination server
pg_dump -i -h source_server db > db.8.1.9.sql

Are you issuing the dump and psql / restore command on the same machine?
As previously mentioned, dumping from newer to older is not supported.
pg_dump from 8.1 might not understand the data structures it finds in
an 8.2 db, and pg_dump from 8.2 might create a dump that 8.1 doesn't
support. Note that going the other way you are encouraged to use
pg_dump from 8.2 to dump the 8.1 database for importing to 8.2.

However, In this case I'm guessing that the problem is that you've got
different client encodings on each end. i.e. you're dumping with one
encoding setting and restoring with another. Note that pgsql
autoconverts from the server's encoding to the client's encoding at
the request of the client. so, you'll need to check your client
encoding from psql on the source and target machines to see if they
match.

show client_encoding ;

will tell you what your client encoding is.

#4Noname
jesse.waters@gmail.com
In reply to: Michael Glaesemann (#2)
Re: UTF8 frustrations

Thanks all for the quick replies.

Here is the latest issue, to verify that the pg_dump works, I'm going
to do dump and restore on the same host/cluster.

Source:
DB_source:
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
psql 8.2.4
Destination:
same machine different db name

echo $LANG
en_US.UTF-8

SET client_encoding = 'UTF8';

Command used:
pg_dump -Fc srcdb > db.dump
pg_restore -d devdb db.dump

Results, same error. Now I'm really concerned.

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 1625; 0 16680 TABLE
DATA logs watersj
pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence
for encoding "UTF8": 0xdf69
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding
CONTEXT: COPY logs, line 69238382
WARNING: errors ignored on restore: 1

Jesse Waters

#5Scott Marlowe
scott.marlowe@gmail.com
In reply to: Noname (#4)
Re: UTF8 frustrations

On 9/5/07, jesse.waters@gmail.com <jesse.waters@gmail.com> wrote:

Thanks all for the quick replies.

SET client_encoding = 'UTF8';

Results, same error. Now I'm really concerned.

How / where are you setting the client_encoding? I'm wonder if it's
really set when the pg_dump / pg_restore commands are being issued, or
if you're inheriting a different encoding along with the new
connection it gets.

You can try:

alter user set client_encoding='UTF8'

for the postgres user to see if that helps.

#6Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Noname (#1)
Re: UTF8 frustrations

jesse.waters@gmail.com wrote:

Here is the latest issue, to verify that the pg_dump works, I'm going
to do dump and restore on the same host/cluster.

Source:
DB_source:
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
psql 8.2.4
Destination:
same machine different db name

echo $LANG
en_US.UTF-8

SET client_encoding = 'UTF8';

Command used:
pg_dump -Fc srcdb > db.dump
pg_restore -d devdb db.dump

Results, same error. Now I'm really concerned.

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 1625; 0 16680 TABLE

DATA logs watersj

pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence

for encoding "UTF8": 0xdf69

HINT: This error can also happen if the byte sequence does not match

the encoding expected by the server, which is controlled by
"client_encoding

CONTEXT: COPY logs, line 69238382
WARNING: errors ignored on restore: 1

I can recreate this behaviour with 8.2.4 (UTF-8).

psql> CREATE TABLE test (id serial PRIMARY KEY, val text);

psql> INSERT INTO test (val) VALUES (E'\xdf\x69');

psql> \q

$ pg_dump -F c -f x.dmp -t test testdb

$ pg_restore -c -d testdb x.dmp

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 1840; 0 45883 TABLE
DATA test laurenz
pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence
for encoding "UTF8": 0xdf69
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT: COPY test, line 1
WARNING: errors ignored on restore: 1

The problem is that in (at least) one record in your table
watersj.logs, there is a corrupt string.

Unfortunately (as demonstrated above) it is possible to enter
corrupt data into a PostgreSQL database, this is what must have
happened in your case.

I suggest that you identify and correct this string in the original
database, then everything should work fine.

You can extract the offending row from the dump, that should
help to identify it. 69238382 rows is a little unwieldy, but
tools like awk can help:

pg_restore db.dump | awk '/^COPY logs /,/^\\\.$/ { if (lineno==69238382)
print $0; ++lineno }'

I think there is the desire to fix problems like this in 8.3,
but I don't think that's done yet.

Yours,
Laurenz Albe