utf8 issue

Started by Tom Hartabout 18 years ago5 messagesgeneral

tomhart@coopfed.org

about 18 years ago

Hello everybody. I recently converted my db from ASCII encoding to UTF8
(we have a lot of spanish-speaking members, and need the extra character
support). Everything was working great, but I noticed this error, while
trying to COPY one of our tables from a csv.

ERROR: invalid byte sequence for encoding "UTF8": 0xb9
SQL state: 22021
Hint: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
Context: COPY transaction_import, line 59358

Is there anything I can do to look for these sorts of errors and fix
them? I'm really not great with character encodings and I'm not sure
where to go on this. Any help?

--
Tom Hart
IT Specialist
Cooperative Federal
723 Westcott St.
Syracuse, NY 13210
(315) 471-1116 ext. 202
(315) 476-0567 (fax)

Richard Huxton

dev@archonet.com

about 18 years ago

In reply to: Tom Hart (#1)

Re: utf8 issue

Tom Hart wrote:

Hello everybody. I recently converted my db from ASCII encoding to UTF8
(we have a lot of spanish-speaking members, and need the extra character
support). Everything was working great, but I noticed this error, while
trying to COPY one of our tables from a csv.

ERROR: invalid byte sequence for encoding "UTF8": 0xb9

Not converted, I fear.

Why not use latin9 rather than utf8 - that should cover everything and
is probably what it's in anyway.

Is there anything I can do to look for these sorts of errors and fix
them? I'm really not great with character encodings and I'm not sure
where to go on this. Any help?

Google a bit for iconv (and postgresql) to get your characters valid.

--
Richard Huxton
Archonet Ltd

Alvaro Herrera

alvherre@2ndquadrant.com

about 18 years ago

In reply to: Tom Hart (#1)

Re: utf8 issue

Tom Hart wrote:

Hello everybody. I recently converted my db from ASCII encoding to UTF8
(we have a lot of spanish-speaking members, and need the extra character
support). Everything was working great, but I noticed this error, while
trying to COPY one of our tables from a csv.

ERROR: invalid byte sequence for encoding "UTF8": 0xb9
SQL state: 22021
Hint: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by
"client_encoding".
Context: COPY transaction_import, line 59358

Did you try setting client_encoding beforehand? I think a simple

PGOPTIONS="client_encoding=latin9" pg_restore ...

should suffice.

Now, if you have mixed UTF8 and Latin1/Latin9 data, you are hosed
(meaning you'll have to wade through the stuff and figure out what data
is in what encoding, and fix it).

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Tom Hart

tomhart@coopfed.org

about 18 years ago

In reply to: Richard Huxton (#2)

Re: utf8 issue

Richard Huxton wrote:

Tom Hart wrote:

Hello everybody. I recently converted my db from ASCII encoding to
UTF8 (we have a lot of spanish-speaking members, and need the extra
character support). Everything was working great, but I noticed this
error, while trying to COPY one of our tables from a csv.

ERROR: invalid byte sequence for encoding "UTF8": 0xb9

Not converted, I fear.

Why not use latin9 rather than utf8 - that should cover everything and
is probably what it's in anyway.

I was specifically requested to use UTF-8, and the work's already been
done, so converting again is not my favorite option.

Is there anything I can do to look for these sorts of errors and fix
them? I'm really not great with character encodings and I'm not sure
where to go on this. Any help?

Google a bit for iconv (and postgresql) to get your characters valid.

I already have a php script that does some data scrubbing before the
copy. I added this line to the script and things seem to be working
better now

$line = iconv("ISO-8859-1", "UTF-8", $line);

Thanks for the help guys :-)

--
Tom Hart
IT Specialist
Cooperative Federal
723 Westcott St.
Syracuse, NY 13210
(315) 471-1116 ext. 202
(315) 476-0567 (fax)

Dean Gibson (DB Administrator)

postgresql@ultimeth.com

about 18 years ago

In reply to: Tom Hart (#4)

Re: utf8 issue

On 2008-02-26 13:04, Tom Hart wrote:

I already have a php script that does some data scrubbing before the
copy. I added this line to the script and things seem to be working
better now

$line = iconv("ISO-8859-1", "UTF-8", $line);

Thanks for the help guys :-)

Read up on the difference between PostgreSQL's server_encoding and
client_encoding.

The "server_encoding" is how the data is stored in the server, and can
be anything compatible (UTF-8, ISO-8859-1, whatever will hold your
character set).

The "client_encoding" is how the incoming (or outgoing) data is
treated/assumed. PostgreSQL does the necessary conversion for you.

You can set/change the "client_encoding" in so many ways, it gives you
total flexibility, in order of increasing priority:

1. You can set it as the default for any database (see ALTER DATABASE ...).
2. You can set it in an environment variable, which means the client
utilities (and I believe the libraries) use that.
3. In PSQL, you can set it with the "\encoding" statement (which applies
to the session or until changed), or the "SET [SESSION | LOCAL ]
client_encoding TO ...", which will set it for the session or just the
current transaction.

I just went through this, and while I initially used "iconv" to get up
and running, I've removed most of those in my scripts and just use the
PostgreSQL conversion instead.

--
Mail to my list address MUST be sent via the mailing list.
All other mail to my list address will bounce.