v7.4 pg_dump(all) need to encode from SQL_ASCII to UTF8

Started by Ralph Smithabout 18 years ago4 messagesgeneral
Jump to latest
#1Ralph Smith
smithrn@washington.edu

I'm looking at the v7.4 manuals and I don't see how to encode for
importing into a v8 DB using UTF8.

Maybe I'm making this hard on myself?
The old DB is using SQL_ASCII.
We'd like the new one to use UTF8.

As development proceeds, I'm going to have to do this regularly, both
the entire DB and by tables.
If not for the encoding, I've got all that down, even automated.

Thanks for any help!
Ralph

-------------------------------
p.s. Isn't there a 16 bit Unicode for postgreSQL?

smithrn at u dot washington dot edu

#2Dean Gibson (DB Administrator)
postgresql@ultimeth.com
In reply to: Ralph Smith (#1)
Re: v7.4 pg_dump(all) need to encode from SQL_ASCII to UTF8

On 2008-02-22 17:57, Ralph Smith wrote:

I'm looking at the v7.4 manuals and I don't see how to encode for
importing into a v8 DB using UTF8.

Maybe I'm making this hard on myself?
The old DB is using SQL_ASCII.
We'd like the new one to use UTF8.

As development proceeds, I'm going to have to do this regularly, both
the entire DB and by tables.
If not for the encoding, I've got all that down, even automated.

Thanks for any help!
Ralph

-------------------------------
p.s. Isn't there a 16 bit Unicode for postgreSQL?

smithrn at u dot washington dot edu

There's nothing to do. Dump the database in ASCII; create the new
database with a server_encoding of UTF-8; and import the data (which is
marked as having a client_encoding of SQL_ASCII).

ps: No.

--
Mail to my list address MUST be sent via the mailing list.
All other mail to my list address will bounce.

#3Ralph Smith
smithrn@washington.edu
In reply to: Dean Gibson (DB Administrator) (#2)
Re: v7.4 pg_dump(all) need to encode from SQL_ASCII to UTF8

I'm not sure if you're saying I should ignore these errors...

I'm using dumps from DB airaburst.

postgres=# \l

List of databases
Name | Owner | Encoding
------------+----------+-----------
airburst | root | SQL_ASCII
lt_dev1 | postgres | UTF8
lt_dev2 | postgres | UTF8
lt_reports | postgres | UTF8
postgres | postgres | UTF8
template0 | postgres | UTF8
template1 | postgres | UTF8
(7 rows)

============================================
Looking at the dump of one table there...

--
-- PostgreSQL database dump
--

-- Started on 2008-02-25 12:28:48 PST

SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;

SET search_path = public, pg_catalog;

SET default_tablespace = '';

SET default_with_oids = false;

--
-- TOC entry 1455 (class 1259 OID 20378)
-- Dependencies: 1839 2
-- Name: board_posts; Type: TABLE; Schema: public; Owner: airburst;
Tablespace:

============================================
The error I get on trying to import it.

postgres@flexo:~/WORKING_DATA/airburst_tables$ psql lt_dev1 -f ./
table_board_posts.sql
SET
SET
SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
psql:./table_board_posts.sql:248: ERROR: invalid byte sequence for
encoding "UTF8": 0x91
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT: COPY board_posts, line 8
CREATE INDEX

Thank you!

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ralph Smith (#3)
Re: v7.4 pg_dump(all) need to encode from SQL_ASCII to UTF8

Ralph Smith <smithrn@washington.edu> writes:

I'm not sure if you're saying I should ignore these errors...

No, not at all.

I'm using dumps from DB airaburst.

Doesn't look like that --- you have

Name | Owner | Encoding
------------+----------+-----------
airburst | root | SQL_ASCII

but the dump contains

SET client_encoding = 'UTF8';

which indicates that it came from a database that claimed to have UTF8
encoding. (Hmm .... although it's just barely possible that you have
PGCLIENTENCODING set in pg_dump's environment?)

psql:./table_board_posts.sql:248: ERROR: invalid byte sequence for
encoding "UTF8": 0x91

In any case, this failure is pretty strong evidence that what is in the
dump is actually *not* UTF8 data, or at least not all of it is. (I'd bet
on this particular value being in some LATINn encoding.) What you're
going to need to do is figure out exactly what encoding the data really
has. If you're lucky and it's all the same encoding, you can adjust it
to UTF8 by running the dump file through iconv, or just edit the SET
client_encoding command in the dump to match the true encoding (then
PG will take care of converting it to UTF8 during the load).

If you're not lucky, you have a mismash of differently encoded data,
and I'm afraid you're in for some unpleasant tedium getting it all into
one encoding.

The reason you're suffering this pain is that 7.x was not very good
about checking or enforcing encoding validity. Current PG is much
stricter; cleaning up the data will cost you some pain now but it'll be
a good investment in the long run.

Alternatively, if you don't particularly *care* about encoding issues
and feel that everything was working fine before, you can create your
new DB with SQL_ASCII encoding (which actually means "no known
encoding") and PG will be just as lax as it was before. But if you want
to say that the database uses UTF8 encoding, you need to present validly
encoded data.

regards, tom lane