Linux x Windows LOCALE/ENCODING compatibility

Started by Carlos Henrique Reimerover 15 years ago4 messagesgeneral
Jump to latest
#1Carlos Henrique Reimer
carlos.reimer@opendb.com.br

Hi,

I currently have my PostgreSQL server running in a windows box and now we're
migrating it to a Linux operational system.

Current windows configuration:
pg_controldata shows the LC_COLLATE and LC_CTYPE are Portuguese_Brasil.1252
psql \l command shows we have databases with encoding WIN1252 and UTF8

New Linux box:
Which --locale and --encoding options should I use for the INITDB command to
have a compatible locale and encoding cluster?
Which --encoding option should I use in the CREATEDB command to have a
compatible encoding database?

Thank you!
Reimer

#2Filip Rembiałkowski
filip.rembialkowski@gmail.com
In reply to: Carlos Henrique Reimer (#1)
Re: Linux x Windows LOCALE/ENCODING compatibility

2010/11/1 Carlos Henrique Reimer <carlos.reimer@opendb.com.br>

Hi,

I currently have my PostgreSQL server running in a windows box and now we're migrating it to a Linux operational system.

Current windows configuration:
pg_controldata shows the LC_COLLATE and LC_CTYPE are Portuguese_Brasil.1252
psql \l command shows we have databases with encoding WIN1252 and UTF8

New Linux box:
Which --locale and --encoding options should I use for the INITDB command to have a compatible locale and encoding cluster?
Which --encoding option should I use in the CREATEDB command to have a compatible encoding database?

from your description I assume you have 8.3 on Windows,
but you did not tell us which PostgreSQL version you are planning to
deploy on Linux. I'd recommend 8.4.

main difference is that on Windows, UTF-8 encoding can be used with
any locale. on Unix, not.

generally, it is safe to do ``initdb --locale=pt_BR.UTF-8'' and create
fresh databases with default settings.

then, use Linux pg_dump and psql to clone Windows databases. like this:
pg_dump -h windowsbox -p 5432 -U postgres DBNAME | psql DBNAME

UTF-8 is compatible with all client encodings.

you will get compatible bahaviour for your existing apps with
SET client_encoding TO 'whatever'; /* this can be entered as pre-SQL
on every client connection */

see http://www.postgresql.org/docs/8.4/static/multibyte.html,
22.2.3. Automatic Character Set Conversion Between Server and Client

HTH

--
Filip Rembiałkowski
JID,mailto:filip.rembialkowski@gmail.com
http://filip.rembialkowski.net/

#3Carlos Henrique Reimer
carlos.reimer@opendb.com.br
In reply to: Filip Rembiałkowski (#2)
Re: Linux x Windows LOCALE/ENCODING compatibility

Hi,

I'm currently in the process of moving the data from the Windows server to
the new Linux box but facing some problems with the encoding.

Additional configuration information: Windows is running PG 8.3 and the new
Linux box is PG 8.4.

Windows dump command:
pg_dump -U postgres -Fc -v -f "f:\backup

Linux restore command:
pg_restore -v -n brasil -d mapas /backup

pg_restore error:
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 3574; 0 40805 TABLE DATA
cidade
postgres
pg_restore: [archiver (db)] COPY failed: ERROR: character 0x81 of encoding
"WIN
1252" has no equivalent in "UTF8"
CONTEXT: COPY cidade, line 6

I also tried to dump using pg_dump -E UTF8 but then I got:
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: character 0x81 of encoding
"WIN1252" has no equivalent in "UTF8"
pg_dump: The command was: COPY brasil.cidade (gid, "municpio", "municpi0",
uf, longitude, latitude, the_geom) TO stdout;
pg_dump: *** aborted because of error

How can I fix this error?

Thank you!

2010/11/1 Filip Rembiałkowski <filip.rembialkowski@gmail.com>

2010/11/1 Carlos Henrique Reimer <carlos.reimer@opendb.com.br>

Hi,

I currently have my PostgreSQL server running in a windows box and now

we're migrating it to a Linux operational system.

Current windows configuration:
pg_controldata shows the LC_COLLATE and LC_CTYPE are

Portuguese_Brasil.1252

psql \l command shows we have databases with encoding WIN1252 and UTF8

New Linux box:
Which --locale and --encoding options should I use for the INITDB command

to have a compatible locale and encoding cluster?

Which --encoding option should I use in the CREATEDB command to have a

compatible encoding database?

from your description I assume you have 8.3 on Windows,
but you did not tell us which PostgreSQL version you are planning to
deploy on Linux. I'd recommend 8.4.

main difference is that on Windows, UTF-8 encoding can be used with
any locale. on Unix, not.

generally, it is safe to do ``initdb --locale=pt_BR.UTF-8'' and create
fresh databases with default settings.

then, use Linux pg_dump and psql to clone Windows databases. like this:
pg_dump -h windowsbox -p 5432 -U postgres DBNAME | psql DBNAME

UTF-8 is compatible with all client encodings.

you will get compatible bahaviour for your existing apps with
SET client_encoding TO 'whatever'; /* this can be entered as pre-SQL
on every client connection */

see http://www.postgresql.org/docs/8.4/static/multibyte.html,
22.2.3. Automatic Character Set Conversion Between Server and Client

HTH

--
Filip Rembiałkowski
JID,mailto:filip.rembialkowski@gmail.com
http://filip.rembialkowski.net/

--
Reimer
47-3347-1724 47-9183-0547 msn: carlos.reimer@opendb.com.br

#4Marco Colombo
pgsql@esiway.net
In reply to: Carlos Henrique Reimer (#3)
Re: Linux x Windows LOCALE/ENCODING compatibility

On 11/09/2010 02:31 AM, Carlos Henrique Reimer wrote:

Hi,

I'm currently in the process of moving the data from the Windows server
to the new Linux box but facing some problems with the encoding.

Additional configuration information: Windows is running PG 8.3 and the
new Linux box is PG 8.4.

Windows dump command:
pg_dump -U postgres -Fc -v -f "f:\backup

Linux restore command:
pg_restore -v -n brasil -d mapas /backup

pg_restore error:
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 3574; 0 40805 TABLE
DATA cidade
postgres
pg_restore: [archiver (db)] COPY failed: ERROR: character 0x81 of
encoding "WIN
1252" has no equivalent in "UTF8"
CONTEXT: COPY cidade, line 6

I also tried to dump using pg_dump -E UTF8 but then I got:
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: character 0x81 of encoding
"WIN1252" has no equivalent in "UTF8"
pg_dump: The command was: COPY brasil.cidade (gid, "municpio",
"municpi0", uf, longitude, latitude, the_geom) TO stdout;
pg_dump: *** aborted because of error

How can I fix this error?

Well, that's pretty much evidence that the text you have on windows is
NOT win1252-encoded. Or some parts of it, at least. According to this page:

http://en.wikipedia.org/wiki/Windows-1252

0x81 is undefined in win1252.

Please note that if the old DB has SQL_ASCII encoding, you may have
mixed encoding text in the same database (either on purpose or by
mistake) and you have to either keep using SQL_ASCII in the new DB
(which means keeping potentially invalid data around), or do proper
conversion to UTF-8 (which possibly has do be done differently for
different tables, or even different rows in the same table) BEFORE
importing it in the new DB.

To convert from one encoding to another, you can use, surprisingly, the
'convert' function. The tricky part is that once you have text of
unknown encoding, you can only guess. That is, try to convert it from
some reasonable encodings to UTF-8, and look at the result. If it looks
right, that may be the right encoding. :)

BTW, 0x81 is a rather weird char, it's invalid in many common encodings.
It's valid in win-1251, where it represents U+0403 (CYRILLIC CAPITAL
LETTER GJE), Ѓ (if you don't have the font, you won't be able to see
it), but I think it's an unlikely candidate.

In both CP437 and CP850 (old DOS encodings for western languages) it
represents U+00FC (LATIN SMALL LETTER U WITH DIAERESIS), ü (again, I
hope you can see it). That's a better candidate, it's possible someone
accessed the DB either directly, or via a web application, from a
Windows ME or older system.

Unfortunately, AFAIK, PostgreSQL doesn't support the cp850 encoding.
Maybe you can dump the data (even a single table) in text format and use
a text editor (or shell utils like perl, sed, tr) to replace offending
charaters with their proper win-1252 counterparts (e.g., 0x81 can be
replaced with 0xfc, which is u with diaresis in win-1252).

I hope it helps.

.TM.