MSSQL to PostgreSQL : Encoding problem
Hi list !
I already posted this as "COPY FROM encoding error", but I have
been doing some more tests since then.
I'm trying to export data from MS SQL Server to PostgreSQL.
The tables are quite big (>20M rows), so a CSV export and a "COPY
FROM" import seems to be the only reasonable solution.
In DTS, I have 3 options to export a table as a text file : ANSI,
OEM and UNICODE.
I tried all these options (and I have three files, one for each).
I then try to import into PostgreSQL. The furthest I get is
when using the UNICODE export, importing it with a
client_encoding set to UTF8 (I also tried WIN1252, LATIN9, LATIN1, ...).
The copy then stops with an error :
ERROR: invalid byte sequence for encoding "UTF8": 0xff
État SQL : 22021
The problematic character is the euro currency symbol.
Does anyone know how I can solve this ?
Thanks a lot !
--
Arnaud
Arnaud Lesauvage wrote:
Hi list !
I already posted this as "COPY FROM encoding error", but I have been
doing some more tests since then.
I'm trying to export data from MS SQL Server to PostgreSQL.
The tables are quite big (>20M rows), so a CSV export and a "COPY
FROM" import seems to be the only reasonable solution.
Or go via MS-Access/Perl and ODBC/DBI perhaps?
In DTS, I have 3 options to export a table as a text file : ANSI, OEM
and UNICODE.
I tried all these options (and I have three files, one for each).
Well, what character-set is your database in?
I then try to import into PostgreSQL. The furthest I get is when
using the UNICODE export, importing it with a client_encoding set
to UTF8 (I also tried WIN1252, LATIN9, LATIN1, ...).
The copy then stops with an error :
ERROR: invalid byte sequence for encoding "UTF8": 0xff
État SQL : 22021
The problematic character is the euro currency symbol.
You'll want UTF-8 or LATIN9 for the euro symbol. LATIN1 has a
character at that code point, but it is a different symbol (the generic
currency sign).
Your first step needs to be to find out what character-set your data is in.
Your second is then to decide what char-set you want to use to store it
in PG.
Then you can decide how to get there.
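To illustrate that first step, here is a minimal Python sketch (not from the thread; the file name and candidate list are only examples) that trial-decodes a sample of the export file to narrow down which encodings it could possibly be in:

```python
# Trial-decode a sample of the file against a few candidate encodings.
# Note: permissive single-byte encodings such as latin-1 accept any byte,
# so surviving the decode only rules encodings OUT; it never proves one right.

CANDIDATES = ["utf-8", "utf-16", "cp1252", "iso8859-15", "iso8859-1"]

def possible_encodings(path, sample_size=1 << 20):
    """Return the candidate encodings that can decode the first MB of the file."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    survivors = []
    for enc in CANDIDATES:
        try:
            sample.decode(enc)
            survivors.append(enc)
        except (UnicodeDecodeError, UnicodeError):
            pass
    return survivors
```

On a UTF-16 export with a BOM this immediately rules out UTF-8 (the 0xFF lead byte is invalid there), which matches the COPY error in this thread.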
--
Richard Huxton
Archonet Ltd
Arnaud Lesauvage wrote:
I then try to import into PostgreSQL. The furthest I get is when
using the UNICODE export, importing it with a client_encoding
set to UTF8 (I also tried WIN1252, LATIN9, LATIN1, ...).
The copy then stops with an error :
ERROR: invalid byte sequence for encoding "UTF8": 0xff
État SQL : 22021
The problematic character is the euro currency symbol.
Exporting from MS SQL server as unicode is going to give you full
Unicode, not UTF8. Full Unicode is 2 bytes per character and UTF8 is 1,
same as ASCII.
You will have to encode the Unicode data to UTF8.
I have done this in Delphi using its built-in UTF8 encoding and
decoding routines. You can get a free copy of Delphi Turbo Explorer
which includes components for MS SQL server and ODBC, so it would be
pretty straightforward to get this working.
The actual method in Delphi is system.UTF8Encode(widestring). This will
encode unicode to UTF8 which is compatible with a Postgresql UTF8 database.
I am sure Perl could do it also.
Hope this helps.
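For anyone without Delphi, the same re-encoding step can be sketched in Python (the file names are placeholders; this assumes the DTS "UNICODE" export is UTF-16 with a BOM, which is what Windows tools usually write):

```python
import io

def utf16_to_utf8(src_path, dst_path):
    """Re-encode an MSSQL "UNICODE" (UTF-16) text export as UTF-8 for COPY.

    Reads line by line, so multi-GB exports don't need to fit in memory.
    The utf-16 codec consumes the BOM, and the utf-8 writer emits none.
    """
    with io.open(src_path, "r", encoding="utf-16") as src, \
         io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

Something like `utf16_to_utf8('detailrecherche_unicode.csv', 'detailrecherche_utf8.csv')` should then produce a file COPY can read with client_encoding UTF8.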
--
Tony Caduto
AM Software Design
http://www.amsoftwaredesign.com
Home of PG Lightning Admin for Postgresql
Your best bet for Postgresql Administration
Tony Caduto wrote:
Arnaud Lesauvage wrote:
I then try to import into PostgreSQL. The furthest I get is when
using the UNICODE export, importing it with a client_encoding
set to UTF8 (I also tried WIN1252, LATIN9, LATIN1, ...).
The copy then stops with an error :
ERROR: invalid byte sequence for encoding "UTF8": 0xff
État SQL : 22021
The problematic character is the euro currency symbol.
Exporting from MS SQL server as unicode is going to give you full
Unicode, not UTF8. Full Unicode is 2 bytes per character and UTF8 is 1,
same as ASCII.
You will have to encode the Unicode data to UTF8.
Well, UTF8 is a minimum of one byte, but can be longer for non-ASCII
characters, the idea being that chars below 128 map to ASCII. There's
also UTF16 and I believe UTF32, with 2-or-more-byte and 4-byte characters
respectively.
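A quick illustration of those widths (plain Python, nothing from the thread):

```python
# Encoded byte lengths of the same characters: UTF-8 is variable-width,
# UTF-16 uses 2 bytes for these characters, UTF-32 is always 4.
for ch in ("A", "é", "€"):
    print(ch,
          len(ch.encode("utf-8")),      # 1, 2, 3
          len(ch.encode("utf-16-le")),  # 2, 2, 2
          len(ch.encode("utf-32-le")))  # 4, 4, 4
```

The euro sign being 3 bytes in UTF-8 but 2 in UTF-16 is exactly why a UTF-16 file fed straight to COPY as "UTF8" blows up.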
I have done this in Delphi using its built-in UTF8 encoding and
decoding routines. You can get a free copy of Delphi Turbo Explorer
which includes components for MS SQL server and ODBC, so it would be
pretty straightforward to get this working.
The actual method in Delphi is system.UTF8Encode(widestring). This will
encode unicode to UTF8 which is compatible with a Postgresql UTF8 database.
Ah, that's useful to know. Windows just doesn't have the same quantity
of tools installed as a *nix platform.
I am sure Perl could do it also.
And in one line if you're clever enough no doubt ;-)
--
Richard Huxton
Archonet Ltd
Richard Huxton wrote:
Arnaud Lesauvage wrote:
Hi list !
I already posted this as "COPY FROM encoding error", but I have been
doing some more tests since then.
I'm trying to export data from MS SQL Server to PostgreSQL.
The tables are quite big (>20M rows), so a CSV export and a "COPY
FROM" import seems to be the only reasonable solution.
Or go via MS-Access/Perl and ODBC/DBI perhaps?
Yes, I think it would work. The problem is that the DB is
too big for this kind of export. Using DTS from MSSQL to
export directly to PostgreSQL using psqlODBC Unicode Driver,
I exported ~1000 rows per second in a 2-columns table with
~20M rows. That means several days just for this table, and
I have bigger ones !
In DTS, I have 3 options to export a table as a text file : ANSI, OEM
and UNICODE.
I tried all these options (and I have three files, one for each).
Well, what character-set is your database in?
Collation in MSSQL is SQL_Latin1_General_CP1_Cl_AS.
DTS documentation tells me that exporting in ANSI should
export using the current codepage.
According to my local setting, my codepage is Windows-1252.
This file is not correctly read by COPY when using
client_encoding of WIN1252 though...
I then try to import into PostgreSQL. The furthest I get is when
using the UNICODE export, importing it with a client_encoding set
to UTF8 (I also tried WIN1252, LATIN9, LATIN1, ...).
The copy then stops with an error :
ERROR: invalid byte sequence for encoding "UTF8": 0xff
État SQL : 22021
The problematic character is the euro currency symbol.
You'll want UTF-8 or LATIN9 for the euro symbol. LATIN1 supports that
character-number but it is used for a different symbol.
Your first step needs to be to find out what character-set your data is in.
Your second is then to decide what char-set you want to use to store it
in PG.
Then you can decide how to get there.
In PG, UTF8 was my choice (the DB already exists, I am just
adding some tables that are still stored in MSSQL), and
according to what you say this was the right choice.
The problem is really about reading this file I think.
I thought that given the character sets available in
PostgreSQL, I would be able to COPY directly from my
exported files.
If I have to convert them using some third party tool, I'll
do that, but that's a bit more painful...
--
Arnaud
Arnaud Lesauvage wrote:
Richard Huxton wrote:
Or go via MS-Access/Perl and ODBC/DBI perhaps?
Yes, I think it would work. The problem is that the DB is too big for
this kind of export. Using DTS from MSSQL to export directly to
PostgreSQL using psqlODBC Unicode Driver, I exported ~1000 rows per
second in a 2-columns table with ~20M rows. That means several days just
for this table, and I have bigger ones !
Well it's about 0.25 days, but if it's too long, it's too long.
In DTS, I have 3 options to export a table as a text file : ANSI, OEM
and UNICODE.
I tried all these options (and I have three files, one for each).
Well, what character-set is your database in?
Collation in MSSQL is SQL_Latin1_General_CP1_Cl_AS.
DTS documentation tells me that exporting in ANSI should export using
the current codepage.
According to my local setting, my codepage is Windows-1252.
This file is not correctly read by COPY when using client_encoding of
WIN1252 though...
Hmm. Odd that they don't agree on what WIN1252 is. I'm not sure how to
check the file and confirm one way or the other. Anyone else on the list
got an idea?
--
Richard Huxton
Archonet Ltd
Richard Huxton wrote:
Arnaud Lesauvage wrote:
Richard Huxton wrote:
Or go via MS-Access/Perl and ODBC/DBI perhaps?
Yes, I think it would work. The problem is that the DB is too big for
this kind of export. Using DTS from MSSQL to export directly to
PostgreSQL using psqlODBC Unicode Driver, I exported ~1000 rows per
second in a 2-columns table with ~20M rows. That means several days just
for this table, and I have bigger ones !
Well it's about 0.25 days, but if it's too long, it's too long.
Sure, sorry for the confusion, the problem is with the other
tables (same number of rows but a lot of columns, some very
large).
Collation in MSSQL is SQL_Latin1_General_CP1_Cl_AS.
DTS documentation tells me that exporting in ANSI should export using
the current codepage.
According to my local setting, my codepage is Windows-1252.
This file is not correctly read by COPY when using client_encoding of
WIN1252 though...
Hmm. Odd that they don't agree on what WIN1252 is. I'm not sure how to
check the file and confirm one way or the other. Anyone else on the list
got an idea?
I just downloaded the GnuWin32 version of iconv.
I'm giving it a try and I'll tell you how it went.
--
Arnaud
2006/11/21, Arnaud Lesauvage <thewild@freesurf.fr>:
Hi list !
I already posted this as "COPY FROM encoding error", but I have
been doing some more tests since then.
I'm trying to export data from MS SQL Server to PostgreSQL.
The tables are quite big (>20M rows), so a CSV export and a "COPY
FROM" import seems to be the only reasonable solution.
I believe you might have more luck working without files altogether.
Use an ETL tool like kettle or even DTS with the pgsql ODBC driver.
That's exactly what those tools are for.
You still have to get the encodings right, though.
I suggest unicode for pgsql, but only you know how the MSSQL database
is encoded.
t.n.a.
Tomi NA wrote:
2006/11/21, Arnaud Lesauvage <thewild@freesurf.fr>:
Hi list !
I already posted this as "COPY FROM encoding error", but I have
been doing some more tests since then.
I'm trying to export data from MS SQL Server to PostgreSQL.
The tables are quite big (>20M rows), so a CSV export and a "COPY
FROM" import seems to be the only reasonable solution.
I believe you might have more luck working without files altogether.
Use an ETL tool like kettle or even DTS with the pgsql ODBC driver.
That's exactly what those tools are for.
You still have to get the encodings right, though.
I suggest unicode for pgsql, but only you know how the MSSQL database
is encoded.
I think I'll go this way... No other choice, actually !
The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
I don't really understand what this is. It supports the euro
symbol, so it is probably not pure LATIN1, right ?
--
Arnaud
2006/11/22, Arnaud Lesauvage <thewild@freesurf.fr>:
Tomi NA wrote:
2006/11/21, Arnaud Lesauvage <thewild@freesurf.fr>:
Hi list !
I already posted this as "COPY FROM encoding error", but I have
been doing some more tests since then.
I'm trying to export data from MS SQL Server to PostgreSQL.
The tables are quite big (>20M rows), so a CSV export and a "COPY
FROM" import seems to be the only reasonable solution.
I believe you might have more luck working without files altogether.
Use an ETL tool like kettle or even DTS with the pgsql ODBC driver.
That's exactly what those tools are for.
You still have to get the encodings right, though.
I suggest unicode for pgsql, but only you know how the MSSQL database
is encoded.
I think I'll go this way... No other choice, actually !
The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
I don't really understand what this is. It supports the euro
symbol, so it is probably not pure LATIN1, right ?
I suppose you'd have to look at the latin1 codepage character table
somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
question. :)
t.n.a.
Tomi NA wrote:
I think I'll go this way... No other choice, actually !
The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
I don't really understand what this is. It supports the euro
symbol, so it is probably not pure LATIN1, right ?
I suppose you'd have to look at the latin1 codepage character table
somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
question. :)
Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that
LATIN1 is missing the euro sign...
Grrrrr I hate this !!!
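This is easy to confirm in code: ISO-8859-15 (Latin-9) placed the euro sign at 0xA4, where ISO-8859-1 (Latin-1) only has the generic currency sign ¤. A quick Python check (just an illustration, not from the thread):

```python
euro = "\u20ac"  # €
print(euro.encode("iso8859-15"))   # b'\xa4' - Latin-9 has the euro at 0xA4
try:
    euro.encode("iso8859-1")       # Latin-1 cannot represent it at all
except UnicodeEncodeError:
    print("no euro in Latin-1")
```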
--
Arnaud
Arnaud Lesauvage wrote:
Tomi NA wrote:
I think I'll go this way... No other choice, actually !
The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
I don't really understand what this is. It supports the euro
symbol, so it is probably not pure LATIN1, right ?
I suppose you'd have to look at the latin1 codepage character table
somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
question. :)
Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that
LATIN1 is missing the euro sign...
Grrrrr I hate this !!!
So use Latin9 ...
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
I have done this in Delphi using its built-in UTF8 encoding and
decoding routines. You can get a free copy of Delphi Turbo Explorer
which includes components for MS SQL server and ODBC, so it would be
pretty straightforward to get this working.
The actual method in Delphi is system.UTF8Encode(widestring). This
will encode unicode to UTF8 which is compatible with a Postgresql UTF8 database.
Ah, that's useful to know. Windows just doesn't have the same
quantity of tools installed as a *nix platform.
If your file is small enough, you can just open it up in Notepad and
re-save it as UTF8. It might play funny with the BOMs though
(byte-order-marks).
There is also, IIRC, an iconv binary available for Windows that should
be able to do such a conversion. Can't remember where though :-)
//Magnus
I already posted this as "COPY FROM encoding error", but I have been
doing some more tests since then.
I'm trying to export data from MS SQL Server to PostgreSQL.
The tables are quite big (>20M rows), so a CSV export and a "COPY
FROM" import seems to be the only reasonable solution.
Or go via MS-Access/Perl and ODBC/DBI perhaps?
Yes, I think it would work. The problem is that the DB is too
big for this kind of export. Using DTS from MSSQL to export
directly to PostgreSQL using psqlODBC Unicode Driver, I
exported ~1000 rows per second in a 2-columns table with ~20M
rows. That means several days just for this table, and I have
bigger ones !
Interesting. What did you set the "Insert batch size" to? (I think that's
available for all transformation tasks). And did you remember to check
the box for "use transactions"?
While it's never as fast as a COPY, it should be possible to make it
faster than that, I think.
Another option is to just BCP the file out, and then COPY it into
postgresql. No nice GUI, but you can tune almost everything with BCP.
//Magnus
Alvaro Herrera wrote:
Arnaud Lesauvage wrote:
Tomi NA wrote:
I think I'll go this way... No other choice, actually !
The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
I don't really understand what this is. It supports the euro
symbol, so it is probably not pure LATIN1, right ?
I suppose you'd have to look at the latin1 codepage character table
somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
question. :)
Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that
LATIN1 is missing the euro sign...
Grrrrr I hate this !!!
So use Latin9 ...
Of course, but it doesn't work !!!
Whatever client encoding I choose in postgresql before
COPYing, I get the 'invalid byte sequence error'.
The furthest I get is exporting to UNICODE and importing
as UTF8. Then COPY only breaks on the euro symbol (otherwise
it breaks very early, I think on the first "non-ascii"
character).
--
Arnaud
Arnaud Lesauvage wrote:
Alvaro Herrera wrote:
Arnaud Lesauvage wrote:
Tomi NA wrote:
I think I'll go this way... No other choice, actually !
The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
I don't really understand what this is. It supports the euro
symbol, so it is probably not pure LATIN1, right ?
I suppose you'd have to look at the latin1 codepage character table
somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
question. :)
Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that
LATIN1 is missing the euro sign...
Grrrrr I hate this !!!
So use Latin9 ...
Of course, but it doesn't work !!!
Whatever client encoding I choose in postgresql before
COPYing, I get the 'invalid byte sequence error'.
Humm ... how are you choosing the client encoding? Is it actually
working? I don't see how choosing Latin1 or Latin9 and feeding whatever
byte sequence would give you an "invalid byte sequence". These charsets
don't have any way to validate the bytes, as opposed to what UTF-8 can
do. So you could end up with invalid bytes if you choose the wrong
client encoding, but that's a different error.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Of course, but it doesn't work !!!
Whatever client encoding I choose in postgresql before COPYing, I get
the 'invalid byte sequence error'.
The furthest I get is exporting to UNICODE and importing as UTF8.
Then COPY only breaks on the euro symbol (otherwise it breaks very
early, I think on the first "non-ascii" character).
Like I said before, UNICODE <> UTF8. That's why the COPY command breaks
on the Euro symbol.
You will have to export as UNICODE, then encode it as UTF8, then you
won't get the breakage.
UTF8 is simply a means to store UNICODE pretty much as ASCII text.
You could grab a copy of Delphi TurboExplorer and create an import
routine using the dbGO ADO components and the PG ODBC driver.
Basically you need to encode any UNICODE data going to the PG server with
the system.utf8encode function:
[Delphi] function *UTF8Encode*(const WS: WideString): UTF8String;
Call Utf8Encode to convert a Unicode string to UTF-8. WS is the Unicode
string to convert. Utf8Encode returns the corresponding UTF-8 string.
I would imagine that Perl also has such routines, but I don't know for
sure. These routines might be in FreePascal as well.
--
Tony Caduto
AM Software Design
http://www.amsoftwaredesign.com
Home of PG Lightning Admin for Postgresql
Your best bet for Postgresql Administration
Magnus Hagander wrote:
I have done this in Delphi using its built-in UTF8 encoding and
decoding routines. You can get a free copy of Delphi Turbo Explorer
which includes components for MS SQL server and ODBC, so it would be
pretty straightforward to get this working.
The actual method in Delphi is system.UTF8Encode(widestring). This
will encode unicode to UTF8 which is compatible with a Postgresql UTF8 database.
Ah, that's useful to know. Windows just doesn't have the same
quantity of tools installed as a *nix platform.
If your file is small enough, you can just open it up in Notepad and
re-save it as UTF8. It might play funny with the BOMs though
(byte-order-marks).
There is also, IIRC, an iconv binary available for Windows that should
be able to do such a conversion. Can't remember where though :-)
The file is way too big for notepad. It is even too big for
notepad++.
I do have the GnuWin32 version of iconv (*great* software
collection, BTW), but still no go...
I tried
iconv -f "CP1252" -t "UTF-8" detailrecherche_ansi.csv > detailrecherche_cp1252utf8.csv
and
iconv -f "LATIN-9" -t "UTF-8" detailrecherche_ansi.csv > detailrecherche_latin9utf8.csv
Neither result will load as UTF8 (invalid byte sequence 0x00).
I am desperate...
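For what it's worth, an "invalid byte sequence 0x00" usually means the input still contains UTF-16 NUL bytes, i.e. the "ANSI" file may not actually be CP1252, or iconv was given the wrong source encoding. A small diagnostic sketch (the file name would be a placeholder, and this is only a guess at the cause):

```python
import codecs

def sniff(path):
    """Rough guess at a text file's encoding family from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(4096)
    # A UTF-16 BOM is the strongest signal.
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16 (BOM present)"
    # NUL bytes almost never occur in CP1252/Latin-x/UTF-8 text.
    if b"\x00" in head:
        return "probably utf-16 (NUL bytes present)"
    return "8-bit encoding or utf-8 (no NUL bytes)"
```

If it reports UTF-16, the iconv source encoding should be UTF-16 (`-f UTF-16`), not CP1252 or LATIN-9.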
--
Arnaud
Alvaro Herrera wrote:
Arnaud Lesauvage wrote:
Alvaro Herrera wrote:
Arnaud Lesauvage wrote:
Tomi NA wrote:
I think I'll go this way... No other choice, actually !
The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
I don't really understand what this is. It supports the euro
symbol, so it is probably not pure LATIN1, right ?
I suppose you'd have to look at the latin1 codepage character table
somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
question. :)
Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that
LATIN1 is missing the euro sign...
Grrrrr I hate this !!!
So use Latin9 ...
Of course, but it doesn't work !!!
Whatever client encoding I choose in postgresql before
COPYing, I get the 'invalid byte sequence error'.
Humm ... how are you choosing the client encoding? Is it actually
working? I don't see how choosing Latin1 or Latin9 and feeding whatever
byte sequence would give you an "invalid byte sequence". These charsets
don't have any way to validate the bytes, as opposed to what UTF-8 can
do. So you could end up with invalid bytes if you choose the wrong
client encoding, but that's a different error.
mydb=# SET client_encoding TO LATIN9;
SET
mydb=# COPY statistiques.detailrecherche (log_gid,
champrecherche, valeurrecherche) FROM
'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV;
ERROR: invalid byte sequence for encoding "LATIN9": 0x00
HINT: This error can also happen if the byte sequence does
not match the encoding expected by the server, which is
controlled by "client_encoding".
CONTEXT: COPY detailrecherche, line 9212
mydb=# SET client_encoding TO WIN1252;
SET
mydb=# COPY statistiques.detailrecherche (log_gid,
champrecherche, valeurrecherche) FROM
'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV;
ERROR: invalid byte sequence for encoding "WIN1252": 0x00
HINT: This error can also happen if the byte sequence does
not match the encoding expected by the server, which is
controlled by "client_encoding".
CONTEXT: COPY detailrecherche, line 9212
Really, I'd rather have another error, but this is all I can
get.
This is with the "ANSI" export.
With the "UNICODE" export :
mydb=# SET client_encoding TO UTF8;
SET
mydb=# COPY statistiques.detailrecherche (log_gid,
champrecherche, valeurrecherche) FROM
'E:\\Production\\Temp\\detailrecherche_unicode.csv' CSV;
ERROR: invalid byte sequence for encoding "UTF8": 0xff
HINT: This error can also happen if the byte sequence does
not match the encoding expected by the server, which is
controlled by "client_encoding".
CONTEXT: COPY detailrecherche, line 592680
So, line 592680 is *a lot* better, but it is still not good!
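One way to see what that 0xff on line 592680 actually is would be to dump the raw bytes of the offending line. A rough sketch (it counts physical 0x0A bytes, so the line number only lines up when that matches COPY's idea of a line; path and line number are just taken from the error above as an example):

```python
def raw_line(path, lineno):
    """Return the raw bytes of the lineno-th line (1-based), or None if absent."""
    with open(path, "rb") as f:
        # Iterating a binary file splits on 0x0A bytes.
        for n, raw in enumerate(f, start=1):
            if n == lineno:
                return raw
    return None
```

Then something like `raw_line('detailrecherche_unicode.csv', 592680).hex()` shows the bytes; 0xFF 0xFE there would suggest a stray UTF-16 BOM or a run of bytes the conversion never touched.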
--
Arnaud
Or go via MS-Access/Perl and ODBC/DBI perhaps?
Yes, I think it would work. The problem is that the DB is too big for
this kind of export. Using DTS from MSSQL to export directly to
PostgreSQL using psqlODBC Unicode Driver, I exported ~1000 rows per
second in a 2-columns table with ~20M rows. That means several days just
for this table, and I have bigger ones !
Well it's about 0.25 days, but if it's too long, it's too long.
Sure, sorry for the confusion, the problem is with the other tables (same
number of rows but a lot of columns, some very large).
Well, if it's too slow, then you will have to dump the db to a textfile (DTS
does this for you) and then convert the textfile to UTF8 manually before
importing it to pgsql. iconv for win32 will help you there. I found, though,
that it removes some wanted special characters, so watch out.
A less "scientific" approach would be using a Unicode-aware text editor to
convert it (UltraEdit does this pretty nicely, for example). I have had good
results with it.
Loading several million rows will always take some time, though.
- thomas