COPY command use UTF-8 encoding and NOT UNICODE(16bits)... please confirm. Should postgresql add :set CLIENT_ENCODING to 'UTF-8'; to avoid confusion

Started by David Gagnonabout 21 years ago8 messagesgeneral

dgagnon@siunik.com

about 21 years ago

Hi all,

I ran into this problem and want to share and have a confirmation.

I tried to use COPY function to load bulk data. I craft myself a
UNICODE file from a MSSQL db. I can't load it into the postgresql. I
always get the error: CONTEXT: COPY vd, line 1, column vdnum: "ï¿½ï¿½1"

The problem is that both file are exactly the same... I found that
pg_dump create in fact a UTF-8 (Confirm please) file with is UNICODE but
with variable length encoding (Ie: Some character user 8 bytes and other
16 bytes ...). See for detail:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8. The file I crafted
is a true UNICODE (16 bytes or *UCS-2) file (Confirm please)*

So here is the content of the file:
UTF-8 (Postgresql dump):
1 1 1 AC COLUMNï¿½ACNUMï¿½ACDESCï¿½ACDELPAIï¿½

UNICODE (crafted from mssql)
1 1 1 AC COLUMNï¿½ACNUMï¿½ACDESCï¿½ACDELPAIï¿½

HEX representation UTF-8 (Postgresql dump):

00000000:31 09 31 09 31 09 41 43 09 43 4f 4c 55 4d 4e c3 1.1.1.AC.COLUMNï¿½
00000010:bf 41 43 4e 55 4d c3 bf 41 43 44 45 53 43 c3 bf ï¿½ACNUMÃ¿ACDESCÃ¿
00000020:41 43 44 45 4c 50 41 49 c3 bf ACDELPAIÃ¿

HEX representation UNICODE (crafted from mssql)
00000000:ff fe 31 00 09 00 31 00 09 00 31 00 09 00 41 00 ï¿½ï¿½1...1...1...A.
00000010:43 00 09 00 43 00 4f 00 4c 00 55 00 4d 00 4e 00 C...C.O.L.U.M.N.
00000020:ff 00 41 00 43 00 4e 00 55 00 4d 00 ff 00 41 00 ï¿½.A.C.N.U.M.ï¿½.A.
00000030:43 00 44 00 45 00 53 00 43 00 ff 00 41 00 43 00 C.D.E.S.C.ï¿½.A.C.
00000040:44 00 45 00 4c 00 50 00 41 00 49 00 ff 00 D.E.L.P.A.I.ï¿½.

So postgresql bug with the FF FE that start the UNICODE document. Is
that normal UNICODE file starts with this FF FE ?! Note that I tried to
delete those character but they aren`t visible...

So am I right ? Is Postgresql using UTF-8 and don`t really understand
UNICODE file (UCS-2)? Is there a way I can make the COPY command with a
UNICODE UCS-2 encoding

Thanks for your help
/David

Tom Lane

tgl@sss.pgh.pa.us

about 21 years ago

In reply to: David Gagnon (#1)

Re: COPY command use UTF-8 encoding and NOT UNICODE(16bits)... please confirm. Should postgresql add :set CLIENT_ENCODING to 'UTF-8'; to avoid confusion

David Gagnon <dgagnon@siunik.com> writes:

So am I right ? Is Postgresql using UTF-8 and don`t really understand
UNICODE file (UCS-2)? Is there a way I can make the COPY command with a
UNICODE UCS-2 encoding

Postgres only supports UTF-8, not any other encoding of Unicode. Sorry.

regards, tom lane

Tino Wildenhain

tino@wildenhain.de

about 21 years ago

In reply to: David Gagnon (#1)

Re: COPY command use UTF-8 encoding and NOT

Am Mittwoch, den 06.04.2005, 18:12 -0400 schrieb David Gagnon:

Hi all,

I ran into this problem and want to share and have a confirmation.

I tried to use COPY function to load bulk data. I craft myself a
UNICODE file from a MSSQL db. I can't load it into the postgresql. I
always get the error: CONTEXT: COPY vd, line 1, column vdnum: "ï¿œï¿œ1"

The problem is that both file are exactly the same... I found that
pg_dump create in fact a UTF-8 (Confirm please) file with is UNICODE
but with variable length encoding (Ie: Some character user 8 bytes and
other 16 bytes ...). See for detail:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8. The file I crafted
is a true UNICODE (16 bytes or UCS-2) file (Confirm please)

So here is the content of the file:
UTF-8 (Postgresql dump):
1 1 1 AC COLUMNï¿œACNUMï¿œACDESCï¿œACDELPAIï¿œ

UNICODE (crafted from mssql)
1 1 1 AC COLUMNï¿œACNUMï¿œACDESCï¿œACDELPAIï¿œ

HEX representation UTF-8 (Postgresql dump):

00000000:31 09 31 09 31 09 41 43 09 43 4f 4c 55 4d 4e c3
1.1.1.AC.COLUMNï¿œ
00000010:bf 41 43 4e 55 4d c3 bf 41 43 44 45 53 43 c3 bf
ï¿œACNUMÃ¿ACDESCÃ¿
00000020:41 43 44 45 4c 50 41 49 c3 bf
ACDELPAIÃ¿

HEX representation UNICODE (crafted from mssql)
00000000:ff fe 31 00 09 00 31 00 09 00 31 00 09 00 41 00
ï¿œï¿œ1...1...1...A.
00000010:43 00 09 00 43 00 4f 00 4c 00 55 00 4d 00 4e 00
C...C.O.L.U.M.N.
00000020:ff 00 41 00 43 00 4e 00 55 00 4d 00 ff 00 41 00
ï¿œ.A.C.N.U.M.ï¿œ.A.
00000030:43 00 44 00 45 00 53 00 43 00 ff 00 41 00 43 00
C.D.E.S.C.ï¿œ.A.C.
00000040:44 00 45 00 4c 00 50 00 41 00 49 00 ff 00
D.E.L.P.A.I.ï¿œ.

So postgresql bug with the FF FE that start the UNICODE document. Is
that normal UNICODE file starts with this FF FE ?! Note that I tried
to delete those character but they aren`t visible...

So am I right ? Is Postgresql using UTF-8 and don`t really understand
UNICODE file (UCS-2)? Is there a way I can make the COPY command with
a UNICODE UCS-2 encoding

Yes, postgres Unicode means utf-8.
Windows programs which store unicode text are usually prepending
the file with a BOM (byte order mark) (just google for unicode
and BOM)
So you would need something to convert. For example I know python
can read BOM so it can be a matter of open it, read, encode it
into utf-8 and write it out again.

Regards
Tino Wildenhain

Mike Rylander

mrylander@gmail.com

about 21 years ago

In reply to: Tom Lane (#2)

Re: COPY command use UTF-8 encoding and NOT UNICODE(16bits)... please confirm. Should postgresql add :set CLIENT_ENCODING to 'UTF-8'; to avoid confusion

On Apr 6, 2005 10:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Gagnon <dgagnon@siunik.com> writes:

So am I right ? Is Postgresql using UTF-8 and don`t really understand
UNICODE file (UCS-2)? Is there a way I can make the COPY command with a
UNICODE UCS-2 encoding

Postgres only supports UTF-8, not any other encoding of Unicode. Sorry.

You can use iconv to convert the file on a *nix-like system, and there
may even be a cygwin build if you're on Windows.

http://www.gnu.org/software/libiconv/

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

David Gagnon

dgagnon@siunik.com

about 21 years ago

In reply to: Tino Wildenhain (#3)

Improvement for COPY command .. unless it already exists (If yes please tell me!)

Hi all,

Thanks for your reply on my yesterday's question regarding UTF-8 as a
UNICODE implementation in postgresql.

Is there a way to specify default values in the COPY command? In my
example example VDVSSRC and VDVSNUM are the same for the 1500000 rows of
the file. If it was possible to default their value instead of putting
it in the file, it would have been great!

COPY vd (VDNUM=DEFAULT, VDVSSRC=2, VDVSNUM=1, VDKEY, VDDATA)
FROM '/cygdrive/f/projects/web-catalog/var/vdOk.backup';

Thanks anyway .. your doing a wonderfull job with postgresql!
/David

Tom Lane

tgl@sss.pgh.pa.us

about 21 years ago

In reply to: David Gagnon (#5)

Re: Improvement for COPY command .. unless it already exists (If yes please tell me!)

David Gagnon <dgagnon@siunik.com> writes:

Is there a way to specify default values in the COPY command?

There's always ALTER TABLE ... SET DEFAULT.

regards, tom lane

David Gagnon

dgagnon@siunik.com

about 21 years ago

In reply to: Tom Lane (#6)

Re: Improvement for COPY command .. unless it already exists

Hi,

I want to default some columns for this COPY command only. So DEFAULT
is not appropriate in this case. In fact I think you suggested a
workaround right ? Doing an alter table before and another one after
will work ... not fully clean thought since my user may not have the
write to modify the table. Being able to specify default values
directly in the COPY command will be really nice :-) I just wanted to
share my tought :-)

Being eable to read zipped file will be cool to .. but I'm not sure it's
Chrismas yet right so I didn't ask ;-)

Thanks for your help
/David

Tom Lane wrote:

Show quoted text

David Gagnon <dgagnon@siunik.com> writes:

Is there a way to specify default values in the COPY command?

There's always ALTER TABLE ... SET DEFAULT.

regards, tom lane

Sean Davis

sdavis2@mail.nih.gov

about 21 years ago

In reply to: David Gagnon (#7)

Re: Improvement for COPY command .. unless it already exists

On Apr 7, 2005, at 10:33 AM, David Gagnon wrote:

Hi,

I want to default some columns for this COPY command only. So DEFAULT
is not appropriate in this case. In fact I think you suggested a
workaround right ? Doing an alter table before and another one after
will work ... not fully clean thought since my user may not have the
write to modify the table. Being able to specify default values
directly in the COPY command will be really nice :-) I just wanted to
share my tought :-)

Being eable to read zipped file will be cool to .. but I'm not sure
it's Chrismas yet right so I didn't ask ;-)

Thanks for your help
/David

David,

If you use a simple wrapper around copy using DBI (as either an
untrusted perl database function) or straight perl, you could read
zipped files as a filehandle.

As for defaults, you could also write a trigger to accomplish the
"default" behavior or, if you use the above perl-based loader, you
could implement defaults as well....

Sean

Show quoted text

Tom Lane wrote:

David Gagnon <dgagnon@siunik.com> writes:

Is there a way to specify default values in the COPY command?

There's always ALTER TABLE ... SET DEFAULT.

regards, tom lane

---------------------------(end of
broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

COPY command use UTF-8 encoding and NOT UNICODE(16bits)... please confirm. Should postgresql add :set CLIENT_ENCODING to 'UTF-8'; to avoid confusion

Attachments: