COPY command use UTF-8 encoding and NOT UNICODE(16bits)... please confirm. Should postgresql add :set CLIENT_ENCODING to 'UTF-8'; to avoid confusion
Hi all,
I ran into this problem and want to share and have a confirmation.
I tried to use COPY function to load bulk data. I craft myself a
UNICODE file from a MSSQL db. I can't load it into the postgresql. I
always get the error: CONTEXT: COPY vd, line 1, column vdnum: "��1"
The problem is that both file are exactly the same... I found that
pg_dump create in fact a UTF-8 (Confirm please) file with is UNICODE but
with variable length encoding (Ie: Some character user 8 bytes and other
16 bytes ...). See for detail:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8. The file I crafted
is a true UNICODE (16 bytes or *UCS-2) file (Confirm please)*
So here is the content of the file:
UTF-8 (Postgresql dump):
1 1 1 AC COLUMN�ACNUM�ACDESC�ACDELPAI�
UNICODE (crafted from mssql)
1 1 1 AC COLUMN�ACNUM�ACDESC�ACDELPAI�
HEX representation UTF-8 (Postgresql dump):
00000000:31 09 31 09 31 09 41 43 09 43 4f 4c 55 4d 4e c3 1.1.1.AC.COLUMN�
00000010:bf 41 43 4e 55 4d c3 bf 41 43 44 45 53 43 c3 bf �ACNUMÿACDESCÿ
00000020:41 43 44 45 4c 50 41 49 c3 bf ACDELPAIÿ
HEX representation UNICODE (crafted from mssql)
00000000:ff fe 31 00 09 00 31 00 09 00 31 00 09 00 41 00 ��1...1...1...A.
00000010:43 00 09 00 43 00 4f 00 4c 00 55 00 4d 00 4e 00 C...C.O.L.U.M.N.
00000020:ff 00 41 00 43 00 4e 00 55 00 4d 00 ff 00 41 00 �.A.C.N.U.M.�.A.
00000030:43 00 44 00 45 00 53 00 43 00 ff 00 41 00 43 00 C.D.E.S.C.�.A.C.
00000040:44 00 45 00 4c 00 50 00 41 00 49 00 ff 00 D.E.L.P.A.I.�.
So postgresql bug with the FF FE that start the UNICODE document. Is
that normal UNICODE file starts with this FF FE ?! Note that I tried to
delete those character but they aren`t visible...
So am I right ? Is Postgresql using UTF-8 and don`t really understand
UNICODE file (UCS-2)? Is there a way I can make the COPY command with a
UNICODE UCS-2 encoding
Thanks for your help
/David
David Gagnon <dgagnon@siunik.com> writes:
So am I right ? Is Postgresql using UTF-8 and don`t really understand
UNICODE file (UCS-2)? Is there a way I can make the COPY command with a
UNICODE UCS-2 encoding
Postgres only supports UTF-8, not any other encoding of Unicode. Sorry.
regards, tom lane
Am Mittwoch, den 06.04.2005, 18:12 -0400 schrieb David Gagnon:
Hi all,
I ran into this problem and want to share and have a confirmation.
I tried to use COPY function to load bulk data. I craft myself a
UNICODE file from a MSSQL db. I can't load it into the postgresql. I
always get the error: CONTEXT: COPY vd, line 1, column vdnum: "ᅵᅵ1"The problem is that both file are exactly the same... I found that
pg_dump create in fact a UTF-8 (Confirm please) file with is UNICODE
but with variable length encoding (Ie: Some character user 8 bytes and
other 16 bytes ...). See for detail:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8. The file I crafted
is a true UNICODE (16 bytes or UCS-2) file (Confirm please)So here is the content of the file:
UTF-8 (Postgresql dump):
1 1 1 AC COLUMNᅵACNUMᅵACDESCᅵACDELPAIᅵUNICODE (crafted from mssql)
1 1 1 AC COLUMNᅵACNUMᅵACDESCᅵACDELPAIᅵHEX representation UTF-8 (Postgresql dump):
00000000:31 09 31 09 31 09 41 43 09 43 4f 4c 55 4d 4e c3
1.1.1.AC.COLUMNᅵ
00000010:bf 41 43 4e 55 4d c3 bf 41 43 44 45 53 43 c3 bf
ᅵACNUMÿACDESCÿ
00000020:41 43 44 45 4c 50 41 49 c3 bf
ACDELPAIÿHEX representation UNICODE (crafted from mssql)
00000000:ff fe 31 00 09 00 31 00 09 00 31 00 09 00 41 00
ᅵᅵ1...1...1...A.
00000010:43 00 09 00 43 00 4f 00 4c 00 55 00 4d 00 4e 00
C...C.O.L.U.M.N.
00000020:ff 00 41 00 43 00 4e 00 55 00 4d 00 ff 00 41 00
ᅵ.A.C.N.U.M.ᅵ.A.
00000030:43 00 44 00 45 00 53 00 43 00 ff 00 41 00 43 00
C.D.E.S.C.ᅵ.A.C.
00000040:44 00 45 00 4c 00 50 00 41 00 49 00 ff 00
D.E.L.P.A.I.ᅵ.So postgresql bug with the FF FE that start the UNICODE document. Is
that normal UNICODE file starts with this FF FE ?! Note that I tried
to delete those character but they aren`t visible...So am I right ? Is Postgresql using UTF-8 and don`t really understand
UNICODE file (UCS-2)? Is there a way I can make the COPY command with
a UNICODE UCS-2 encoding
Yes, postgres Unicode means utf-8.
Windows programs which store unicode text are usually prepending
the file with a BOM (byte order mark) (just google for unicode
and BOM)
So you would need something to convert. For example I know python
can read BOM so it can be a matter of open it, read, encode it
into utf-8 and write it out again.
Regards
Tino Wildenhain
On Apr 6, 2005 10:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Gagnon <dgagnon@siunik.com> writes:
So am I right ? Is Postgresql using UTF-8 and don`t really understand
UNICODE file (UCS-2)? Is there a way I can make the COPY command with a
UNICODE UCS-2 encodingPostgres only supports UTF-8, not any other encoding of Unicode. Sorry.
You can use iconv to convert the file on a *nix-like system, and there
may even be a cygwin build if you're on Windows.
http://www.gnu.org/software/libiconv/
--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
Hi all,
Thanks for your reply on my yesterday's question regarding UTF-8 as a
UNICODE implementation in postgresql.
Is there a way to specify default values in the COPY command? In my
example example VDVSSRC and VDVSNUM are the same for the 1500000 rows of
the file. If it was possible to default their value instead of putting
it in the file, it would have been great!
COPY vd (VDNUM=DEFAULT, VDVSSRC=2, VDVSNUM=1, VDKEY, VDDATA)
FROM '/cygdrive/f/projects/web-catalog/var/vdOk.backup';
Thanks anyway .. your doing a wonderfull job with postgresql!
/David
David Gagnon <dgagnon@siunik.com> writes:
Is there a way to specify default values in the COPY command?
There's always ALTER TABLE ... SET DEFAULT.
regards, tom lane
Hi,
I want to default some columns for this COPY command only. So DEFAULT
is not appropriate in this case. In fact I think you suggested a
workaround right ? Doing an alter table before and another one after
will work ... not fully clean thought since my user may not have the
write to modify the table. Being able to specify default values
directly in the COPY command will be really nice :-) I just wanted to
share my tought :-)
Being eable to read zipped file will be cool to .. but I'm not sure it's
Chrismas yet right so I didn't ask ;-)
Thanks for your help
/David
Tom Lane wrote:
Show quoted text
David Gagnon <dgagnon@siunik.com> writes:
Is there a way to specify default values in the COPY command?
There's always ALTER TABLE ... SET DEFAULT.
regards, tom lane
On Apr 7, 2005, at 10:33 AM, David Gagnon wrote:
Hi,
I want to default some columns for this COPY command only. So DEFAULT
is not appropriate in this case. In fact I think you suggested a
workaround right ? Doing an alter table before and another one after
will work ... not fully clean thought since my user may not have the
write to modify the table. Being able to specify default values
directly in the COPY command will be really nice :-) I just wanted to
share my tought :-)Being eable to read zipped file will be cool to .. but I'm not sure
it's Chrismas yet right so I didn't ask ;-)Thanks for your help
/David
David,
If you use a simple wrapper around copy using DBI (as either an
untrusted perl database function) or straight perl, you could read
zipped files as a filehandle.
As for defaults, you could also write a trigger to accomplish the
"default" behavior or, if you use the above perl-based loader, you
could implement defaults as well....
Sean
Show quoted text
Tom Lane wrote:
David Gagnon <dgagnon@siunik.com> writes:
Is there a way to specify default values in the COPY command?
There's always ALTER TABLE ... SET DEFAULT.
regards, tom lane
---------------------------(end of
broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)