pgsql cannot read utf8 files moved from windows correctly!

Started by bookman bookman over 18 years ago · 6 messages · general

bookmanb@gmail.com

over 18 years ago

Hi,

I exported a table from SQL Server 2005 to a text file (it contains many
Chinese words). I saved it as an ANSI-encoded file, but I can't open it
in Ubuntu. I tried GBK, GB18030, and UTF8, and none of them could open it.

Then I saved it in Windows with UTF8 encoding, and now I can open it in
Ubuntu. I copied it into PostgreSQL, but the file could not be read
correctly. For example, here is a file:

--book.txt
bookid(int) bookname(varchar(30))
1 Java

I created a table "book" in PostgreSQL, then ran this command:
copy book from '/home/postgres/data/book.txt'
The error was:
error:invalid input syntax for integer:" 1";
context:line 1,column bookid
I know that in Windows every line of a UTF8 file starts with "fffe" or
"feff" and ends with "\r\n", but not in Linux, so the character "1" has
a space before it in the error line.

Is there any way I can transfer a UTF8 file from Windows to a Linux system?

Thank you!

Martijn van Oosterhout

kleptog@svana.org

over 18 years ago

In reply to: bookman bookman (#1)

Re: pgsql cannot read utf8 files moved from windows correctly!

On Tue, Dec 18, 2007 at 02:53:16PM +0800, bookman bookman wrote:

> I know that every line of utf8 files is started with "fffe" or "feff"
> and ended with "\r\n" in windows but not in linux,so the character
> "1" has a space before it in the error line.

Err, no. In UTF-16 files it is common to begin the *file* with that
character, but UTF-8 doesn't have that character anywhere, it's
illegal. Just stripping them out should be fine.
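
A minimal sketch of the "strip them out" step, assuming the file really was saved as UTF-8 on Windows. Python's `utf-8-sig` codec skips a leading BOM if one is present, and the replace call turns "\r\n" into "\n". The function name `clean_for_copy` and the file paths are hypothetical, not something from this thread.

```python
def clean_for_copy(src_path, dst_path):
    """Rewrite src_path as BOM-free UTF-8 with LF line endings."""
    # 'utf-8-sig' drops a leading U+FEFF if present, otherwise acts like 'utf-8'.
    # newline='' preserves the original line endings so we can normalize them here.
    with open(src_path, "r", encoding="utf-8-sig", newline="") as src:
        text = src.read()

    # Convert Windows line endings to Unix line endings.
    text = text.replace("\r\n", "\n")

    # newline='\n' prevents any platform-specific translation on write.
    with open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        dst.write(text)
```

Something like `clean_for_copy("book_windows.txt", "/home/postgres/data/book.txt")` would then produce a file that the COPY command from the original message can be pointed at.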

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy

Trevor Talbot

quension@gmail.com

over 18 years ago

In reply to: Martijn van Oosterhout (#2)

Re: pgsql cannot read utf8 files moved from windows correctly!

On 12/20/07, Martijn van Oosterhout <kleptog@svana.org> wrote:

> On Tue, Dec 18, 2007 at 02:53:16PM +0800, bookman bookman wrote:
>
> > I know that every line of utf8 files is started with "fffe" or "feff"
> > and ended with "\r\n" in windows but not in linux,so the character
> > "1" has a space before it in the error line.
>
> Err, no. In UTF-16 files it is common to begin the *file* with that
> character, but UTF-8 doesn't have that character anywhere, it's
> illegal. Just stripping them out should be fine.

A BOM is perfectly legal in UTF-8, and it's commonly used as a
signature to indicate the text is UTF-8 instead of another encoding.
But yes, it is at the beginning of the file only.

http://unicode.org/faq/utf_bom.html#29
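
To illustrate the point that the signature sits at the start of the file only, here is a small sketch. The sample bytes are an assumption (a BOM followed by a tab-separated "1 Java" row), not the poster's actual file. It shows that a plain UTF-8 decode leaves U+FEFF as the first character in front of "1", which may be what showed up as the stray leading character in the COPY error, although the thread does not confirm that.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if data starts with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical file content: UTF-8 BOM, then one tab-separated row with CRLF.
sample = codecs.BOM_UTF8 + "1\tJava\r\n".encode("utf-8")

# Plain 'utf-8' keeps the BOM as a character; 'utf-8-sig' removes it.
decoded_plain = sample.decode("utf-8")
decoded_sig = sample.decode("utf-8-sig")

print(has_utf8_bom(sample))
print(repr(decoded_plain[0]), repr(decoded_sig[0]))
```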

Martin Gainty

mgainty@hotmail.com

over 18 years ago

In reply to: bookman bookman (#1)

Re: pgsql cannot read utf8 files moved from windows correctly!

It seems the use of a BOM in UTF-8 is discouraged:
http://unicode.org/faq/utf_bom.html#BOM
FF FE is UTF-16 little-endian
FE FF is UTF-16 big-endian
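
The byte patterns above can be checked against a file's first bytes. This is a rough sketch using the BOM constants from Python's `codecs` module. The helper name `sniff_bom` is hypothetical, and it deliberately ignores UTF-32, whose little-endian BOM also begins with FF FE.

```python
import codecs

# Known signatures and the encoding each one suggests.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),                    # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16 little-endian"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16 big-endian"),     # FE FF
]


def sniff_bom(data: bytes):
    """Return the encoding name suggested by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

Reading the first few bytes of the file in binary mode and passing them to `sniff_bom` would show whether the file is UTF-8 with a signature or is actually UTF-16.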

Please verify-
Bedankt/
Martin-

brian

brian@zijn-digital.com

over 18 years ago

In reply to: bookman bookman (#1)

Re: pgsql cannot read utf8 files moved from windows correctly!

Not long ago I ran into a similar problem with UTF-8 and BOM. It turned
out that a client of mine had edited some files in an old version of
Homesite for Windows, which has a bit of an issue in this area:

http://kb.adobe.com/selfservice/viewContent.do?externalId=tn_19059&sliceId=1

Perhaps yours is a related problem?

brian

Trevor Talbot

quension@gmail.com

over 18 years ago

In reply to: Martin Gainty (#4)

Re: pgsql cannot read utf8 files moved from windows correctly!

On 12/23/07, Martin Gainty <mgainty@hotmail.com> wrote:

> it seems the use of BOM in UTF-8 is discouraged
> http://unicode.org/faq/utf_bom.html#BOM

Where do you see it being discouraged?