pgsql cannot read utf8 files moved from windows correctly!
Hi,
I exported a table from SQL Server 2005 to a text file (it contains many
Chinese words). I first saved the file with ANSI encoding, but I can't
open it in Ubuntu. I tried GBK, GB18030 and UTF8; none of them could
open it.
Then I saved it in Windows with UTF8 encoding, and that version opens in
Ubuntu. I copied it into PostgreSQL, but the file could not be read
correctly. For example, here is a file:
--book.txt
bookid(int) bookname(varchar(30))
1 Java
I created a table "book" in PostgreSQL, then ran this command:
copy book from '/home/postgres/data/book.txt'
The error was:
error:invalid input syntax for integer:" 1";
context:line 1,column bookid
I know that every line of a UTF8 file starts with "fffe" or "feff"
and ends with "\r\n" on Windows but not on Linux, so the character
"1" has a space before it in the error line.
Is there any way I can transfer a UTF8 file from Windows to a Linux system?
Thank you!
On Tue, Dec 18, 2007 at 02:53:16PM +0800, bookman bookman wrote:
I know that every line of utf8 files is started with "fffe" or "feff"
and ended with "\r\n" in windows but not in linux,so the character
"1" has a space before it in the error line.
Err, no. In UTF-16 files it is common to begin the *file* with that
character, but UTF-8 doesn't have that character anywhere, it's
illegal. Just stripping them out should be fine.
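For anyone who wants to try the stripping approach, here is a minimal Python sketch. It assumes the file starts with the three-byte UTF-8 signature (EF BB BF) and uses Windows "\r\n" line endings. The function name clean_for_copy and the tab-separated sample row are made up for illustration and do not come from the original post.

```python
import codecs


def clean_for_copy(data: bytes) -> bytes:
    """Drop a leading UTF-8 signature and turn CRLF line endings into LF."""
    # codecs.BOM_UTF8 is the byte string b"\xef\xbb\xbf".
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Windows line endings become plain "\n".
    return data.replace(b"\r\n", b"\n")


# Hypothetical sample row: signature, then "1<TAB>Java" with a CRLF ending.
sample = codecs.BOM_UTF8 + b"1\tJava\r\n"
cleaned = clean_for_copy(sample)
print(cleaned)
```

One way to use it would be to write the cleaned bytes to a new file and point the copy command at that file instead.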
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy
On 12/20/07, Martijn van Oosterhout <kleptog@svana.org> wrote:
On Tue, Dec 18, 2007 at 02:53:16PM +0800, bookman bookman wrote:
I know that every line of utf8 files is started with "fffe" or "feff"
and ended with "\r\n" in windows but not in linux,so the character
"1" has a space before it in the error line.
Err, no. In UTF-16 files it is common to begin the *file* with that
character, but UTF-8 doesn't have that character anywhere, it's
illegal. Just stripping them out should be fine.
A BOM is perfectly legal in UTF-8, and it's commonly used as a
signature to indicate the text is UTF-8 instead of another encoding.
But yes, it is at the beginning of the file only.
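To illustrate the signature sitting only at the start of the data, the sketch below decodes the same bytes with Python's "utf-8" and "utf-8-sig" codecs. The sample row is hypothetical. With plain "utf-8" the signature survives as the invisible character U+FEFF in front of the "1", which would be consistent with the odd-looking value in the reported error. With "utf-8-sig" it is dropped.

```python
# Hypothetical file content: UTF-8 signature, then one tab-separated row.
raw = b"\xef\xbb\xbf1\tJava\n"

# Plain UTF-8 decoding keeps the signature as U+FEFF at the start.
plain = raw.decode("utf-8")

# The "utf-8-sig" codec removes a signature found at the very start.
signed = raw.decode("utf-8-sig")

print(repr(plain))
print(repr(signed))
```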
It seems the use of a BOM in UTF-8 is discouraged:
http://unicode.org/faq/utf_bom.html#BOM
FF FE is UTF16-Little Endian
FE FF is UTF16-Big Endian
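A quick way to check which of these a given file starts with is to compare its first bytes against the known signatures. This is only a sketch: the helper name detect_bom is made up, and it ignores UTF-32, whose little-endian signature also begins with FF FE.

```python
import codecs

# Signatures to look for, paired with a label for each.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def detect_bom(data: bytes):
    """Return a label for the signature at the start of data, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbf1\tJava\n"))
print(detect_bom(b"1\tJava\n"))
```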
Please verify.
Thanks,
Martin
----- Original Message -----
From: "Trevor Talbot" <quension@gmail.com>
To: <pgsql-general@postgresql.org>
Sent: Sunday, December 23, 2007 10:39 AM
Subject: Re: [GENERAL] pgsql cannot read utf8 files moved from windows
correctly!
On 12/20/07, Martijn van Oosterhout <kleptog@svana.org> wrote:
On Tue, Dec 18, 2007 at 02:53:16PM +0800, bookman bookman wrote:
I know that every line of utf8 files is started with "fffe" or "feff"
and ended with "\r\n" in windows but not in linux,so the character
"1" has a space before it in the error line.
Err, no. In UTF-16 files it is common to begin the *file* with that
character, but UTF-8 doesn't have that character anywhere, it's
illegal. Just stripping them out should be fine.
A BOM is perfectly legal in UTF-8, and it's commonly used as a
signature to indicate the text is UTF-8 instead of another encoding.
But yes, it is at the beginning of the file only.
http://unicode.org/faq/utf_bom.html#29
bookman bookman wrote:
Hi,
I copied a table in sqlserver2005 to a txt file(There were many
chinese words in it).I saved it as a file encoded by ANSI,but I cant
open it in ubuntu.I tried GBK,GB18030,
UTF8,It just could not be opened.
Then I save it in windows with encoding UTF8,then I can open it in
ubuntu.I copied it to postgresql,but the file could not be read
correctly.For example,here is a file:
--book.txt
bookid(int) bookname(varchar(30))
1 Java
I created a table "book" in postgre,then I input the command line:
copy book from '/home/postgres/data/book.txt'
The error was:
error:invalid input syntax for integer:" 1";
context:line 1,column bookid
I know that every line of utf8 files is started with "fffe" or "feff"
and ended with "\r\n" in windows but not in linux,so the character
"1" has a space before it in the error line.
Not long ago I ran into a similar problem with UTF-8 and a BOM. It turned
out that a client of mine had edited some files in an old version of
Homesite for Windows, which has a bit of an issue in this area:
http://kb.adobe.com/selfservice/viewContent.do?externalId=tn_19059&sliceId=1
Perhaps yours is a related problem?
brian
On 12/23/07, Martin Gainty <mgainty@hotmail.com> wrote:
it seems the use of BOM in UTF-8 is discouraged
http://unicode.org/faq/utf_bom.html#BOM
Where do you see it being discouraged?