Support UTF-8 files with BOM in COPY FROM

Started by Itagaki Takahiro over 14 years ago | 28 messages | hackers
#1Itagaki Takahiro
itagaki.takahiro@gmail.com

Hi,

I'd like to support UTF-8 text or csv files that have a BOM (byte order mark)
in COPY FROM command. BOM will be automatically detected and ignored
if the file encoding is UTF-8. WIP patch attached.

I'm thinking about only COPY FROM for reads, but if someone wants to add
BOM in COPY TO, we might also support COPY TO WITH BOM for writes.

Comments welcome.

--
Itagaki Takahiro

Attachments:

copy_from_bom.patch (application/octet-stream, +12 -0)
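
A minimal sketch of the behavior proposed in this message, written in Python rather than the C of the attached patch (whose contents are not shown in this thread). The function name `strip_utf8_bom` is hypothetical. It assumes the whole input is available as bytes and that the encoding was declared explicitly by the user, as COPY requires:

```python
import codecs


def strip_utf8_bom(data: bytes, encoding: str) -> bytes:
    """Drop a leading UTF-8 BOM, but only when the declared encoding is UTF-8.

    Hypothetical helper for illustration; not the code from the patch.
    """
    # Normalize spellings such as "UTF8" or "utf_8" to the codec's canonical name.
    normalized = codecs.lookup(encoding).name
    if normalized == "utf-8" and data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    # Any other encoding: leave the bytes alone, they may be ordinary characters.
    return data
```

Called as `strip_utf8_bom(b"\xef\xbb\xbfa,b\n", "UTF8")`, this returns the input without its first three bytes; with any other declared encoding the input comes back unchanged.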
#2David E. Wheeler
david@kineticode.com
In reply to: Itagaki Takahiro (#1)
Re: Support UTF-8 files with BOM in COPY FROM

On Sep 25, 2011, at 9:58 PM, Itagaki Takahiro wrote:

I'd like to support UTF-8 text or csv files that has BOM (byte order mark)
in COPY FROM command. BOM will be automatically detected and ignored
if the file encoding is UTF-8. WIP patch attached.

By my reading of http://unicode.org/faq/utf_bom.html#bom5, I'd say +1

So I think what you propose makes sense.

I'm thinking about only COPY FROM for reads, but if someone wants to add
BOM in COPY TO, we might also support COPY TO WITH BOM for writes.

I think it would have to be optional, since "some recipients of UTF-8 encoded data do not expect a BOM."

Best,

David

#3Magnus Hagander
magnus@hagander.net
In reply to: Itagaki Takahiro (#1)
Re: Support UTF-8 files with BOM in COPY FROM

On Mon, Sep 26, 2011 at 06:58, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:

Hi,

I'd like to support UTF-8 text or csv files that has BOM (byte order mark)
in COPY FROM command. BOM will be automatically detected and ignored
if the file encoding is UTF-8. WIP patch attached.

I'm thinking about only COPY FROM for reads, but if someone wants to add
BOM in COPY TO, we might also support COPY TO WITH BOM for writes.

Comments welcome.

I like it in general. But if we're looking at the BOM, shouldn't we
also look and *reject* the file if it's a BOM for a non-UTF8 file? Say
if the BOM claims it's UTF16?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

#4Itagaki Takahiro
itagaki.takahiro@gmail.com
In reply to: Magnus Hagander (#3)
Re: Support UTF-8 files with BOM in COPY FROM

On Mon, Sep 26, 2011 at 20:12, Magnus Hagander <magnus@hagander.net> wrote:

I like it in general. But if we're looking at the BOM, shouldn't we
also look and *reject* the file if it's a BOM for a non-UTF8 file? Say
if the BOM claims it's UTF16?

-1 because we're depending on manual configuration for now.
It would be reasonable if we had used automatic detection of
character encoding, but we don't. In addition, some crazy
encoding might use BOM codes as a valid character.

--
Itagaki Takahiro

#5Magnus Hagander
magnus@hagander.net
In reply to: Itagaki Takahiro (#4)
Re: Support UTF-8 files with BOM in COPY FROM

On Mon, Sep 26, 2011 at 13:36, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:

On Mon, Sep 26, 2011 at 20:12, Magnus Hagander <magnus@hagander.net> wrote:

I like it in general. But if we're looking at the BOM, shouldn't we
also look and *reject* the file if it's a BOM for a non-UTF8 file? Say
if the BOM claims it's UTF16?

-1 because we're depending on manual configuration for now.
It would be reasonable if we had used automatic detection of
character encoding, but we don't. In addition, some crazy
encoding might use BOM codes as a valid character.

Does such an encoding really exist? And the code only executes when
the user thinks he's in UTF8, right? So it would still only happen if
the incorrect encoding was specified.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

#6Andrew Dunstan
andrew@dunslane.net
In reply to: Magnus Hagander (#3)
Re: Support UTF-8 files with BOM in COPY FROM

On 09/26/2011 07:12 AM, Magnus Hagander wrote:

On Mon, Sep 26, 2011 at 06:58, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:

Hi,

I'd like to support UTF-8 text or csv files that has BOM (byte order mark)
in COPY FROM command. BOM will be automatically detected and ignored
if the file encoding is UTF-8. WIP patch attached.

I'm thinking about only COPY FROM for reads, but if someone wants to add
BOM in COPY TO, we might also support COPY TO WITH BOM for writes.

Comments welcome.

I like it in general. But if we're looking at the BOM, shouldn't we
also look and *reject* the file if it's a BOM for a non-UTF8 file? Say
if the BOM claims it's UTF16?

It should be rejected as invalidly encoded anyway, as a non-utf8 BOM is
not valid utf-8. We shouldn't check in non-unicode cases where the
sequence might be valid in those encodings (e.g. ISO-8859-1).

cheers

andrew
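
Both halves of this point can be checked with a short Python sketch. It uses Python's own codecs as a stand-in for the server's encoding validation, which is an assumption; the helper name `decodes_as` is hypothetical:

```python
import codecs


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# A UTF-16 BOM (FF FE or FE FF) contains bytes that can never occur in UTF-8,
# so ordinary UTF-8 validation already rejects such a file.
utf16_le_rejected = not decodes_as(codecs.BOM_UTF16_LE + b"a\x00", "utf-8")
utf16_be_rejected = not decodes_as(codecs.BOM_UTF16_BE + b"\x00a", "utf-8")

# The UTF-8 BOM bytes (EF BB BF) are three ordinary characters in ISO-8859-1,
# so checking for them there could discard legitimate data.
latin1_text = codecs.BOM_UTF8.decode("iso-8859-1")
```

Here both `utf16_le_rejected` and `utf16_be_rejected` are true, and `latin1_text` is a three-character string.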

#7Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Itagaki Takahiro (#1)
Re: Support UTF-8 files with BOM in COPY FROM

I'd like to support UTF-8 text or csv files that has BOM (byte order mark)
in COPY FROM command. BOM will be automatically detected and ignored
if the file encoding is UTF-8. WIP patch attached.

From RFC3629(http://tools.ietf.org/html/rfc3629#section-6):

o A protocol SHOULD forbid use of U+FEFF as a signature for those
textual protocol elements that the protocol mandates to be always
UTF-8, the signature function being totally useless in those cases.

COPY explicitly specifies the encoding (to be UTF-8 in this case). So
I think we should not regard U+FEFF as "BOM" in COPY, rather we should
regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#2)
Re: Support UTF-8 files with BOM in COPY FROM

"David E. Wheeler" <david@kineticode.com> writes:

On Sep 25, 2011, at 9:58 PM, Itagaki Takahiro wrote:

I'm thinking about only COPY FROM for reads, but if someone wants to add
BOM in COPY TO, we might also support COPY TO WITH BOM for writes.

I think it would have to be optional, since "some recipients of UTF-8 encoded data do not expect a BOM."

Putting a BOM into UTF8 data is flat out invalid per spec --- the fact
that Microsloth does it does not make it standards-conformant.

I think that accepting it on input can be sensible, on the principle of
"be liberal in what you accept", but the other side of that is "be
conservative in what you send". No BOMs in output, please.

regards, tom lane

#9Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#8)
Re: Support UTF-8 files with BOM in COPY FROM

"David E. Wheeler" <david@kineticode.com> writes:

On Sep 25, 2011, at 9:58 PM, Itagaki Takahiro wrote:

I'm thinking about only COPY FROM for reads, but if someone wants to add
BOM in COPY TO, we might also support COPY TO WITH BOM for writes.

I think it would have to be optional, since "some recipients of UTF-8 encoded data do not expect a BOM."

Putting a BOM into UTF8 data is flat out invalid per spec --- the fact
that Microsloth does it does not make it standards-conformant.

I think that accepting it on input can be sensible, on the principle of
"be liberal in what you accept", but the other side of that is "be
conservative in what you send". No BOMs in output, please.

Suppose a user uses brain-dead editor, which does not accept UTF-8
without BOM. He decides to save his editor data into PostgreSQL using
COPY FROM. He extracts the data using COPY TO. Now he finds that his
stupid editor does not accept his data any more.

So I think if we decide to accept UTF-8 with BOM, we should keep BOM
when importing the data and output the data with BOM. If we don't want
to output UTF-8 with BOM, we should not accept UTF-8 with BOM. It
seems we don't have much choice...
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

#10Robert Haas
robertmhaas@gmail.com
In reply to: Tatsuo Ishii (#9)
Re: Support UTF-8 files with BOM in COPY FROM

On Mon, Sep 26, 2011 at 11:09 AM, Tatsuo Ishii <ishii@postgresql.org> wrote:

"David E. Wheeler" <david@kineticode.com> writes:

On Sep 25, 2011, at 9:58 PM, Itagaki Takahiro wrote:

I'm thinking about only COPY FROM for reads, but if someone wants to add
BOM in COPY TO, we might also support COPY TO WITH BOM for writes.

I think it would have to be optional, since "some recipients of UTF-8 encoded data do not expect a BOM."

Putting a BOM into UTF8 data is flat out invalid per spec --- the fact
that Microsloth does it does not make it standards-conformant.

I think that accepting it on input can be sensible, on the principle of
"be liberal in what you accept", but the other side of that is "be
conservative in what you send".  No BOMs in output, please.

Suppose a user uses brain-dead editor, which does not accept UTF-8
without BOM.  He decides to save his editor data into PostgreSQL using
COPY FROM. He extracts the data using COPY TO. Now he finds that his
stupid editor does not accept his data any more.

So I think if we decide to accept UTF-8 with BOM, we should keep BOM
when importing the data and output the data with BOM. If we don't want
to output UTF-8 with BOM, we should not accept UTF-8 with BOM. It
seems we don't have much choice...

Maybe this needs to be an optional behavior, controlled by some COPY option.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#10)
Re: Support UTF-8 files with BOM in COPY FROM

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Sep 26, 2011 at 11:09 AM, Tatsuo Ishii <ishii@postgresql.org> wrote:

Suppose a user uses brain-dead editor, which does not accept UTF-8
without BOM.

Maybe this needs to be an optional behavior, controlled by some COPY option.

I'm not excited about emitting non-standards-conformant output on the
strength of a hypothetical argument about users and editors that may or
may not exist. I believe that there's a use-case for reading BOMs, but
I have seen no field complaints demonstrating that we need to write
them. Even if we had a couple, "use a less brain dead editor" might be
the best response. We cannot promise to be compatible with arbitrarily
broken software.

regards, tom lane

#12Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#11)
Re: Support UTF-8 files with BOM in COPY FROM

On Mon, Sep 26, 2011 at 1:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Sep 26, 2011 at 11:09 AM, Tatsuo Ishii <ishii@postgresql.org> wrote:

Suppose a user uses brain-dead editor, which does not accept UTF-8
without BOM.

Maybe this needs to be an optional behavior, controlled by some COPY option.

I'm not excited about emitting non-standards-conformant output on the
strength of a hypothetical argument about users and editors that may or
may not exist.  I believe that there's a use-case for reading BOMs, but
I have seen no field complaints demonstrating that we need to write
them.  Even if we had a couple, "use a less brain dead editor" might be
the best response.  We cannot promise to be compatible with arbitrarily
broken software.

The thing that makes me doubt that is this comment from Tatsuo Ishii:

TI> COPY explicitly specifies the encoding (to be UTF-8 in this case). So
TI> I think we should not regard U+FEFF as "BOM" in COPY, rather we should
TI> regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".

If a BOM is confusable with valid data, then I think recognizing it
and discarding it unconditionally is no good - you could end up where
COPY OUT, TRUNCATE, COPY IN changes the table contents.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
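
The round-trip hazard described above can be simulated in a few lines of Python. This is only a sketch: `copy_out` and `copy_in` are hypothetical stand-ins for COPY TO and COPY FROM on a trivial comma-separated format, and Python's `utf-8-sig` codec plays the role of unconditional BOM skipping:

```python
def copy_out(rows: list[list[str]]) -> bytes:
    """Stand-in for COPY TO: serialize rows as UTF-8 text, no BOM added."""
    return "".join(",".join(row) + "\n" for row in rows).encode("utf-8")


def copy_in(data: bytes, skip_bom: bool) -> list[list[str]]:
    """Stand-in for COPY FROM; skip_bom=True drops a leading U+FEFF."""
    # "utf-8-sig" silently removes one leading BOM; plain "utf-8" keeps it.
    text = data.decode("utf-8-sig" if skip_bom else "utf-8")
    return [line.split(",") for line in text.splitlines()]


# Row 1, column 1 legitimately starts with U+FEFF (ZERO WIDTH NO-BREAK SPACE).
table = [["\ufeffxml-doc", "1"], ["plain", "2"]]
dumped = copy_out(table)

kept = copy_in(dumped, skip_bom=False)    # identical to the original table
changed = copy_in(dumped, skip_bom=True)  # first field has lost its U+FEFF
```

With `skip_bom=False` the reloaded rows equal `table`; with `skip_bom=True` the first field comes back as `"xml-doc"`, so the table contents have changed across the dump and reload.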

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#12)
Re: Support UTF-8 files with BOM in COPY FROM

Robert Haas <robertmhaas@gmail.com> writes:

The thing that makes me doubt that is this comment from Tatsuo Ishii:
TI> COPY explicitly specifies the encoding (to be UTF-8 in this case). So
TI> I think we should not regard U+FEFF as "BOM" in COPY, rather we should
TI> regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".

Yeah, that's a reasonable argument for rejecting the patch altogether.
I'm not qualified to decide whether it outweighs the "we need to be able
to read Notepad output" argument. I do observe that
http://en.wikipedia.org/wiki/Byte_order_mark
says Unicode 3.2 has deprecated the no-break-space interpretation,
but on the other hand you're right that we can't really assume that
the character is not present in people's data.

regards, tom lane

#14Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#13)
Re: Support UTF-8 files with BOM in COPY FROM

On Mon, Sep 26, 2011 at 1:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

The thing that makes me doubt that is this comment from Tatsuo Ishii:
TI> COPY explicitly specifies the encoding (to be UTF-8 in this case).  So
TI> I think we should not regard U+FEFF as "BOM" in COPY, rather we should
TI> regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".

Yeah, that's a reasonable argument for rejecting the patch altogether.

Yeah, or for making the behavior optional.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#14)
Re: Support UTF-8 files with BOM in COPY FROM

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Sep 26, 2011 at 1:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

The thing that makes me doubt that is this comment from Tatsuo Ishii:
TI> COPY explicitly specifies the encoding (to be UTF-8 in this case). So
TI> I think we should not regard U+FEFF as "BOM" in COPY, rather we should
TI> regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".

Yeah, that's a reasonable argument for rejecting the patch altogether.

Yeah, or for making the behavior optional.

Sorry, I should have been clearer: it's an argument for rejecting *this*
patch. A patch that introduced a "BOM" option for COPY (which logically
could apply just as well to input or output) would be a different patch.

BTW, another issue with the patch-as-proposed is that it assumes,
without even checking, that fseek() will work (for that matter, it would
also fail pretty miserably on a file shorter than 3 bytes). We could
dodge that problem with an option since it would be reasonable to define
the option as meaning that there MUST be a BOM there. I would envision
it as acting much like the CSV HEADER option.

regards, tom lane
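
A sketch of how such an option could avoid both problems, in Python for brevity. The function name `open_copy_input` and the `bom` parameter are hypothetical; no such COPY option exists in this thread beyond the suggestion above. Because the option means a BOM must be present, the three bytes can simply be consumed, so nothing ever has to be pushed back with a seek, and a short file becomes an ordinary error:

```python
import io
from typing import BinaryIO

UTF8_BOM = b"\xef\xbb\xbf"


def open_copy_input(stream: BinaryIO, bom: bool = False) -> BinaryIO:
    """Prepare an input stream for a hypothetical COPY ... BOM option.

    With bom=True the input MUST start with a UTF-8 BOM, which is consumed
    (similar in spirit to how HEADER consumes the first line). With bom=False
    the stream is returned untouched, so a leading U+FEFF stays as data.
    """
    if not bom:
        return stream
    head = b""
    # Read up to 3 bytes; pipes may return fewer bytes than requested.
    while len(head) < len(UTF8_BOM):
        chunk = stream.read(len(UTF8_BOM) - len(head))
        if not chunk:
            break  # end of input: file is shorter than 3 bytes
        head += chunk
    if head != UTF8_BOM:
        raise ValueError("BOM option given, but input does not start with a UTF-8 BOM")
    return stream
```

For example, `open_copy_input(io.BytesIO(b"\xef\xbb\xbfa,b\n"), bom=True).read()` yields only the data after the mark, while a one-byte input with `bom=True` raises `ValueError` instead of failing in some less obvious way.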

#16Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#9)
Re: Support UTF-8 files with BOM in COPY FROM

On tis, 2011-09-27 at 00:09 +0900, Tatsuo Ishii wrote:

Suppose a user uses brain-dead editor, which does not accept UTF-8
without BOM.

I would first like to see evidence that such an editor exists.

#17Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#12)
Re: Support UTF-8 files with BOM in COPY FROM

On mån, 2011-09-26 at 13:19 -0400, Robert Haas wrote:

The thing that makes me doubt that is this comment from Tatsuo Ishii:

TI> COPY explicitly specifies the encoding (to be UTF-8 in this case). So
TI> I think we should not regard U+FEFF as "BOM" in COPY, rather we should
TI> regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".

If a BOM is confusable with valid data, then I think recognizing it
and discarding it unconditionally is no good - you could end up where
COPY OUT, TRUNCATE, COPY IN changes the table contents.

We did recently accept a patch for psql -f to skip over a UTF-8
byte-order mark. We had a lot of this same discussion there.

#18Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#17)
Re: Support UTF-8 files with BOM in COPY FROM

On Mon, Sep 26, 2011 at 2:38 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

On mån, 2011-09-26 at 13:19 -0400, Robert Haas wrote:

The thing that makes me doubt that is this comment from Tatsuo Ishii:

TI> COPY explicitly specifies the encoding (to be UTF-8 in this case). So
TI> I think we should not regard U+FEFF as "BOM" in COPY, rather we should
TI> regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".

If a BOM is confusable with valid data, then I think recognizing it
and discarding it unconditionally is no good - you could end up where
COPY OUT, TRUNCATE, COPY IN changes the table contents.

We did recently accept a patch for psql -f to skip over a UTF-8
byte-order mark.  We had a lot of this same discussion there.

But that case is different, because zero-width, non-breaking space has
no particular meaning in an SQL script - it's either going to be
ignored as a BOM, ignored as whitespace, or an error. But inside a
file being subjected to COPY it might be confusable with data that the
user wanted to end up in some table.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Andrew Dunstan
andrew@dunslane.net
In reply to: Peter Eisentraut (#17)
Re: Support UTF-8 files with BOM in COPY FROM

On 09/26/2011 02:38 PM, Peter Eisentraut wrote:

On mån, 2011-09-26 at 13:19 -0400, Robert Haas wrote:

The thing that makes me doubt that is this comment from Tatsuo Ishii:

TI> COPY explicitly specifies the encoding (to be UTF-8 in this case). So
TI> I think we should not regard U+FEFF as "BOM" in COPY, rather we should
TI> regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".

If a BOM is confusable with valid data, then I think recognizing it
and discarding it unconditionally is no good - you could end up where
COPY OUT, TRUNCATE, COPY IN changes the table contents.

We did recently accept a patch for psql -f to skip over a UTF-8
byte-order mark. We had a lot of this same discussion there.

Yes, but wasn't part of the rationale that this was safe because a
leading BOM could not possibly be mistaken for anything else legitimate
in an SQL source file? That's quite different from a data file, ISTM.

cheers

andrew

#20Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#18)
Re: Support UTF-8 files with BOM in COPY FROM

On mån, 2011-09-26 at 14:44 -0400, Robert Haas wrote:

We did recently accept a patch for psql -f to skip over a UTF-8
byte-order mark. We had a lot of this same discussion there.

But that case is different, because zero-width, non-breaking space has
no particular meaning in an SQL script - it's either going to be
ignored as a BOM, ignored as whitespace, or an error. But inside a
file being subjected to COPY it might be confusable with data that the
user wanted to end up in some table.

Yes, my point was more directed toward the discussion about whether BOM
in UTF-8 are valid at all. But your point pretty much kills this
altogether. If I store a BOM in row 1, column 1 of my table, because,
well, maybe it's an XML document or something, then it needs to be able
to survive a copy out and in. The only way we could proceed with this
would be if we prohibited BOMs in all user-data.

#21Brar Piening
brar@gmx.de
In reply to: Tom Lane (#8)
#22Brar Piening
brar@gmx.de
In reply to: Tom Lane (#13)
#23Brar Piening
brar@gmx.de
In reply to: Robert Haas (#12)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Brar Piening (#23)
#25Brar Piening
brar@gmx.de
In reply to: Tom Lane (#24)
#26Brar Piening
brar@gmx.de
In reply to: Brar Piening (#25)
#27Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#20)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#27)