UTF8 with BOM support in psql

Started by ITAGAKI Takahiroover 16 years ago48 messageshackers
Jump to latest
#1ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp

UTF8 encoding text files with BOM (Byte Order Mark) are commonly
used in Windows, though BOM was designed for UTF16 text originally.
However, psql cannot read such format even if we set client encoding
to UTF8. Is it worth supporting those format in psql?

When psql opens a file with -f or \i, it checks first 3 bytes of the
file. If they are BOM, discard the 3 bytes and change client encoding
to UTF8 automatically.

Is this change reasonable? Comments welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#2Bruce Momjian
bruce@momjian.us
In reply to: ITAGAKI Takahiro (#1)
Re: UTF8 with BOM support in psql

Itagaki Takahiro wrote:

UTF8 encoding text files with BOM (Byte Order Mark) are commonly
used in Windows, though BOM was designed for UTF16 text originally.
However, psql cannot read such format even if we set client encoding
to UTF8. Is it worth supporting those format in psql?

When psql opens a file with -f or \i, it checks first 3 bytes of the
file. If they are BOM, discard the 3 bytes and change client encoding
to UTF8 automatically.

Is this change reasonable? Comments welcome.

Seems there is community support for accepting BOM:

http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php

Should I add this as a TODO item?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#3ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Bruce Momjian (#2)
Re: UTF8 with BOM support in psql

Bruce Momjian <bruce@momjian.us> wrote:

Itagaki Takahiro wrote:

When psql opens a file with -f or \i, it checks first 3 bytes of the
file. If they are BOM, discard the 3 bytes and change client encoding
to UTF8 automatically.

Seems there is community support for accepting BOM:
http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php

Thank yor for information.
I read the thread that we discussed about BOM handling in *data types*.
I agree the decision in the thead that we should not skip BOM characters,
but we can handle BOM in a different way in the head of *files* for psql
and COPY input.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#4Peter Eisentraut
peter_e@gmx.net
In reply to: ITAGAKI Takahiro (#1)
Re: UTF8 with BOM support in psql

On Tue, 2009-10-20 at 14:41 +0900, Itagaki Takahiro wrote:

UTF8 encoding text files with BOM (Byte Order Mark) are commonly
used in Windows, though BOM was designed for UTF16 text originally.
However, psql cannot read such format even if we set client encoding
to UTF8. Is it worth supporting those format in psql?

psql doesn't have a problem, but the backend's lexer doesn't parse the
BOM as whitespace. Since the lexer is byte-based, it will presumably
have problems with anything outside of ASCII that Unicode considers
whitespace.

When psql opens a file with -f or \i, it checks first 3 bytes of the
file. If they are BOM, discard the 3 bytes and change client encoding
to UTF8 automatically.

While I see that the Unicode standard supports using a UTF-8 encoded BOM
as UTF-8 signature, I wonder if those bytes can usefully appear in a
leading position in other encodings.

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#2)
Re: UTF8 with BOM support in psql

Bruce Momjian <bruce@momjian.us> writes:

Seems there is community support for accepting BOM:
http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php

That discussion has approximately nothing to do with the
much-more-invasive change that Itagaki-san is suggesting.

In particular I think an automatic change of client_encoding isn't
particularly a good idea --- wouldn't you have to change it back later,
and is there any possibility of creating a security issue from such
behavior? Remember that client_encoding *IS* tied to security issues
such as backslash escape handling.

regards, tom lane

#6Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#5)
Re: UTF8 with BOM support in psql

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Seems there is community support for accepting BOM:
http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php

That discussion has approximately nothing to do with the
much-more-invasive change that Itagaki-san is suggesting.

In particular I think an automatic change of client_encoding isn't
particularly a good idea --- wouldn't you have to change it back later,
and is there any possibility of creating a security issue from such
behavior? Remember that client_encoding *IS* tied to security issues
such as backslash escape handling.

Yeah, I don't think we should be second-guessing the user on the encoding.

What I think we might sensibly do is to eat the leading BOM of an SQL
file iff the client encoding is UTF8, and otherwise treat it as just
bytes in whatever the encoding is.

Should we also do the same for files passed via \copy? What about
streams on stdin? What about files read from the backend via COPY?

cheers

andrew

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#6)
Re: UTF8 with BOM support in psql

Andrew Dunstan <andrew@dunslane.net> writes:

What I think we might sensibly do is to eat the leading BOM of an SQL
file iff the client encoding is UTF8, and otherwise treat it as just
bytes in whatever the encoding is.

That seems relatively non-risky.

Should we also do the same for files passed via \copy? What about
streams on stdin? What about files read from the backend via COPY?

Not thrilled about doing this on stdin --- you have no good
justification for assuming that start of stdin corresponds to a file
boundary somewhere. COPY files, maybe.

regards, tom lane

#8Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Andrew Dunstan (#6)
Re: UTF8 with BOM support in psql

Andrew Dunstan <andrew@dunslane.net> wrote:

What I think we might sensibly do is to eat the leading BOM of an
SQL file iff the client encoding is UTF8, and otherwise treat it as
just bytes in whatever the encoding is.

Only at the beginning of the file or stream? What happens when people
concatenate files? Would it make sense to treat BOM as whitespace in
UTF-8, or maybe ignore it entirely?

-Kevin

#9Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#7)
Re: UTF8 with BOM support in psql

2009/10/20 Tom Lane <tgl@sss.pgh.pa.us>:

Andrew Dunstan <andrew@dunslane.net> writes:

What I think we might sensibly do is to eat the leading BOM of an SQL
file iff the client encoding is UTF8, and otherwise treat it as just
bytes in whatever the encoding is.

That seems relatively non-risky.

+1.

Should we also do the same for files passed via \copy? What about
streams on stdin? What about files read from the backend via COPY?

Not thrilled about doing this on stdin --- you have no good
justification for assuming that start of stdin corresponds to a file
boundary somewhere.  COPY files, maybe.

Yeah, that seems a lot more error-prone.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#10David Christensen
david@endpoint.com
In reply to: Tom Lane (#7)
Re: UTF8 with BOM support in psql

On Oct 20, 2009, at 10:51 AM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

What I think we might sensibly do is to eat the leading BOM of an SQL
file iff the client encoding is UTF8, and otherwise treat it as just
bytes in whatever the encoding is.

That seems relatively non-risky.

Is that only when the default client encoding is set to UTF8
(PGCLIENTENCODING, whatever), or will it be coded to work with the
following:

$ PGCLIENTENCODING=...nonutf8...
$ psql -f <file>

Where <file> is:
<BOM>
...

SET CLIENT ENCODING 'utf8';

...
EOF

Regards,

David
--
David Christensen
End Point Corporation
david@endpoint.com

#11ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: David Christensen (#10)
Re: UTF8 with BOM support in psql

David Christensen <david@endpoint.com> wrote:

Is that only when the default client encoding is set to UTF8
(PGCLIENTENCODING, whatever), or will it be coded to work with the
following:

$ psql -f <file>
Where <file> is:
<BOM>
SET CLIENT ENCODING 'utf8';

Sure. Client encoding is declared in body of a file, but BOM is
in head of the file. So, we should always ignore BOM sequence
at the file head no matter what client encoding is used.

The attached patch replace BOM with while spaces, but it does not
change client encoding automatically. I think we can always ignore
client encoding at the replacement because SQL command cannot start
with BOM sequence. If we don't ignore the sequence, execution of
the script must fail with syntax error.

This patch does nothing about COPY and \copy commands. It might be
possible to add BOM handling code around AllocateFile() in CopyFrom()
to support "COPY FROM 'utf8file-with-bom.tsv'", but we need another
approach for "COPY FROM STDIN". For example,
$ echo utf8bom-1.tsv utf8bom-2.tsv | psql -c "COPY FROM STDIN"
might contain BOM sequence in the middle of input stream.
Anyway, those changes would come from another patch in the future.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachments:

psql-utf8bom_20091021.patchapplication/octet-stream; name=psql-utf8bom_20091021.patchDownload+12-12
#12Peter Eisentraut
peter_e@gmx.net
In reply to: ITAGAKI Takahiro (#11)
Re: UTF8 with BOM support in psql

On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:

The attached patch replace BOM with while spaces, but it does not
change client encoding automatically. I think we can always ignore
client encoding at the replacement because SQL command cannot start
with BOM sequence. If we don't ignore the sequence, execution of
the script must fail with syntax error.

I feel that psql is the wrong place to fix this. BOMs in UTF-8 should
be ignored everywhere, all the time.

#13Andrew Dunstan
andrew@dunslane.net
In reply to: Peter Eisentraut (#12)
Re: UTF8 with BOM support in psql

Peter Eisentraut wrote:

On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:

The attached patch replace BOM with while spaces, but it does not
change client encoding automatically. I think we can always ignore
client encoding at the replacement because SQL command cannot start
with BOM sequence. If we don't ignore the sequence, execution of
the script must fail with syntax error.

I feel that psql is the wrong place to fix this. BOMs in UTF-8 should
be ignored everywhere, all the time.

I suggest you re-read the Unicode FAQ on the subject. That is not the
conclusion I came to after I read it. Quite the reverse in fact.

cheers

andrew

#14Peter Eisentraut
peter_e@gmx.net
In reply to: ITAGAKI Takahiro (#11)
Re: UTF8 with BOM support in psql

On ons, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:

So, we should always ignore BOM sequence
at the file head no matter what client encoding is used.

I think we can't do that. That byte sequence might be valid user data
in other encodings.

#15Peter Eisentraut
peter_e@gmx.net
In reply to: ITAGAKI Takahiro (#11)
Re: UTF8 with BOM support in psql

On ons, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:

Client encoding is declared in body of a file, but BOM is
in head of the file. So, we should always ignore BOM sequence
at the file head no matter what client encoding is used.

The attached patch replace BOM with while spaces, but it does not
change client encoding automatically. I think we can always ignore
client encoding at the replacement because SQL command cannot start
with BOM sequence. If we don't ignore the sequence, execution of
the script must fail with syntax error.

I don't know what the best solution is here. The BOM encoded as UTF-8
is valid data in other encodings. Of course, there is your point that
such data cannot be at the start of an SQL command.

There is also the notion of how files are handled on Unix. Normally,
you'd assume that all of

psql -f file.sql
psql < file.sql
cat file.sql | psql
cat file1.sql file2.sql | psql

behave consistently. That would require that the BOM is ignored in the
middle of the data stream (which is legal and required per Unicode
standard) and that this only happens if the character set is actually
Unicode.

Any ideas?

#16Andrew Dunstan
andrew@dunslane.net
In reply to: Peter Eisentraut (#15)
Re: UTF8 with BOM support in psql

Peter Eisentraut wrote:

On ons, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:

Client encoding is declared in body of a file, but BOM is
in head of the file. So, we should always ignore BOM sequence
at the file head no matter what client encoding is used.

The attached patch replace BOM with while spaces, but it does not
change client encoding automatically. I think we can always ignore
client encoding at the replacement because SQL command cannot start
with BOM sequence. If we don't ignore the sequence, execution of
the script must fail with syntax error.

I don't know what the best solution is here. The BOM encoded as UTF-8
is valid data in other encodings. Of course, there is your point that
such data cannot be at the start of an SQL command.

There is also the notion of how files are handled on Unix. Normally,
you'd assume that all of

psql -f file.sql
psql < file.sql
cat file.sql | psql
cat file1.sql file2.sql | psql

behave consistently. That would require that the BOM is ignored in the
middle of the data stream (which is legal and required per Unicode
standard) and that this only happens if the character set is actually
Unicode.

Cases 2 and 3 should be indistinguishable from psql's POV, although case
3 wins a "Useless Use of cat" award.

If we are only eating a BOM at the start of a file, which was the
consensus IIRC, and we treat STDIN as a file for this purpose, then we
would eat the leading BOM on file.sql and file1.sql in all the cases
above but not on file2.sql since we would not have any idea where the
file boundary was. That last case strikes me as a not very likely usage
(I'm pretty sure I've never used it, at least). A file containing:

\i file1.sql
\i file2.sql

would be the workaround if needed.

As for handling the fact that client encoding can't be set in a script
until after the leading BOM, there is always

PGOPTIONS="-c client_encoding=UTF8"

or similar.

cheers

andrew

#17Peter Eisentraut
peter_e@gmx.net
In reply to: ITAGAKI Takahiro (#11)
Re: UTF8 with BOM support in psql

On ons, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:

Sure. Client encoding is declared in body of a file, but BOM is
in head of the file. So, we should always ignore BOM sequence
at the file head no matter what client encoding is used.

The attached patch replace BOM with while spaces, but it does not
change client encoding automatically. I think we can always ignore
client encoding at the replacement because SQL command cannot start
with BOM sequence. If we don't ignore the sequence, execution of
the script must fail with syntax error.

OK, I think the consensus here is:

- Eat BOM at beginning of file (as you implemented)

- Only when client encoding is UTF-8 --> please fix that

I'm not sure if replacing a BOM by three spaces is a good way to
implement "eating", because it might throw off a column indicator
somewhere, say, but I couldn't reproduce a problem. Note that the U
+FEFF character is defined as *zero-width* non-breaking space.

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#17)
Re: UTF8 with BOM support in psql

Peter Eisentraut <peter_e@gmx.net> writes:

I'm not sure if replacing a BOM by three spaces is a good way to
implement "eating", because it might throw off a column indicator
somewhere, say, but I couldn't reproduce a problem. Note that the U
+FEFF character is defined as *zero-width* non-breaking space.

So wouldn't it be better to remove the three bytes, rather than
replace with spaces? The latter will certainly confuse clients that
think that "column 1" means what they think is the first character.
A syntax error in the first line of the file should be sufficient
to demonstrate the issue.

regards, tom lane

#19ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Peter Eisentraut (#17)
Re: UTF8 with BOM support in psql

Peter Eisentraut <peter_e@gmx.net> wrote:

OK, I think the consensus here is:
- Eat BOM at beginning of file (as you implemented)
- Only when client encoding is UTF-8 --> please fix that

Are they AND condition? If so, this patch will be useless.
Please remember \encoding or SET client_encoding appear
*after* BOM at beginning of file. I'll agree if the condition is
"Eat BOM at beginning of file and <<set client encoding to UTF-8>>",
like:
Defining Python Source Code Encodings:
http://www.python.org/dev/peps/pep-0263/

I'm not sure if replacing a BOM by three spaces is a good way to
implement "eating", because it might throw off a column indicator
somewhere, say, but I couldn't reproduce a problem. Note that the U
+FEFF character is defined as *zero-width* non-breaking space.

I assumed psql discards whitespaces automatically, but I see it is
more robust to remove BOM bytes explitly. I'll fix it.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: ITAGAKI Takahiro (#19)
Re: UTF8 with BOM support in psql

Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:

Please remember \encoding or SET client_encoding appear
*after* BOM at beginning of file. I'll agree if the condition is
"Eat BOM at beginning of file and <<set client encoding to UTF-8>>",

As has been stated multiple times, that will not get accepted,
because it will *break* files in other encodings that chance to
match the BOM pattern.

regards, tom lane

#21Andrew Dunstan
andrew@dunslane.net
In reply to: ITAGAKI Takahiro (#19)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#21)
#23ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Tom Lane (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: ITAGAKI Takahiro (#23)
#25ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Tom Lane (#24)
#26Peter Eisentraut
peter_e@gmx.net
In reply to: ITAGAKI Takahiro (#25)
#27ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Peter Eisentraut (#26)
#28Chuck McDevitt
cmcdevitt@greenplum.com
In reply to: Peter Eisentraut (#15)
#29Andrew Dunstan
andrew@dunslane.net
In reply to: ITAGAKI Takahiro (#27)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#26)
#31Peter Eisentraut
peter_e@gmx.net
In reply to: ITAGAKI Takahiro (#19)
#32Peter Eisentraut
peter_e@gmx.net
In reply to: Chuck McDevitt (#28)
#33Chuck McDevitt
cmcdevitt@greenplum.com
In reply to: Peter Eisentraut (#32)
#34Andrew Dunstan
andrew@dunslane.net
In reply to: Peter Eisentraut (#32)
#35Chuck McDevitt
cmcdevitt@greenplum.com
In reply to: Andrew Dunstan (#34)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#34)
#37Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#36)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#37)
#39ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Peter Eisentraut (#31)
#40ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Andrew Dunstan (#29)
#41Andrew Dunstan
andrew@dunslane.net
In reply to: ITAGAKI Takahiro (#40)
#42ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Andrew Dunstan (#41)
#43Peter Eisentraut
peter_e@gmx.net
In reply to: ITAGAKI Takahiro (#39)
#44Peter Eisentraut
peter_e@gmx.net
In reply to: Andrew Dunstan (#41)
#45Andrew Dunstan
andrew@dunslane.net
In reply to: Peter Eisentraut (#44)
#46Peter Eisentraut
peter_e@gmx.net
In reply to: Andrew Dunstan (#45)
#47Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#46)
#48Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#17)