UTF8 with BOM support in psql
UTF-8 encoded text files with a BOM (Byte Order Mark) are commonly
used on Windows, though the BOM was originally designed for UTF-16 text.
However, psql cannot read such files even if we set the client encoding
to UTF8. Is it worth supporting this format in psql?
When psql opens a file with -f or \i, it checks the first 3 bytes of the
file. If they are a BOM, it discards the 3 bytes and changes the client
encoding to UTF8 automatically.
Is this change reasonable? Comments welcome.
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Itagaki Takahiro wrote:
UTF-8 encoded text files with a BOM (Byte Order Mark) are commonly
used on Windows, though the BOM was originally designed for UTF-16 text.
However, psql cannot read such files even if we set the client encoding
to UTF8. Is it worth supporting this format in psql?
When psql opens a file with -f or \i, it checks the first 3 bytes of the
file. If they are a BOM, it discards the 3 bytes and changes the client
encoding to UTF8 automatically.
Is this change reasonable? Comments welcome.
Seems there is community support for accepting BOM:
http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php
Should I add this as a TODO item?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> wrote:
Itagaki Takahiro wrote:
When psql opens a file with -f or \i, it checks the first 3 bytes of the
file. If they are a BOM, it discards the 3 bytes and changes the client
encoding to UTF8 automatically.
Seems there is community support for accepting BOM:
http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php
Thank you for the information.
I read the thread, in which we discussed BOM handling in *data types*.
I agree with the decision in that thread that we should not skip BOM
characters, but we can handle the BOM in a different way at the head of
*files* for psql and COPY input.
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
On Tue, 2009-10-20 at 14:41 +0900, Itagaki Takahiro wrote:
UTF-8 encoded text files with a BOM (Byte Order Mark) are commonly
used on Windows, though the BOM was originally designed for UTF-16 text.
However, psql cannot read such files even if we set the client encoding
to UTF8. Is it worth supporting this format in psql?
psql doesn't have a problem, but the backend's lexer doesn't parse the
BOM as whitespace. Since the lexer is byte-based, it will presumably
have problems with anything outside of ASCII that Unicode considers
whitespace.
When psql opens a file with -f or \i, it checks the first 3 bytes of the
file. If they are a BOM, it discards the 3 bytes and changes the client
encoding to UTF8 automatically.
While I see that the Unicode standard supports using a UTF-8 encoded BOM
as a UTF-8 signature, I wonder if those bytes can usefully appear in a
leading position in other encodings.
Bruce Momjian <bruce@momjian.us> writes:
Seems there is community support for accepting BOM:
http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php
That discussion has approximately nothing to do with the
much-more-invasive change that Itagaki-san is suggesting.
In particular I think an automatic change of client_encoding isn't
particularly a good idea --- wouldn't you have to change it back later,
and is there any possibility of creating a security issue from such
behavior? Remember that client_encoding *IS* tied to security issues
such as backslash escape handling.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Seems there is community support for accepting BOM:
http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php
That discussion has approximately nothing to do with the
much-more-invasive change that Itagaki-san is suggesting.
In particular I think an automatic change of client_encoding isn't
particularly a good idea --- wouldn't you have to change it back later,
and is there any possibility of creating a security issue from such
behavior? Remember that client_encoding *IS* tied to security issues
such as backslash escape handling.
Yeah, I don't think we should be second-guessing the user on the encoding.
What I think we might sensibly do is to eat the leading BOM of an SQL
file iff the client encoding is UTF8, and otherwise treat it as just
bytes in whatever the encoding is.
Should we also do the same for files passed via \copy? What about
streams on stdin? What about files read from the backend via COPY?
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
What I think we might sensibly do is to eat the leading BOM of an SQL
file iff the client encoding is UTF8, and otherwise treat it as just
bytes in whatever the encoding is.
That seems relatively non-risky.
Should we also do the same for files passed via \copy? What about
streams on stdin? What about files read from the backend via COPY?
Not thrilled about doing this on stdin --- you have no good
justification for assuming that start of stdin corresponds to a file
boundary somewhere. COPY files, maybe.
regards, tom lane
Andrew Dunstan <andrew@dunslane.net> wrote:
What I think we might sensibly do is to eat the leading BOM of an
SQL file iff the client encoding is UTF8, and otherwise treat it as
just bytes in whatever the encoding is.
Only at the beginning of the file or stream? What happens when people
concatenate files? Would it make sense to treat BOM as whitespace in
UTF-8, or maybe ignore it entirely?
-Kevin
2009/10/20 Tom Lane <tgl@sss.pgh.pa.us>:
Andrew Dunstan <andrew@dunslane.net> writes:
What I think we might sensibly do is to eat the leading BOM of an SQL
file iff the client encoding is UTF8, and otherwise treat it as just
bytes in whatever the encoding is.
That seems relatively non-risky.
+1.
Should we also do the same for files passed via \copy? What about
streams on stdin? What about files read from the backend via COPY?
Not thrilled about doing this on stdin --- you have no good
justification for assuming that start of stdin corresponds to a file
boundary somewhere. COPY files, maybe.
Yeah, that seems a lot more error-prone.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Oct 20, 2009, at 10:51 AM, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
What I think we might sensibly do is to eat the leading BOM of an SQL
file iff the client encoding is UTF8, and otherwise treat it as just
bytes in whatever the encoding is.
That seems relatively non-risky.
Is that only when the default client encoding is set to UTF8
(PGCLIENTENCODING, whatever), or will it be coded to work with the
following:
$ PGCLIENTENCODING=...nonutf8...
$ psql -f <file>
Where <file> is:
<BOM>
...
SET CLIENT ENCODING 'utf8';
...
EOF
Regards,
David
--
David Christensen
End Point Corporation
david@endpoint.com
David Christensen <david@endpoint.com> wrote:
Is that only when the default client encoding is set to UTF8
(PGCLIENTENCODING, whatever), or will it be coded to work with the
following:
$ psql -f <file>
Where <file> is:
<BOM>
SET CLIENT ENCODING 'utf8';
Sure. The client encoding is declared in the body of a file, but the
BOM is at the head of the file. So, we should always ignore the BOM
sequence at the head of the file no matter what client encoding is used.
The attached patch replaces the BOM with white spaces, but it does not
change the client encoding automatically. I think we can always ignore
the client encoding when doing the replacement, because an SQL command
cannot start with a BOM sequence. If we don't ignore the sequence,
execution of the script must fail with a syntax error.
This patch does nothing about the COPY and \copy commands. It might be
possible to add BOM-handling code around AllocateFile() in CopyFrom()
to support "COPY FROM 'utf8file-with-bom.tsv'", but we need another
approach for "COPY FROM STDIN". For example,
$ cat utf8bom-1.tsv utf8bom-2.tsv | psql -c "COPY FROM STDIN"
might contain a BOM sequence in the middle of the input stream.
Anyway, those changes could come in another patch in the future.
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Attachments:
psql-utf8bom_20091021.patch (application/octet-stream, +12 -12)
On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
The attached patch replaces the BOM with white spaces, but it does not
change the client encoding automatically. I think we can always ignore
the client encoding when doing the replacement, because an SQL command
cannot start with a BOM sequence. If we don't ignore the sequence,
execution of the script must fail with a syntax error.
I feel that psql is the wrong place to fix this. BOMs in UTF-8 should
be ignored everywhere, all the time.
Peter Eisentraut wrote:
On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
The attached patch replaces the BOM with white spaces, but it does not
change the client encoding automatically. I think we can always ignore
the client encoding when doing the replacement, because an SQL command
cannot start with a BOM sequence. If we don't ignore the sequence,
execution of the script must fail with a syntax error.
I feel that psql is the wrong place to fix this. BOMs in UTF-8 should
be ignored everywhere, all the time.
I suggest you re-read the Unicode FAQ on the subject. That is not the
conclusion I came to after I read it. Quite the reverse in fact.
cheers
andrew
On ons, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
So, we should always ignore the BOM sequence
at the head of the file no matter what client encoding is used.
I think we can't do that. That byte sequence might be valid user data
in other encodings.
On ons, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
The client encoding is declared in the body of a file, but the BOM is
at the head of the file. So, we should always ignore the BOM sequence
at the head of the file no matter what client encoding is used.
The attached patch replaces the BOM with white spaces, but it does not
change the client encoding automatically. I think we can always ignore
the client encoding when doing the replacement, because an SQL command
cannot start with a BOM sequence. If we don't ignore the sequence,
execution of the script must fail with a syntax error.
I don't know what the best solution is here. The BOM encoded as UTF-8
is valid data in other encodings. Of course, there is your point that
such data cannot be at the start of an SQL command.
There is also the notion of how files are handled on Unix. Normally,
you'd assume that all of
psql -f file.sql
psql < file.sql
cat file.sql | psql
cat file1.sql file2.sql | psql
behave consistently. That would require that the BOM is ignored in the
middle of the data stream (which is legal and required per Unicode
standard) and that this only happens if the character set is actually
Unicode.
Any ideas?
Peter Eisentraut wrote:
On ons, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
The client encoding is declared in the body of a file, but the BOM is
at the head of the file. So, we should always ignore the BOM sequence
at the head of the file no matter what client encoding is used.
The attached patch replaces the BOM with white spaces, but it does not
change the client encoding automatically. I think we can always ignore
the client encoding when doing the replacement, because an SQL command
cannot start with a BOM sequence. If we don't ignore the sequence,
execution of the script must fail with a syntax error.
I don't know what the best solution is here. The BOM encoded as UTF-8
is valid data in other encodings. Of course, there is your point that
such data cannot be at the start of an SQL command.
There is also the notion of how files are handled on Unix. Normally,
you'd assume that all of
psql -f file.sql
psql < file.sql
cat file.sql | psql
cat file1.sql file2.sql | psql
behave consistently. That would require that the BOM is ignored in the
middle of the data stream (which is legal and required per Unicode
standard) and that this only happens if the character set is actually
Unicode.
Cases 2 and 3 should be indistinguishable from psql's POV, although case
3 wins a "Useless Use of cat" award.
If we are only eating a BOM at the start of a file, which was the
consensus IIRC, and we treat STDIN as a file for this purpose, then we
would eat the leading BOM on file.sql and file1.sql in all the cases
above but not on file2.sql since we would not have any idea where the
file boundary was. That last case strikes me as a not very likely usage
(I'm pretty sure I've never used it, at least). A file containing:
\i file1.sql
\i file2.sql
would be the workaround if needed.
As for handling the fact that client encoding can't be set in a script
until after the leading BOM, there is always
PGOPTIONS="-c client_encoding=UTF8"
or similar.
cheers
andrew
On ons, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
Sure. The client encoding is declared in the body of a file, but the
BOM is at the head of the file. So, we should always ignore the BOM
sequence at the head of the file no matter what client encoding is used.
The attached patch replaces the BOM with white spaces, but it does not
change the client encoding automatically. I think we can always ignore
the client encoding when doing the replacement, because an SQL command
cannot start with a BOM sequence. If we don't ignore the sequence,
execution of the script must fail with a syntax error.
OK, I think the consensus here is:
- Eat BOM at beginning of file (as you implemented)
- Only when client encoding is UTF-8 --> please fix that
I'm not sure if replacing a BOM by three spaces is a good way to
implement "eating", because it might throw off a column indicator
somewhere, say, but I couldn't reproduce a problem. Note that the
U+FEFF character is defined as a *zero-width* non-breaking space.
Peter Eisentraut <peter_e@gmx.net> writes:
I'm not sure if replacing a BOM by three spaces is a good way to
implement "eating", because it might throw off a column indicator
somewhere, say, but I couldn't reproduce a problem. Note that the
U+FEFF character is defined as a *zero-width* non-breaking space.
So wouldn't it be better to remove the three bytes, rather than
replace with spaces? The latter will certainly confuse clients that
think that "column 1" means what they think is the first character.
A syntax error in the first line of the file should be sufficient
to demonstrate the issue.
regards, tom lane
Peter Eisentraut <peter_e@gmx.net> wrote:
OK, I think the consensus here is:
- Eat BOM at beginning of file (as you implemented)
- Only when client encoding is UTF-8 --> please fix that
Is that an AND condition? If so, this patch would be useless.
Please remember that \encoding or SET client_encoding appears
*after* the BOM at the beginning of the file. I'll agree if the
condition is "Eat BOM at beginning of file and <<set client encoding
to UTF-8>>", like:
Defining Python Source Code Encodings:
http://www.python.org/dev/peps/pep-0263/
I'm not sure if replacing a BOM by three spaces is a good way to
implement "eating", because it might throw off a column indicator
somewhere, say, but I couldn't reproduce a problem. Note that the
U+FEFF character is defined as a *zero-width* non-breaking space.
I assumed psql discards whitespace automatically, but I see it is
more robust to remove the BOM bytes explicitly. I'll fix it.
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
Please remember that \encoding or SET client_encoding appears
*after* the BOM at the beginning of the file. I'll agree if the
condition is "Eat BOM at beginning of file and <<set client encoding
to UTF-8>>",
As has been stated multiple times, that will not get accepted,
because it will *break* files in other encodings that chance to
match the BOM pattern.
regards, tom lane