BUG #3819: UTF8 can't handle \000
The following bug has been logged online:
Bug reference: 3819
Logged by: Franklin Schmidt
Email address: fschmidt@gmail.com
PostgreSQL version: 8.2
Operating system: XP & Linux
Description: UTF8 can't handle \000
Details:
Trying to store \000 in a text field with UTF8 encoding causes an error. I
assume this is because Postgres is written in C, but it's still wrong. A
solution was suggested here:
http://www.nabble.com/invalid-byte-sequence-for-encoding-%22UTF8%22%3A-0x00-tp9058998p9096326.html
"I can think of some ways the server could support it without extensive
changes .. e.g. use a modified UTF8 representation which stores \u0000 as
0xc0 0x80 internally"
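To make the quoted suggestion concrete, here is a minimal Python sketch of the "modified UTF8" idea: encode normally, then write U+0000 as the overlong pair 0xC0 0x80 so the stored bytes never contain a zero byte. The function names are hypothetical, and this only illustrates the encoding trick; it is not something the PostgreSQL server does. It relies on two facts: in standard UTF-8 the byte 0x00 only ever encodes U+0000, and the byte 0xC0 never occurs in valid UTF-8, so the substitution is reversible.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text as UTF-8, but write U+0000 as the two bytes 0xC0 0x80."""
    # In standard UTF-8 a 0x00 byte can only come from U+0000, because every
    # byte of a multi-byte sequence has its high bit set.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_modified_utf8(data: bytes) -> str:
    """Reverse encode_modified_utf8: map 0xC0 0x80 back to U+0000."""
    # 0xC0 is never produced by a standard UTF-8 encoder, so the pair
    # 0xC0 0x80 can only be the marker written by encode_modified_utf8.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


stored = encode_modified_utf8("a\x00b")
restored = decode_modified_utf8(stored)
```

The stored form is safe to pass through C code that treats a zero byte as a string terminator, at the cost of no longer being strictly valid UTF-8.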
Franklin Schmidt wrote:
> The following bug has been logged online:
> Bug reference: 3819
> Logged by: Franklin Schmidt
> Email address: fschmidt@gmail.com
> PostgreSQL version: 8.2
> Operating system: XP & Linux
> Description: UTF8 can't handle \000
> Details:
> Trying to store \000 in a text field with UTF8 encoding causes an error. I
> assume this is because Postgres is written in C, but it's still wrong. A
> solution was suggested here:
> http://www.nabble.com/invalid-byte-sequence-for-encoding-%22UTF8%22%3A-0x00-tp9058998p9096326.html
> "I can think of some ways the server could support it without extensive
> changes .. e.g. use a modified UTF8 representation which stores \u0000 as
> 0xc0 0x80 internally"
Uh, as far as I know 0x00 is not a valid UTF8 byte value. I suggest you
use bytea to store 0x00.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Dec 17, 2007 12:54 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Uh, as far as I know 0x00 is not a valid UTF8 byte value.
I think it is a valid value. RFC 3629 says:
"Character numbers from U+0000 to U+007F (US-ASCII repertoire)
correspond to octets 00 to 7F (7 bit US-ASCII values)."
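The RFC statement can be checked with a few lines of Python, used here only as a convenient standard UTF-8 codec (it says nothing about what the PostgreSQL server accepts). The helper name is hypothetical.

```python
def ascii_range_is_single_octet() -> bool:
    """Check that every code point U+0000..U+007F encodes to the same single octet."""
    return all(
        chr(code_point).encode("utf-8") == bytes([code_point])
        for code_point in range(0x80)
    )


# U+0000 in particular encodes to the single byte 0x00 and decodes back.
nul_encoded = "\x00".encode("utf-8")
nul_decoded = b"\x00".decode("utf-8")
all_single = ascii_range_is_single_octet()
```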
Franklin Schmidt wrote:
> On Dec 17, 2007 12:54 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> Uh, as far as I know 0x00 is not a valid UTF8 byte value.
> I think it is a valid value. RFC 3629 says:
> "Character numbers from U+0000 to U+007F (US-ASCII repertoire)
> correspond to octets 00 to 7F (7 bit US-ASCII values)."
Well, I realize 0x00 is a valid ASCII value and therefore a valid UTF8
value but we have never had anyone complain they can't store the 0x00
character because it doesn't mean anything in ASCII. They use bytea to
store binary data like 0x00.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
On Dec 17, 2007 1:28 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Well, I realize 0x00 is a valid ASCII value and therefore a valid UTF8
> value but we have never had anyone complain they can't store the 0x00
> character because it doesn't mean anything in ASCII. They use bytea to
> store binary data like 0x00.
Here are a few complaints:
http://www.nabble.com/-tp9058998.html
http://www.nabble.com/-tp11750041.html
http://www.nabble.com/-tp8414157.html
I agree that storing 0x00 in a UTF8 string is weird, but I am
converting a huge database to postgres, and in a huge database, weird
things happen. Using bytea for a text field just because one in a
million records has a 0x00 doesn't make sense to me. I did hack
around it in my conversion code to remove the 0x00 but I expect that
anyone else who tries converting a big database to postgres will also
confront this issue.
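For anyone hitting the same problem during a migration, the workaround described above might look roughly like the following Python sketch. It is not the conversion code used here; the function names and the tuple-per-row shape are assumptions for illustration. It removes U+0000 from string values and passes every other value through unchanged.

```python
def strip_nul(value):
    """Return value with U+0000 removed if it is a str; pass other types through."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    return value


def clean_row(row):
    """Apply strip_nul to every column of a row (assumed to be a sequence)."""
    return tuple(strip_nul(column) for column in row)


# Example: one text column with an embedded NUL, plus non-text columns.
cleaned = clean_row(("ab\x00c", 7, None))
```

Running each row through clean_row before the INSERT keeps the target columns as text rather than bytea.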
Franklin Schmidt wrote:
> I agree that storing 0x00 in a UTF8 string is weird, but I am
> converting a huge database to postgres, and in a huge database, weird
> things happen. Using bytea for a text field just because one in a
> million records has a 0x00 doesn't make sense to me. I did hack
> around it in my conversion code to remove the 0x00 but I expect that
> anyone else who tries converting a big database to postgres will also
> confront this issue.
That's the right solution. If you have 0x00 bytes in your text fields,
you're much better off cleaning them away anyway, than trying to work
around them.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com