bytea and character encoding when inserting escaped literals

Started by Lee Feigenbaumalmost 18 years ago8 messagesgeneral
Jump to latest
#1Lee Feigenbaum
lee@thefigtrees.net

Hi,

I've searched the archives a fair amount on this topic, but have not
found quite the answer / explanation I'm looking for. I attribute this
to my eternal confusion over character encoding issues in all
environments, so I apologize in advance for what might be a stupid
question. :)

I'mm running Postgresql 8.3.1 on WinXP. I have a UTF8 database into
which I'm trying to execute a series of INSERT INTO DDL statements. One
of the columns in the table I'm inserting into is a BYTEA column,
intended to hold the bytes that are the representation of a (small)
image.[1]Actually, this DDL has been converted from that for a different DB that uses LONGVARBINARY for this. BYTEA was my best guess for the Postgresql equivalent.

I had thought -- apparently erroneously -- that because this is not a
text based column, that I could send any string of bytes (octets) via my
INSERT statement to populate values in this column. I'm using escaped
string literals with hexadecimal representation so my INSERTs look
something like:

INSERT INTO myTable VALUES (..., E'\x15\x1C\x2F\x00\x02...', ...) ;

As you might be able to guess, I'm getting the error:

ERROR: Invalid byte sequence for encoding "UTF8": 0x00

(I get the error whether I attempt this via JDBC or via the command-line
client with client encoding set to UTF8 or WIN1252.)

Again, I was surprised by this error since I thought from the
documentation at [2]http://www.postgresql.org/docs/8.3/interactive/datatype-binary.html that the server would only expect to be dealing in
a sequence of octets here, without any character-encoding constraints
implied by the DB's encoding.

What is the actual cause of this error, and how do I workaround it? Do I
need to pretend that my data is Unicode character data and specify the
UTF8 octets for that character data in my E'...' literal?

thanks in advance for any help!

Lee

PS [3]I also was confused as to why 0x00 would be an invalid UTF8 byte sequence. On its own, as I understand it, 0x00 is a fine UTF8 byte sequence (representing Unicode codepoint 0). And when I (from the command line) try to insert other invalid UTF8 sequences -- such as INSERT INTO foo VALUES (E'\xC0\x80') I get an error that mentions the full byte sequence as invalid: "invalid byte sequence for encoding "UTF8": 0xc080". So this further confuses me. :-)

[1]: Actually, this DDL has been converted from that for a different DB that uses LONGVARBINARY for this. BYTEA was my best guess for the Postgresql equivalent.
that uses LONGVARBINARY for this. BYTEA was my best guess for the
Postgresql equivalent.

[2]: http://www.postgresql.org/docs/8.3/interactive/datatype-binary.html

[3]: I also was confused as to why 0x00 would be an invalid UTF8 byte sequence. On its own, as I understand it, 0x00 is a fine UTF8 byte sequence (representing Unicode codepoint 0). And when I (from the command line) try to insert other invalid UTF8 sequences -- such as INSERT INTO foo VALUES (E'\xC0\x80') I get an error that mentions the full byte sequence as invalid: "invalid byte sequence for encoding "UTF8": 0xc080". So this further confuses me. :-)
sequence. On its own, as I understand it, 0x00 is a fine UTF8 byte
sequence (representing Unicode codepoint 0). And when I (from the
command line) try to insert other invalid UTF8 sequences -- such as
INSERT INTO foo VALUES (E'\xC0\x80') I get an error that mentions the
full byte sequence as invalid: "invalid byte sequence for encoding
"UTF8": 0xc080". So this further confuses me. :-)

#2Asche
asche.public@mac.com
In reply to: Lee Feigenbaum (#1)
Re: bytea and character encoding when inserting escaped literals

Hi Lee,

On 05.05.2008, at 17:07, Lee Feigenbaum wrote:

INSERT INTO myTable VALUES (..., E'\x15\x1C\x2F\x00\x02...', ...) ;

try escaping the backslashes:

INSERT INTO myTable VALUES (..., E'\\x15\\x1C\\x2F\\x00\\x02...', ...) ;

Jan

#3Lee Feigenbaum
lee@thefigtrees.net
In reply to: Asche (#2)
Re: bytea and character encoding when inserting escaped literals

Asche wrote:

Hi Lee,

On 05.05.2008, at 17:07, Lee Feigenbaum wrote:

INSERT INTO myTable VALUES (..., E'\x15\x1C\x2F\x00\x02...', ...) ;

try escaping the backslashes:

INSERT INTO myTable VALUES (..., E'\\x15\\x1C\\x2F\\x00\\x02...', ...) ;

Hi Jan,

Thanks for the suggestion. I should have mentioned in my original
message that as per your suggestion and the suggestion in the
documentation, I have tried escaping the backslashes. When I do this, I
get the error:

ERROR: invalid input syntax for type bytea

I tried also doing

INSERT INTO myTable VALUES (...,
E'\\x15\\x1C\\x2F\\x00\\x02...'::bytea, ...) ;

but get the same errors.

Lee

#4Asche
asche.public@mac.com
In reply to: Lee Feigenbaum (#3)
Re: bytea and character encoding when inserting escaped literals

Hi Lee,

Thanks for the suggestion. I should have mentioned in my original
message that as per your suggestion and the suggestion in the
documentation, I have tried escaping the backslashes. When I do
this, I get the error:

ERROR: invalid input syntax for type bytea

I tried also doing

INSERT INTO myTable VALUES (..., E'\\x15\\x1C\\x2F\\x00\
\x02...'::bytea, ...) ;

but get the same errors.

I think i see another problem with your query. You should convert to
three-digit octal (something like \\001\\002...) not \\x01 (hex?).

Jan

#5Andy Anderson
aanderson@amherst.edu
In reply to: Lee Feigenbaum (#1)
Re: bytea and character encoding when inserting escaped literals

I'm thinking that the answer is in the literal interpretation of the
error message, i.e. it doesn't like the specific byte 0x00, i.e. the
null byte. According to the docs (4.1.2.1. String Constants):

"The character with the code zero cannot be in a string constant."

The reason may be that these are handled by C under the hood, so that
sequence would terminate the string and there shouldn't be anything
following it.

So the question then becomes, how to insert binary data this way? I'm
not sure about that off-hand.

-- Andy

On May 5, 2008, at 11:07 AM, Lee Feigenbaum wrote:

Show quoted text

I had thought -- apparently erroneously -- that because this is not
a text based column, that I could send any string of bytes (octets)
via my INSERT statement to populate values in this column. I'm
using escaped string literals with hexadecimal representation so my
INSERTs look something like:

INSERT INTO myTable VALUES (..., E'\x15\x1C\x2F\x00\x02...', ...) ;

As you might be able to guess, I'm getting the error:

ERROR: Invalid byte sequence for encoding "UTF8": 0x00

(I get the error whether I attempt this via JDBC or via the command-
line client with client encoding set to UTF8 or WIN1252.)

Again, I was surprised by this error since I thought from the
documentation at [2] that the server would only expect to be
dealing in a sequence of octets here, without any character-
encoding constraints implied by the DB's encoding.

What is the actual cause of this error, and how do I workaround it?
Do I need to pretend that my data is Unicode character data and
specify the UTF8 octets for that character data in my E'...' literal?

thanks in advance for any help!

Lee

PS [3]

[1] Actually, this DDL has been converted from that for a different
DB that uses LONGVARBINARY for this. BYTEA was my best guess for
the Postgresql equivalent.

[2] http://www.postgresql.org/docs/8.3/interactive/datatype-
binary.html

[3] I also was confused as to why 0x00 would be an invalid UTF8
byte sequence. On its own, as I understand it, 0x00 is a fine UTF8
byte sequence (representing Unicode codepoint 0). And when I (from
the command line) try to insert other invalid UTF8 sequences --
such as INSERT INTO foo VALUES (E'\xC0\x80') I get an error that
mentions the full byte sequence as invalid: "invalid byte sequence
for encoding "UTF8": 0xc080". So this further confuses me. :-)

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Lee Feigenbaum
lee@thefigtrees.net
In reply to: Asche (#4)
Re: bytea and character encoding when inserting escaped literals

Asche wrote:

Hi Lee,

Thanks for the suggestion. I should have mentioned in my original
message that as per your suggestion and the suggestion in the
documentation, I have tried escaping the backslashes. When I do this,
I get the error:

ERROR: invalid input syntax for type bytea

I tried also doing

INSERT INTO myTable VALUES (...,
E'\\x15\\x1C\\x2F\\x00\\x02...'::bytea, ...) ;

but get the same errors.

I think i see another problem with your query. You should convert to
three-digit octal (something like \\001\\002...) not \\x01 (hex?).

Hi Jan,

Thanks, I think I finally see what's happening here (and understand the
docs) - the bytea type has its own string-serialization (escape format)
_separate_ from postgresql's normal string literal escaping. So while
E'\xC0' is postgresql serialization of a string containing whatever
character maps from 0xC0 in the current encoding, that byte cannot
directly go into a bytea. Instead, I need to have a doubly-escaped octal
(specifically) string so that the first escape generates a string like
\000\001\002 which the bytea processor (somewhere) then re-parses as a
sequence of bytes.

Would be nice if the bytea parser understood hex representation too, but
beggars can't be choosers :)

thanks for the help,
Lee

Show quoted text

Jan

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Lee Feigenbaum (#6)
Re: bytea and character encoding when inserting escaped literals

Lee Feigenbaum <lee@thefigtrees.net> writes:

Would be nice if the bytea parser understood hex representation too, but
beggars can't be choosers :)

decode() might help you:

select decode('1200AB', 'hex');
decode
--------------
\022\000\253
(1 row)

regards, tom lane

In reply to: Lee Feigenbaum (#1)
Re: bytea and character encoding when inserting escaped literals

On 05/05/2008 16:07, Lee Feigenbaum wrote:

INSERT INTO myTable VALUES (..., E'\x15\x1C\x2F\x00\x02...', ...) ;

As I understand it, the octets need to be entered as their octal
representation - have a look at table 8-7 at
http://www.postgresql.org/docs/8.3/static/datatype-binary.html.

HTH,

Ray.

------------------------------------------------------------------
Raymond O'Donnell, Director of Music, Galway Cathedral, Ireland
rod@iol.ie
Galway Cathedral Recitals: http://www.galwaycathedral.org/recitals
------------------------------------------------------------------