encode, lower and 0x8a

Started by Michael Artzabout 19 years ago5 messagesgeneral
Jump to latest
#1Michael Artz
mlartz@gmail.com

Perhaps my understanding of the 'encode' function is incorrect, but I
was under the impression that I could do something like:

SELECT lower(encode(bytes, 'escape')) FROM mytable;

as it sounded like (from the manual) that 'encode' would return valid
ASCII, with all the non-ascii bytes hex escaped. When I have the byte
0x8a, however, I get the error:

ERROR: invalid byte sequence for encoding "UTF8": 0x8a

I have the sneaking suspicion that I am missing something, so please
correct me if I am wrong. If I am wrong, is there a better way to
lowercase all the ascii characters in a bytea string?

Here is a simple way to recreate this:

CREATE TABLE mytable (bytes BYTEA);
INSERT INTO mytable VALUES (E'212');
SELECT lower(encode(bytes, 'escape')) FROM mytable;

Thanks
-Mike

#2Michael Fuhr
mike@fuhr.org
In reply to: Michael Artz (#1)
Re: encode, lower and 0x8a

On Thu, Jan 25, 2007 at 02:28:38PM -0500, Michael Artz wrote:

Perhaps my understanding of the 'encode' function is incorrect, but I
was under the impression that I could do something like:

SELECT lower(encode(bytes, 'escape')) FROM mytable;

as it sounded like (from the manual) that 'encode' would return valid
ASCII, with all the non-ascii bytes hex escaped.

The documentation for encode() does give that impression: "Encode
binary string to ASCII-only representation. Supported types are:
base64, hex, escape." However, the source code for esc_encode()
in src/backend/utils/adt/encode.c says and does otherwise:

* Only two characters are escaped:
* \0 (null) and \\ (backslash)

When I have the byte 0x8a, however, I get the error:

ERROR: invalid byte sequence for encoding "UTF8": 0x8a

Since encode() returns text and doesn't escape non-ASCII characters,
all of the original binary data will be treated as though it's text
in the database's encoding. If the data contains byte sequences
that aren't valid in that encoding then you get the above error.

I have the sneaking suspicion that I am missing something, so please
correct me if I am wrong. If I am wrong, is there a better way to
lowercase all the ascii characters in a bytea string?

What are you trying to do? What is the binary data and why are you
treating it (or part of it) as though it's text? Do you want the
end result to be text with escape sequences or do you want to convert
it back to bytea?

Something like this might work:

SELECT lower(textin(byteaout(bytes))) FROM mytable;

To turn the result back into bytea:

SELECT decode(lower(textin(byteaout(bytes))), 'escape') FROM mytable;

--
Michael Fuhr

#3Michael Artz
mlartz@gmail.com
In reply to: Michael Fuhr (#2)
Re: encode, lower and 0x8a

On 1/27/07, Michael Fuhr <mike@fuhr.org> wrote:

On Thu, Jan 25, 2007 at 02:28:38PM -0500, Michael Artz wrote:

Perhaps my understanding of the 'encode' function is incorrect, but I
was under the impression that I could do something like:

SELECT lower(encode(bytes, 'escape')) FROM mytable;

as it sounded like (from the manual) that 'encode' would return valid
ASCII, with all the non-ascii bytes hex escaped.

...snip...

I have the sneaking suspicion that I am missing something, so please
correct me if I am wrong. If I am wrong, is there a better way to
lowercase all the ascii characters in a bytea string?

What are you trying to do? What is the binary data and why are you
treating it (or part of it) as though it's text? Do you want the
end result to be text with escape sequences or do you want to convert
it back to bytea?

The data are binary strings culled from network traffic. Most of the
"binary strings" are just ascii strings, however not all of them are,
and some of the ascii strings have binary characters embedded in them.
In this case, I was displaying the string to a user via a web
browser, in which case I wanted the output to be escaped. The reason
for the 'lower()' was to enable the case-insensitive sort, so actually
the offending line of the program would look something more like:

SELECT encode(bytes, 'escape') FROM mytables ORDER BY
lower(encode(bytes, 'escape'))

Something like this might work:

SELECT lower(textin(byteaout(bytes))) FROM mytable;

To turn the result back into bytea:

SELECT decode(lower(textin(byteaout(bytes))), 'escape') FROM mytable;

That seems to work correctly, however I missed the functions textin'
and 'byteaout' in the docs ... are they documented somewhere?

Thanks,
-Mike

#4Michael Fuhr
mike@fuhr.org
In reply to: Michael Artz (#3)
Re: encode, lower and 0x8a

On Mon, Jan 29, 2007 at 12:52:33PM -0500, Michael Artz wrote:

On 1/27/07, Michael Fuhr <mike@fuhr.org> wrote:

SELECT decode(lower(textin(byteaout(bytes))), 'escape') FROM mytable;

That seems to work correctly, however I missed the functions textin'
and 'byteaout' in the docs ... are they documented somewhere?

Those are the types' input/output functions. They're not specifically
documented but "User-Defined Types" in the "Extending SQL" chapter
discusses the concept; see also CREATE TYPE.

http://www.postgresql.org/docs/8.2/interactive/xtypes.html
http://www.postgresql.org/docs/8.2/interactive/sql-createtype.html

Using the input/output functions is a bit hackish since, as you
discovered, their exact behavior isn't documented.

--
Michael Fuhr

#5Bruce Momjian
bruce@momjian.us
In reply to: Michael Fuhr (#2)
Re: encode, lower and 0x8a

I have updated the encode() documentation to not mention "ASCII", and to
be more specific about what 'escape' does. Backpatched to 8.2.X.

---------------------------------------------------------------------------

Michael Fuhr wrote:

On Thu, Jan 25, 2007 at 02:28:38PM -0500, Michael Artz wrote:

Perhaps my understanding of the 'encode' function is incorrect, but I
was under the impression that I could do something like:

SELECT lower(encode(bytes, 'escape')) FROM mytable;

as it sounded like (from the manual) that 'encode' would return valid
ASCII, with all the non-ascii bytes hex escaped.

The documentation for encode() does give that impression: "Encode
binary string to ASCII-only representation. Supported types are:
base64, hex, escape." However, the source code for esc_encode()
in src/backend/utils/adt/encode.c says and does otherwise:

* Only two characters are escaped:
* \0 (null) and \\ (backslash)

When I have the byte 0x8a, however, I get the error:

ERROR: invalid byte sequence for encoding "UTF8": 0x8a

Since encode() returns text and doesn't escape non-ASCII characters,
all of the original binary data will be treated as though it's text
in the database's encoding. If the data contains byte sequences
that aren't valid in that encoding then you get the above error.

I have the sneaking suspicion that I am missing something, so please
correct me if I am wrong. If I am wrong, is there a better way to
lowercase all the ascii characters in a bytea string?

What are you trying to do? What is the binary data and why are you
treating it (or part of it) as though it's text? Do you want the
end result to be text with escape sequences or do you want to convert
it back to bytea?

Something like this might work:

SELECT lower(textin(byteaout(bytes))) FROM mytable;

To turn the result back into bytea:

SELECT decode(lower(textin(byteaout(bytes))), 'escape') FROM mytable;

--
Michael Fuhr

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Attachments:

/rtmp/difftext/x-diffDownload+4-2