How to convert HEX to ASCII?

Started by Torsten Zühlsdorffover 14 years ago13 messagesgeneral

Torsten Zühlsdorff

foo@meisterderspiele.de

over 14 years ago

Hello,

after several attempts I have finally succeeded in developing a
urlencode() function to encode text correctly like defined in RFC 1738.

Now i have a big problem: how to decode the text?

Example:
# SELECT urlencode('Hellö World!');
urlencode
-----------------------
Hell%C3%B6%20World%21

Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'?

Thanks for your help and greetings from Germany,
Torsten
--
http://www.dddbl.de - ein Datenbank-Layer, der die Arbeit mit 8
verschiedenen Datenbanksystemen abstrahiert,
Queries von Applikationen trennt und automatisch die Query-Ergebnisse
auswerten kann.

Damien Churchill

damoxc@gmail.com

over 14 years ago

In reply to: Torsten Zühlsdorff (#1)

Re: How to convert HEX to ASCII?

On 2 December 2011 13:18, Torsten Zuehlsdorff <foo@meisterderspiele.de> wrote:

Hello,

after several attempts I have finally succeeded in developing a urlencode()
function to encode text correctly like defined in RFC 1738.

Now i have a big problem: how to decode the text?

Example:
# SELECT urlencode('Hellö World!');
urlencode
-----------------------
Hell%C3%B6%20World%21

Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'?

I've extracted the unquote method [0]http://pastie.org/2954968 from urllib in the python stdlib
that decodes urlencoded strings. Hopefully be some use!

[0]: http://pastie.org/2954968

Torsten Zühlsdorff

foo@meisterderspiele.de

over 14 years ago

In reply to: Damien Churchill (#2)

Re: How to convert HEX to ASCII?

Damien Churchill schrieb:

after several attempts I have finally succeeded in developing a urlencode()
function to encode text correctly like defined in RFC 1738.

Now i have a big problem: how to decode the text?

Example:
# SELECT urlencode('Hellö World!');
urlencode
-----------------------
Hell%C3%B6%20World%21

Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'?

I've extracted the unquote method [0] from urllib in the python stdlib
that decodes urlencoded strings. Hopefully be some use!

Not directly, but it gives me some helpful hints. For example i'm now
able to decode some basic characters, for example:

# SELECT chr(x'21'::int);
chr
-----
!
(1 row)

But i clearly have a missunderstanding of other chars, like umlauts or
utf-8 chars. This, for example, should return a 'ö':

# SELECT chr(x'C3B6'::int);
chr
-----
쎶
(1 row)

Also i'm not sure how to figure out, when to decode '%C3' and when to
decode '%C3%B6'.

Thanks for your help,
Torsten
--
http://www.dddbl.de - ein Datenbank-Layer, der die Arbeit mit 8
verschiedenen Datenbanksystemen abstrahiert,
Queries von Applikationen trennt und automatisch die Query-Ergebnisse
auswerten kann.

mmoncure@gmail.com

over 14 years ago

In reply to: Torsten Zühlsdorff (#3)

Re: How to convert HEX to ASCII?

On Fri, Dec 2, 2011 at 8:16 AM, Torsten Zuehlsdorff
<foo@meisterderspiele.de> wrote:

Damien Churchill schrieb:

after several attempts I have finally succeeded in developing a
urlencode()
function to encode text correctly like defined in RFC 1738.

Now i have a big problem: how to decode the text?

Example:
# SELECT urlencode('Hellö World!');
urlencode
-----------------------
Hell%C3%B6%20World%21

Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'?

I've extracted the unquote method [0] from urllib in the python stdlib
that decodes urlencoded strings. Hopefully be some use!

Not directly, but it gives me some helpful hints. For example i'm now able
to decode some basic characters, for example:

# SELECT chr(x'21'::int);
chr
-----
!
(1 row)

But i clearly have a missunderstanding of other chars, like umlauts or utf-8
chars. This, for example, should return a 'ö':

# SELECT chr(x'C3B6'::int);
chr
-----
쎶
(1 row)

Also i'm not sure how to figure out, when to decode '%C3' and when to decode
'%C3%B6'.

Thanks for your help,

You're welcome. get ready for some seriously abusive sql:

create or replace function unencode(text) returns text as
$$
with q as
(
select (regexp_matches($1, '(%..|.)', 'g'))[1] as v
)
select string_agg(case when length(v) = 3 then chr(replace(v, '%',
'x')::bit(8)::int) else v end, '') from q;
$$ language sql immutable;

set client_encoding to latin1;
SET

postgres=# select unencode('Hell%C3%B6%20World%21');
unencode
---------------
Hellö World!
(1 row)

Time: 1.908 ms

(maybe this isn't really an immutable function, but oh well).

merlin

marti@juffo.org

over 14 years ago

In reply to: Merlin Moncure (#4)

Re: How to convert HEX to ASCII?

On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff
<foo@meisterderspiele.de> wrote:

But i clearly have a missunderstanding of other chars, like umlauts or utf-8
chars. This, for example, should return a 'ö':

# SELECT chr(x'C3B6'::int);
chr
-----
쎶
(1 row)

That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded
and actually decodes to the codepoint 00F6.

There is a fundamental problem that a decoded URL may actually be a
binary string -- it might not have a textual representation at all.
But if text is what you want, RFC3986 strongly suggests using UTF-8
for encoding text strings in URLs, and that works almost always in the
real world.

So the *right* way is to first convert the URL to a binary "bytea"
type by fixing all the % escapes, then convert that to UTF-8 encoding
to handle multibyte characters.

What I came up with is far from elegant because PostgreSQL lacks
convenient functions for bytea manipulation (no bytea_agg, etc).
Stealing a little from Merlin, this is what it looks like:

CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
bin bytea = '';
byte text;
BEGIN
FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP
IF length(byte) = 3 THEN
bin = bin || decode(substring(byte, 2, 2), 'hex');
ELSE
bin = bin || byte::bytea;
END IF;
END LOOP;
RETURN convert_from(bin, 'utf8');
END
$$;

db=# select url_decode('Hell%C3%B6%20World%21');
url_decode
--------------
Hellö World!

db=# select url_decode('%EC%8E%B6');
url_decode
------------
쎶

This will break for binary-encoded data in URLs, though.
db=# select url_decode('%fa%fa%fa');
ERROR: invalid byte sequence for encoding "UTF8": 0xfa
CONTEXT: PL/pgSQL function "url_decode" line 13 at RETURN

----

On Fri, Dec 2, 2011 at 17:46, Merlin Moncure <mmoncure@gmail.com> wrote:

set client_encoding to latin1;

postgres=# select unencode('Hell%C3%B6%20World%21');
unencode
---------------
Hellö World!
(1 row)

Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.

Regards,
Marti

mmoncure@gmail.com

over 14 years ago

In reply to: Marti Raudsepp (#5)

Re: How to convert HEX to ASCII?

On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp <marti@juffo.org> wrote:

Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.

yup -- your're right -- what a coincidence! I still prefer the 1
liner sql variant vs plpgsql loop though. nicely done.

merlin

pavel.stehule@gmail.com

over 14 years ago

In reply to: Merlin Moncure (#6)

Re: How to convert HEX to ASCII?

2011/12/2 Merlin Moncure <mmoncure@gmail.com>:

On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp <marti@juffo.org> wrote:

Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.

yup -- your're right -- what a coincidence! I still prefer the 1
liner sql variant vs plpgsql loop though. nicely done.

so bytea_agg - one param aggregate has sense

it's very easy to implement it

Pavel

Show quoted text

merlin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

mmoncure@gmail.com

over 14 years ago

In reply to: Pavel Stehule (#7)

Re: How to convert HEX to ASCII?

On Fri, Dec 2, 2011 at 11:15 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

2011/12/2 Merlin Moncure <mmoncure@gmail.com>:

On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp <marti@juffo.org> wrote:

Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.

yup -- your're right -- what a coincidence! I still prefer the 1
liner sql variant vs plpgsql loop though. nicely done.

so bytea_agg - one param aggregate has sense

it's very easy to implement it

yup:

create aggregate bytea_agg (bytea)
(
sfunc=byteacat,
stype=bytea
);

merlin

pavel.stehule@gmail.com

over 14 years ago

In reply to: Merlin Moncure (#8)

Re: How to convert HEX to ASCII?

2011/12/2 Merlin Moncure <mmoncure@gmail.com>:

On Fri, Dec 2, 2011 at 11:15 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

2011/12/2 Merlin Moncure <mmoncure@gmail.com>:

On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp <marti@juffo.org> wrote:

Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.

yup -- your're right -- what a coincidence! I still prefer the 1
liner sql variant vs plpgsql loop though. nicely done.

so bytea_agg - one param aggregate has sense

it's very easy to implement it

yup:

create aggregate bytea_agg (bytea)
(
sfunc=byteacat,
stype=bytea
);

this is workaround :)

without a memory preallocating it has same speed like cycle in plpgsql.

Regards

Pavel

Show quoted text

merlin

mmoncure@gmail.com

over 14 years ago

In reply to: Pavel Stehule (#9)

Re: How to convert HEX to ASCII?

On Fri, Dec 2, 2011 at 2:01 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

so bytea_agg - one param aggregate has sense

it's very easy to implement it

yup:

create aggregate bytea_agg (bytea)
(
sfunc=byteacat,
stype=bytea
);

this is workaround :)

without a memory preallocating it has same speed like cycle in plpgsql.

sure, but I prefer to code against the workaround because it's cleaner
and it makes things easier to port over when such a feature makes it
into core. also, one liner sql has better chance of inlining as a
general rule.

merlin

pavel.stehule@gmail.com

over 14 years ago

In reply to: Merlin Moncure (#10)

Re: How to convert HEX to ASCII?

2011/12/2 Merlin Moncure <mmoncure@gmail.com>:

On Fri, Dec 2, 2011 at 2:01 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

so bytea_agg - one param aggregate has sense

it's very easy to implement it

yup:

create aggregate bytea_agg (bytea)
(
sfunc=byteacat,
stype=bytea
);

this is workaround :)

without a memory preallocating it has same speed like cycle in plpgsql.

sure, but I prefer to code against the workaround because it's cleaner
and it makes things easier to port over when such a feature makes it
into core. also, one liner sql has better chance of inlining as a
general rule.

ook

Pavel

Show quoted text

merlin

Torsten Zühlsdorff

foo@meisterderspiele.de

over 14 years ago

In reply to: Marti Raudsepp (#5)

Re: How to convert HEX to ASCII?

Marti Raudsepp schrieb:

On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff
<foo@meisterderspiele.de> wrote:

But i clearly have a missunderstanding of other chars, like umlauts or utf-8
chars. This, for example, should return a 'ö':

# SELECT chr(x'C3B6'::int);
chr
-----
쎶
(1 row)

That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded
and actually decodes to the codepoint 00F6.

There is a fundamental problem that a decoded URL may actually be a
binary string -- it might not have a textual representation at all.
But if text is what you want, RFC3986 strongly suggests using UTF-8
for encoding text strings in URLs, and that works almost always in the
real world.

Text is what i want. :) I've created a highly specialiced CMS, which
handle a bunch of big sites (in meaning of a great numbers of users and
content). It has a build-in traffic-analyze and with this function it
creates a real time analyze of the keywords, a user used to find the
sites in search engines. This is very needful if you try to do SEO for
websites with more than 20.000 unique content-pages. :)

CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
bin bytea = '';
byte text;
BEGIN
FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP
IF length(byte) = 3 THEN
bin = bin || decode(substring(byte, 2, 2), 'hex');
ELSE
bin = bin || byte::bytea;
END IF;
END LOOP;
RETURN convert_from(bin, 'utf8');
END
$$;

Hey, this function looks similar to my encoding function :) Thank you
very munch!

This will break for binary-encoded data in URLs, though.

Thats no problem, i just have text.

Big thanks to all of you,
Torsten

pavel.stehule@gmail.com

over 14 years ago

In reply to: Marti Raudsepp (#5)

Re: How to convert HEX to ASCII?

Hello all

just note

9.1 will have a bytea_agg aggregate

regards

Pavel Stehule

2011/12/2 Marti Raudsepp <marti@juffo.org>:

Show quoted text

On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff
<foo@meisterderspiele.de> wrote:

But i clearly have a missunderstanding of other chars, like umlauts or utf-8
chars. This, for example, should return a 'ö':

# SELECT chr(x'C3B6'::int);
chr
-----
쎶
(1 row)

That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded
and actually decodes to the codepoint 00F6.

There is a fundamental problem that a decoded URL may actually be a
binary string -- it might not have a textual representation at all.
But if text is what you want, RFC3986 strongly suggests using UTF-8
for encoding text strings in URLs, and that works almost always in the
real world.

So the *right* way is to first convert the URL to a binary "bytea"
type by fixing all the % escapes, then convert that to UTF-8 encoding
to handle multibyte characters.

What I came up with is far from elegant because PostgreSQL lacks
convenient functions for bytea manipulation (no bytea_agg, etc).
Stealing a little from Merlin, this is what it looks like:

CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
bin bytea = '';
byte text;
BEGIN
FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP
IF length(byte) = 3 THEN
bin = bin || decode(substring(byte, 2, 2), 'hex');
ELSE
bin = bin || byte::bytea;
END IF;
END LOOP;
RETURN convert_from(bin, 'utf8');
END
$$;

db=# select url_decode('Hell%C3%B6%20World%21');
url_decode
--------------
Hellö World!

db=# select url_decode('%EC%8E%B6');
url_decode
------------
쎶

This will break for binary-encoded data in URLs, though.
db=# select url_decode('%fa%fa%fa');
ERROR: invalid byte sequence for encoding "UTF8": 0xfa
CONTEXT: PL/pgSQL function "url_decode" line 13 at RETURN

----

On Fri, Dec 2, 2011 at 17:46, Merlin Moncure <mmoncure@gmail.com> wrote:

set client_encoding to latin1;

postgres=# select unencode('Hell%C3%B6%20World%21');
unencode
---------------
Hellö World!
(1 row)

Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.

Regards,
Marti

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general