How to convert HEX to ASCII?
Hello,
after several attempts I have finally succeeded in developing a
urlencode() function to encode text correctly like defined in RFC 1738.
Now i have a big problem: how to decode the text?
Example:
# SELECT urlencode('Hellö World!');
urlencode
-----------------------
Hell%C3%B6%20World%21
Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'?
Thanks for your help and greetings from Germany,
Torsten
--
http://www.dddbl.de - ein Datenbank-Layer, der die Arbeit mit 8
verschiedenen Datenbanksystemen abstrahiert,
Queries von Applikationen trennt und automatisch die Query-Ergebnisse
auswerten kann.
On 2 December 2011 13:18, Torsten Zuehlsdorff <foo@meisterderspiele.de> wrote:
Hello,
after several attempts I have finally succeeded in developing a urlencode()
function to encode text correctly like defined in RFC 1738.Now i have a big problem: how to decode the text?
Example:
# SELECT urlencode('Hellö World!');
urlencode
-----------------------
Hell%C3%B6%20World%21Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'?
I've extracted the unquote method [0]http://pastie.org/2954968 from urllib in the python stdlib
that decodes urlencoded strings. Hopefully be some use!
Damien Churchill schrieb:
after several attempts I have finally succeeded in developing a urlencode()
function to encode text correctly like defined in RFC 1738.Now i have a big problem: how to decode the text?
Example:
# SELECT urlencode('Hellö World!');
urlencode
-----------------------
Hell%C3%B6%20World%21Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'?
I've extracted the unquote method [0] from urllib in the python stdlib
that decodes urlencoded strings. Hopefully be some use!
Not directly, but it gives me some helpful hints. For example i'm now
able to decode some basic characters, for example:
# SELECT chr(x'21'::int);
chr
-----
!
(1 row)
But i clearly have a missunderstanding of other chars, like umlauts or
utf-8 chars. This, for example, should return a 'ö':
# SELECT chr(x'C3B6'::int);
chr
-----
쎶
(1 row)
Also i'm not sure how to figure out, when to decode '%C3' and when to
decode '%C3%B6'.
Thanks for your help,
Torsten
--
http://www.dddbl.de - ein Datenbank-Layer, der die Arbeit mit 8
verschiedenen Datenbanksystemen abstrahiert,
Queries von Applikationen trennt und automatisch die Query-Ergebnisse
auswerten kann.
On Fri, Dec 2, 2011 at 8:16 AM, Torsten Zuehlsdorff
<foo@meisterderspiele.de> wrote:
Damien Churchill schrieb:
after several attempts I have finally succeeded in developing a
urlencode()
function to encode text correctly like defined in RFC 1738.Now i have a big problem: how to decode the text?
Example:
# SELECT urlencode('Hellö World!');
urlencode
-----------------------
Hell%C3%B6%20World%21Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'?
I've extracted the unquote method [0] from urllib in the python stdlib
that decodes urlencoded strings. Hopefully be some use!Not directly, but it gives me some helpful hints. For example i'm now able
to decode some basic characters, for example:# SELECT chr(x'21'::int);
chr
-----
!
(1 row)But i clearly have a missunderstanding of other chars, like umlauts or utf-8
chars. This, for example, should return a 'ö':# SELECT chr(x'C3B6'::int);
chr
-----
쎶
(1 row)Also i'm not sure how to figure out, when to decode '%C3' and when to decode
'%C3%B6'.Thanks for your help,
You're welcome. get ready for some seriously abusive sql:
create or replace function unencode(text) returns text as
$$
with q as
(
select (regexp_matches($1, '(%..|.)', 'g'))[1] as v
)
select string_agg(case when length(v) = 3 then chr(replace(v, '%',
'x')::bit(8)::int) else v end, '') from q;
$$ language sql immutable;
set client_encoding to latin1;
SET
postgres=# select unencode('Hell%C3%B6%20World%21');
unencode
---------------
Hellö World!
(1 row)
Time: 1.908 ms
(maybe this isn't really an immutable function, but oh well).
merlin
On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff
<foo@meisterderspiele.de> wrote:
But i clearly have a missunderstanding of other chars, like umlauts or utf-8
chars. This, for example, should return a 'ö':# SELECT chr(x'C3B6'::int);
chr
-----
쎶
(1 row)
That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded
and actually decodes to the codepoint 00F6.
There is a fundamental problem that a decoded URL may actually be a
binary string -- it might not have a textual representation at all.
But if text is what you want, RFC3986 strongly suggests using UTF-8
for encoding text strings in URLs, and that works almost always in the
real world.
So the *right* way is to first convert the URL to a binary "bytea"
type by fixing all the % escapes, then convert that to UTF-8 encoding
to handle multibyte characters.
What I came up with is far from elegant because PostgreSQL lacks
convenient functions for bytea manipulation (no bytea_agg, etc).
Stealing a little from Merlin, this is what it looks like:
CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
bin bytea = '';
byte text;
BEGIN
FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP
IF length(byte) = 3 THEN
bin = bin || decode(substring(byte, 2, 2), 'hex');
ELSE
bin = bin || byte::bytea;
END IF;
END LOOP;
RETURN convert_from(bin, 'utf8');
END
$$;
db=# select url_decode('Hell%C3%B6%20World%21');
url_decode
--------------
Hellö World!
db=# select url_decode('%EC%8E%B6');
url_decode
------------
쎶
This will break for binary-encoded data in URLs, though.
db=# select url_decode('%fa%fa%fa');
ERROR: invalid byte sequence for encoding "UTF8": 0xfa
CONTEXT: PL/pgSQL function "url_decode" line 13 at RETURN
----
On Fri, Dec 2, 2011 at 17:46, Merlin Moncure <mmoncure@gmail.com> wrote:
set client_encoding to latin1;
postgres=# select unencode('Hell%C3%B6%20World%21');
unencode
---------------
Hellö World!
(1 row)
Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.
Regards,
Marti
On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp <marti@juffo.org> wrote:
Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.
yup -- your're right -- what a coincidence! I still prefer the 1
liner sql variant vs plpgsql loop though. nicely done.
merlin
2011/12/2 Merlin Moncure <mmoncure@gmail.com>:
On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp <marti@juffo.org> wrote:
Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.yup -- your're right -- what a coincidence! I still prefer the 1
liner sql variant vs plpgsql loop though. nicely done.
so bytea_agg - one param aggregate has sense
it's very easy to implement it
Pavel
Show quoted text
merlin
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Fri, Dec 2, 2011 at 11:15 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
2011/12/2 Merlin Moncure <mmoncure@gmail.com>:
On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp <marti@juffo.org> wrote:
Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.yup -- your're right -- what a coincidence! I still prefer the 1
liner sql variant vs plpgsql loop though. nicely done.so bytea_agg - one param aggregate has sense
it's very easy to implement it
yup:
create aggregate bytea_agg (bytea)
(
sfunc=byteacat,
stype=bytea
);
merlin
2011/12/2 Merlin Moncure <mmoncure@gmail.com>:
On Fri, Dec 2, 2011 at 11:15 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
2011/12/2 Merlin Moncure <mmoncure@gmail.com>:
On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp <marti@juffo.org> wrote:
Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.yup -- your're right -- what a coincidence! I still prefer the 1
liner sql variant vs plpgsql loop though. nicely done.so bytea_agg - one param aggregate has sense
it's very easy to implement it
yup:
create aggregate bytea_agg (bytea)
(
sfunc=byteacat,
stype=bytea
);
this is workaround :)
without a memory preallocating it has same speed like cycle in plpgsql.
Regards
Pavel
Show quoted text
merlin
On Fri, Dec 2, 2011 at 2:01 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
so bytea_agg - one param aggregate has sense
it's very easy to implement it
yup:
create aggregate bytea_agg (bytea)
(
sfunc=byteacat,
stype=bytea
);this is workaround :)
without a memory preallocating it has same speed like cycle in plpgsql.
sure, but I prefer to code against the workaround because it's cleaner
and it makes things easier to port over when such a feature makes it
into core. also, one liner sql has better chance of inlining as a
general rule.
merlin
2011/12/2 Merlin Moncure <mmoncure@gmail.com>:
On Fri, Dec 2, 2011 at 2:01 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
so bytea_agg - one param aggregate has sense
it's very easy to implement it
yup:
create aggregate bytea_agg (bytea)
(
sfunc=byteacat,
stype=bytea
);this is workaround :)
without a memory preallocating it has same speed like cycle in plpgsql.
sure, but I prefer to code against the workaround because it's cleaner
and it makes things easier to port over when such a feature makes it
into core. also, one liner sql has better chance of inlining as a
general rule.
ook
Pavel
Show quoted text
merlin
Marti Raudsepp schrieb:
On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff
<foo@meisterderspiele.de> wrote:But i clearly have a missunderstanding of other chars, like umlauts or utf-8
chars. This, for example, should return a 'ö':# SELECT chr(x'C3B6'::int);
chr
-----
쎶
(1 row)That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded
and actually decodes to the codepoint 00F6.There is a fundamental problem that a decoded URL may actually be a
binary string -- it might not have a textual representation at all.
But if text is what you want, RFC3986 strongly suggests using UTF-8
for encoding text strings in URLs, and that works almost always in the
real world.
Text is what i want. :) I've created a highly specialiced CMS, which
handle a bunch of big sites (in meaning of a great numbers of users and
content). It has a build-in traffic-analyze and with this function it
creates a real time analyze of the keywords, a user used to find the
sites in search engines. This is very needful if you try to do SEO for
websites with more than 20.000 unique content-pages. :)
CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
bin bytea = '';
byte text;
BEGIN
FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP
IF length(byte) = 3 THEN
bin = bin || decode(substring(byte, 2, 2), 'hex');
ELSE
bin = bin || byte::bytea;
END IF;
END LOOP;
RETURN convert_from(bin, 'utf8');
END
$$;
Hey, this function looks similar to my encoding function :) Thank you
very munch!
This will break for binary-encoded data in URLs, though.
Thats no problem, i just have text.
Big thanks to all of you,
Torsten
Hello all
just note
9.1 will have a bytea_agg aggregate
regards
Pavel Stehule
2011/12/2 Marti Raudsepp <marti@juffo.org>:
Show quoted text
On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff
<foo@meisterderspiele.de> wrote:But i clearly have a missunderstanding of other chars, like umlauts or utf-8
chars. This, for example, should return a 'ö':# SELECT chr(x'C3B6'::int);
chr
-----
쎶
(1 row)That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded
and actually decodes to the codepoint 00F6.There is a fundamental problem that a decoded URL may actually be a
binary string -- it might not have a textual representation at all.
But if text is what you want, RFC3986 strongly suggests using UTF-8
for encoding text strings in URLs, and that works almost always in the
real world.So the *right* way is to first convert the URL to a binary "bytea"
type by fixing all the % escapes, then convert that to UTF-8 encoding
to handle multibyte characters.What I came up with is far from elegant because PostgreSQL lacks
convenient functions for bytea manipulation (no bytea_agg, etc).
Stealing a little from Merlin, this is what it looks like:CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
bin bytea = '';
byte text;
BEGIN
FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP
IF length(byte) = 3 THEN
bin = bin || decode(substring(byte, 2, 2), 'hex');
ELSE
bin = bin || byte::bytea;
END IF;
END LOOP;
RETURN convert_from(bin, 'utf8');
END
$$;db=# select url_decode('Hell%C3%B6%20World%21');
url_decode
--------------
Hellö World!db=# select url_decode('%EC%8E%B6');
url_decode
------------
쎶This will break for binary-encoded data in URLs, though.
db=# select url_decode('%fa%fa%fa');
ERROR: invalid byte sequence for encoding "UTF8": 0xfa
CONTEXT: PL/pgSQL function "url_decode" line 13 at RETURN----
On Fri, Dec 2, 2011 at 17:46, Merlin Moncure <mmoncure@gmail.com> wrote:
set client_encoding to latin1;
postgres=# select unencode('Hell%C3%B6%20World%21');
unencode
---------------
Hellö World!
(1 row)Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.Regards,
Marti--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general