pl/perl and utf-8 in sql_ascii databases
Hi,
we have a database that is storing strings in various encodings (and
non-encodings, namely the arbitrary byte soup that you might see in
email headers from the internet). For this reason, the database uses
sql_ascii encoding. The columns are text, as most characters are
ascii, so bytea didn't seem the right way to go.
Currently we are on 8.3 and try to upgrade to 9.1, but the plperlu
functions we have are acting up.
Old behavior on 8.3 .. 9.0:
sql_ascii =# create or replace function whitespace(text) returns text
language plperlu as $$ $a = shift; $a =~ s/[\t ]+/ /g; return $a; $$;
CREATE FUNCTION
sql_ascii =# select whitespace (E'\200'); -- 0x80 is not valid utf-8
whitespace
------------
sql_ascii =# select whitespace (E'\200')::bytea;
whitespace
------------
\x80
New behavior on 9.1.2:
sql_ascii =# select whitespace (E'\200');
ERROR: XX000: Malformed UTF-8 character (fatal) at line 1.
KONTEXT: PL/Perl function "whitespace"
ORT: plperl_call_perl_func, plperl.c:2037
A crude workaround is:
sql_ascii =# create or replace function whitespace_utf8_off(text)
returns text language plperlu as $$ use Encode; $a = shift;
Encode::_utf8_off($a); $a =~ s/[\t ]+/ /g; return $a; $$;
CREATE FUNCTION
sql_ascii =# select whitespace_utf8_off (E'\200');
whitespace_utf8_off
---------------------
\u0080
sql_ascii =# select whitespace_utf8_off (E'\200')::bytea;
whitespace_utf8_off
---------------------
\xc280
(Note that the workaround is not perfect as the resulting 0x80..0xff
bytes are still tagged to be utf8.)
I think the bug is in plperl_helpers.h:
/*
* Create a new SV from a string assumed to be in the current database's
* encoding.
*/
static inline SV *
cstr2sv(const char *str)
{
SV *sv;
char *utf8_str = utf_e2u(str);
sv = newSVpv(utf8_str, 0);
SvUTF8_on(sv);
pfree(utf8_str);
return sv;
}
In sql_ascii databases, utf_e2u does not do any recoding, but then
SvUTF8_on still marks the string as utf-8, while it isn't.
(Returned values might also need fixing.)
In my view, this is clearly a bug in pl/perl on sql_ascii databases.
Christoph
--
cb@df7cb.de | http://www.df7cb.de/
On Thu, Feb 9, 2012 at 03:21, Christoph Berg <cb@df7cb.de> wrote:
Hi,
we have a database that is storing strings in various encodings (and
non-encodings, namely the arbitrary byte soup [ ... ]
For this reason, the database uses
sql_ascii encoding
...snip...
In sql_ascii databases, utf_e2u does not do any recoding, but then
SvUTF8_on still marks the string as utf-8, while it isn't.(Returned values might also need fixing.)
In my view, this is clearly a bug in pl/perl on sql_ascii databases.
Yeah, there was some musing about this over in:
http://archives.postgresql.org/pgsql-hackers/2011-02/msg01142.php
Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr
and SvPVUTF8() when turning a perl string into a cstring.
With the attached I get:
=> create or replace function perl_white(a text) returns text as $$
return shift; $$ language plperlu;
=> select perl_white(E'\200'), perl_white(E'\200')::bytea,
coalesce(perl_white(E'\200'), 'null');
perl_white | perl_white | coalesce
------------+------------+----------
| \x80 |
=> select perl_white(E'\401');
perl_white
------------
\x01
(1 row)
Does the attached fix the issue for you?
Ill note that all the pls seem to behave a bit differently:
=> create or replace function py_white(a text) returns text as $$
return a; $$ language plpython3u;
=> select py_white(E'\200'), py_white(E'\200')::bytea,
coalesce(py_white(E'\200'), 'null');
py_white | py_white | coalesce
----------+----------+----------
| | null
(1 row)
=>select py_white(E'\401');
py_white
----------
\x01
(1 row)
=> create or replace function tcl_white(text) returns text as $$
return $1; $$ language pltcl;
=> select tcl_white(E'\200'), tcl_white(E'\200')::bytea,
coalesce(tcl_white(E'\200'), 'null');
tcl_white | tcl_white | coalesce
-----------+-----------+----------
| \x80 |
=> select tcl_white(E'\402');
tcl_white
-----------
\x02
(1 row)
Attachments:
plperl_sql_ascii.patchtext/x-patch; charset=US-ASCII; name=plperl_sql_ascii.patchDownload+34-35
Re: Alex Hunsaker 2012-02-10 <CAFaPBrR9y1fu6gpVu+8TA8vTY6QVCm3DfarKT8JG_EhGeTXnDA@mail.gmail.com>
Does the attached fix the issue for you?
Yes. :)
Christoph
--
cb@df7cb.de | http://www.df7cb.de/
Excerpts from Alex Hunsaker's message of vie feb 10 16:53:05 -0300 2012:
Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr
and SvPVUTF8() when turning a perl string into a cstring.
Hmm, this patch belongs into back branches too, right? Not just the
current development tree?
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Mon, Jun 18, 2012 at 3:30 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Excerpts from Alex Hunsaker's message of vie feb 10 16:53:05 -0300 2012:
Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr
and SvPVUTF8() when turning a perl string into a cstring.Hmm, this patch belongs into back branches too, right? Not just the
current development tree?
It seems like a bug fix to me.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jun 18, 2012 at 1:30 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Excerpts from Alex Hunsaker's message of vie feb 10 16:53:05 -0300 2012:
Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr
and SvPVUTF8() when turning a perl string into a cstring.Hmm, this patch belongs into back branches too, right? Not just the
current development tree?
(Have been out of town, sorry for the late reply)
Yeah, back to 9.1.
Excerpts from Robert Haas's message of mar jun 19 11:36:41 -0400 2012:
On Mon, Jun 18, 2012 at 3:30 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:Excerpts from Alex Hunsaker's message of vie feb 10 16:53:05 -0300 2012:
Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr
and SvPVUTF8() when turning a perl string into a cstring.Hmm, this patch belongs into back branches too, right? Not just the
current development tree?It seems like a bug fix to me.
That's what I thought. I will commit to both branches soon, then.
Mind you, this should have been an "open item", not a commitfest item.
(Actually not even an open item. We should have committed it right
away.)
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Jun 19, 2012 at 3:03 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
That's what I thought. I will commit to both branches soon, then.
I think there might be three branches involved.
Mind you, this should have been an "open item", not a commitfest item.
(Actually not even an open item. We should have committed it right
away.)
True, but it's not a perfect world, and I didn't feel qualified to
commit it myself. Adding it the CommitFest at least prevented it from
getting lost forever.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Excerpts from Alex Hunsaker's message of vie feb 10 16:53:05 -0300 2012:
Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr
and SvPVUTF8() when turning a perl string into a cstring.
Right.
So I played a bit with this patch, and touched it a bit mainly just to
add some more comments; and while at it I noticed that some of the
functions in Util.xs might leak some memory, so I made an attempt to
plug them, as in the attached patch (which supersedes yours).
Now, with my version of the patch applied and using a SQL_ASCII database
to test the problem in the original report, I notice that we now have a
regression failure:
*** /pgsql/source/HEAD/src/pl/plperl/expected/plperl.out 2012-05-16 13:38:02.495647259 -0400
--- /home/alvherre/Code/pgsql/build/HEAD/src/pl/plperl/results/plperl.out 2012-06-20 15:09:19.869778824 -0400
***************
*** 658,664 ****
return "abcd\0efg";
$$ LANGUAGE plperl;
SELECT perl_zerob();
! ERROR: invalid byte sequence for encoding "UTF8": 0x00
CONTEXT: PL/Perl function "perl_zerob"
-- make sure functions marked as VOID without an explicit return work
CREATE OR REPLACE FUNCTION myfuncs() RETURNS void AS $$
--- 658,664 ----
return "abcd\0efg";
$$ LANGUAGE plperl;
SELECT perl_zerob();
! ERROR: invalid byte sequence for encoding "SQL_ASCII": 0x00
CONTEXT: PL/Perl function "perl_zerob"
-- make sure functions marked as VOID without an explicit return work
CREATE OR REPLACE FUNCTION myfuncs() RETURNS void AS $$
======================================================================
I'm not really sure what to do here -- maybe have a second expected file
for that test is a good enough answer? Or should I just take the test
out? Opinions please.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Attachments:
plperl_sql_ascii-2.patchapplication/octet-stream; name=plperl_sql_ascii-2.patchDownload+64-27
Hello,
Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr
and SvPVUTF8() when turning a perl string into a cstring.Right.
I spent a bit longer time catching on pl/perl and now understand
what is the problem...
So I played a bit with this patch, and touched it a bit mainly just to
add some more comments; and while at it I noticed that some of the
functions in Util.xs might leak some memory, so I made an attempt to
plug them, as in the attached patch (which supersedes yours).
Ok, Is it ok to look into the newer patch including fix of leaks
at first?
-- Coding and styles.
This also seems to have polished the previous one on some codes,
styles and comments which generally look reasonable. And patch
style was corrected into unified.
-- Functions
I seems to work properly on the database the encodings of which
are SQL_ASCII and UTF8 (and EUC-JP) as below,
=================
=> create or replace function foo(text) returns text language plperlu as $$ $a = shift; return "BOO!" if ($a != "a\x80cあ"); return $a; $$;
SQL_ASCII=> select foo(E'a\200cあ') = E'a\200cあ';
?column?
----------
t
UTF8=> select foo(E'a\200cあ');
ERROR: invalid byte sequence for encoding "UTF8": 0x80
UTF8=> select foo(E'a\302\200cあ') = E'a\u0080cあ';
?column?
----------
t
=================
This looks quite valid according to the definition of the
encodings and perl's nature as far as I see.
-- The others
Variable naming in util_quote_*() seems a bit confusing,
text *arg = sv2text(sv);
text *ret = DatumGetTextP(..., PointerGetDatum(arg)));
char *str;
pfree(arg);
str = text_to_cstring(ret);
RETVAL = cstr2sv(str);
pfree(str);
Renaming ret to quoted and str to ret as the patch attached might
make it easily readable.
Now, with my version of the patch applied and using a SQL_ASCII database
to test the problem in the original report, I notice that we now have a
regression failure:
snip.
I'm not really sure what to do here -- maybe have a second expected file
for that test is a good enough answer? Or should I just take the test
out? Opinions please.
The attached ugly patch does it. We seem should put NO_LOCALE=1
on the 'make check' command line for the encodings not compatible
with the environmental locale, although it looks work.
# UtfToLocal() seems to have a bug that always report illegal
# encoding was "UTF8" regardless of the real encoding. But
# plper_lc_*.(sql|out) increases if the bug is fixed.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
== My e-mail address has been changed since Apr. 1, 2012.
Ouch!
# UtfToLocal() seems to have a bug that always report illegal
# encoding was "UTF8" regardless of the real encoding. But
# plper_lc_*.(sql|out) increases if the bug is fixed.
This is not a bug. The error message "invalid byte sequence for
encoding "UTF8"" meant "invalid INPUT byte sequence as encoding
"UTF8"" not encoding for output.. So the regtest patch covers all
of possible patterns - UTF8 and SQL_ASCII..
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
== My e-mail address has been changed since Apr. 1, 2012.
On Wed, Jun 20, 2012 at 1:15 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Excerpts from Alex Hunsaker's message of vie feb 10 16:53:05 -0300 2012:
Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr
and SvPVUTF8() when turning a perl string into a cstring.Right.
So I played a bit with this patch, and touched it a bit mainly just to
add some more comments; and while at it I noticed that some of the
functions in Util.xs might leak some memory, so I made an attempt to
plug them, as in the attached patch (which supersedes yours).
I think most of these leaks go back to 9.0. Dunno if its worth
backpatching them...
to test the problem in the original report, I notice that we now have a
regression failure:
I'm not really sure what to do here -- maybe have a second expected file
for that test is a good enough answer? Or should I just take the test
out? Opinions please.
I think we have broken that check twice so it seems like it would be
nice to keep. But I don't feel *to* strongly about it.
The comment and cleanups all look good to me.
Thanks!
On Thu, Jun 21, 2012 at 5:22 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
So I played a bit with this patch, and touched it a bit mainly
[...] functions in Util.xs might leak some memory, so I made an attempt to
Ok, Is it ok to look into the newer patch including fix of leaks
at first?
Yeah :-).
Variable naming in util_quote_*() seems a bit confusing,
Renaming ret to quoted and str to ret as the patch attached might
make it easily readable.
Ok.
[ regression failure on a SQL_ASCII database ]
I'm not really sure what to do here -- maybe have a second expected file
for that test is a good enough answer? Or should I just take the test
out? Opinions please.
The attached ugly patch does it. We seem should put NO_LOCALE=1
on the 'make check' command line for the encodings not compatible
with the environmental locale, although it looks work.
+REGRESS_LC0 = $(subst .sql,,$(shell cd sql; ls plperl_lc_$(shell echo
$(ENCODING) | tr "A-Z-" "a-z_").sql 2>/dev/null))
+REGRESS_LC = $(if $(REGRESS_LC0),$(REGRESS_LC0),plperl_lc)
+REGRESS = plperl $(REGRESS_LC) plperl_trigger plperl_shared
plperl_elog plperl_util plperl_init plperlu plperl_array
Hrm, that's quite cute. I dunno if there is a more cannon way of doing
the above-- but it seems to work. I'm not sure this regression test is
worth it. I'm thinking maybe we should just remove
theegressionegression test instead.
There is a minor issue with the patch where
sql/plperl_lc_sql_ascii.sql contains the text "plperl_lc.sql". After
copying sql/plperl_lc.sql to sql/plperl_lc_sql_ascii.sql everything
worked as described.
Thanks for the review!
Excerpts from Kyotaro HORIGUCHI's message of jue jun 21 08:02:58 -0400 2012:
Ouch!
# UtfToLocal() seems to have a bug that always report illegal
# encoding was "UTF8" regardless of the real encoding. But
# plper_lc_*.(sql|out) increases if the bug is fixed.This is not a bug. The error message "invalid byte sequence for
encoding "UTF8"" meant "invalid INPUT byte sequence as encoding
"UTF8"" not encoding for output.. So the regtest patch covers all
of possible patterns - UTF8 and SQL_ASCII..
Right, that confused me too for a while.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Excerpts from Kyotaro HORIGUCHI's message of jue jun 21 07:22:43 -0400 2012:
Hello,
Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr
and SvPVUTF8() when turning a perl string into a cstring.Right.
I spent a bit longer time catching on pl/perl and now understand
what is the problem...
Hi,
Thanks for the review.
So I played a bit with this patch, and touched it a bit mainly just to
add some more comments; and while at it I noticed that some of the
functions in Util.xs might leak some memory, so I made an attempt to
plug them, as in the attached patch (which supersedes yours).Ok, Is it ok to look into the newer patch including fix of leaks
at first?
Yeah, I'm going to have to backpatch the whole of it so having full
review is good. Thanks.
-- The others
Variable naming in util_quote_*() seems a bit confusing,
text *arg = sv2text(sv);
text *ret = DatumGetTextP(..., PointerGetDatum(arg)));
char *str;
pfree(arg);
str = text_to_cstring(ret);
RETVAL = cstr2sv(str);
pfree(str);Renaming ret to quoted and str to ret as the patch attached might
make it easily readable.
I think I'm going to refrain from this because it will be more painful
to backpatch.
Now, with my version of the patch applied and using a SQL_ASCII database
to test the problem in the original report, I notice that we now have a
regression failure:snip.
I'm not really sure what to do here -- maybe have a second expected file
for that test is a good enough answer? Or should I just take the test
out? Opinions please.The attached ugly patch does it. We seem should put NO_LOCALE=1
on the 'make check' command line for the encodings not compatible
with the environmental locale, although it looks work.
The idea of separating the test into its own file has its merit; but
instead of having two different tests, I'm going to have a single test
and two expected files. That seems simpler than messing around in the
makefile.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Excerpts from Alex Hunsaker's message of jue jun 21 10:27:41 -0400 2012:
On Wed, Jun 20, 2012 at 1:15 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:Excerpts from Alex Hunsaker's message of vie feb 10 16:53:05 -0300 2012:
Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr
and SvPVUTF8() when turning a perl string into a cstring.Right.
So I played a bit with this patch, and touched it a bit mainly just to
add some more comments; and while at it I noticed that some of the
functions in Util.xs might leak some memory, so I made an attempt to
plug them, as in the attached patch (which supersedes yours).I think most of these leaks go back to 9.0. Dunno if its worth
backpatching them...
Well, nobody has ever complained about them so maybe they're just not
worth the effort.
to test the problem in the original report, I notice that we now have a
regression failure:I'm not really sure what to do here -- maybe have a second expected file
for that test is a good enough answer? Or should I just take the test
out? Opinions please.I think we have broken that check twice so it seems like it would be
nice to keep. But I don't feel *to* strongly about it.
Hmm, if we're broken it then it's probably best to keep it.
The comment and cleanups all look good to me.
Thanks for the review.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Hello.
The attached ugly patch does it. We seem should put NO_LOCALE=1
on the 'make check' command line for the encodings not compatible
with the environmental locale, although it looks work.+REGRESS_LC0 = $(subst .sql,,$(shell cd sql; ls plperl_lc_$(shell echo
snip.
Hrm, that's quite cute. I dunno if there is a more cannon way of doing
the above-- but it seems to work. I'm not sure this regression test is
worth it. I'm thinking maybe we should just remove
theegressionegression test instead.
I agree. That is the fundamental question. I've coded just for my
fun but I don't see not so much signicance to do that. We might
omit the test for this which is non-ciritical and corner cases.
I'll leave it to your decision whether to do that.
There is a minor issue with the patch where
sql/plperl_lc_sql_ascii.sql contains the text "plperl_lc.sql". After
copying sql/plperl_lc.sql to sql/plperl_lc_sql_ascii.sql everything
worked as described.
Ah. It is what was a simbolic link. I made the patch with doubt
whether symlink could be encoded into diff, and saw the doubious
result but left as it is :-p. I leaned that no meaningful
symbolic-link cannot be used in source tree managed by git.
--
Kyotaro Horiguchi
NTT Open Source Software Center
== My e-mail address has been changed since Apr. 1, 2012.
Hello.
Renaming ret to quoted and str to ret as the patch attached might
make it easily readable.I think I'm going to refrain from this because it will be more painful
to backpatch.
I've felt hesitation to do so, too.
The new patch is indeed avoid leaks although which does not lasts
permanently. This will preserve more heap for future work but has
no necessity to do.
On the other hand, the return value of DatumGetTextP() is also
may (and 'should' in this case) palloc'ed but not pfree'd like
other part of PostgreSQL source (as far as I saw..) because of, I
suppose, the nature of these functions that it is
difficult/unable to be predicted/determined whether returning
memory block is palloc'ed ones or not. And the pain to maintain
such codes unrobust for future modification.
From such a point of view, we might be good to refrain to
backport this.
The attached ugly patch does it. We seem should put NO_LOCALE=1
on the 'make check' command line for the encodings not compatible
with the environmental locale, although it looks work.The idea of separating the test into its own file has its merit; but
instead of having two different tests, I'm going to have a single test
and two expected files. That seems simpler than messing around in the
makefile.
Yes, you're right. But it was easier to add pairs of .sql and
.out to do that. Plus, as I wrote in another message, I'm
unwilling to push it nevertheless I've wrote it:-(
--
Kyotaro Horiguchi
NTT Open Source Software Center
== My e-mail address has been changed since Apr. 1, 2012.
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
+REGRESS_LC0 = $(subst .sql,,$(shell cd sql; ls plperl_lc_$(shell echo
Hrm, that's quite cute. I dunno if there is a more cannon way of doing
the above-- but it seems to work. I'm not sure this regression test is
worth it. I'm thinking maybe we should just remove
theegressionegression test instead.
I agree. That is the fundamental question. I've coded just for my
fun but I don't see not so much signicance to do that. We might
omit the test for this which is non-ciritical and corner cases.
We need these tests to work on Windows too, so fancy gmake tricks are
probably not the way to deal with varying results.
regards, tom lane
Hello, thank you for your sugestion.
I agree. That is the fundamental question. I've coded just for my
fun but I don't see not so much signicance to do that. We might
omit the test for this which is non-ciritical and corner cases.We need these tests to work on Windows too, so fancy gmake tricks are
probably not the way to deal with varying results.
Hmm. I understand that you suggested that we should do this in
normal regression test.
Ok. Since there found to be only two patterns in the regression
test. The fancy thing is no more needed. I will unfold them and
make sure to work on mingw build environment.
And for one more environment, on the one with VC++.. I'll need a
bit longer time to make out what vcregress.pl does.
On the other things, I will decide as following and sent to
committer as soon as the above is finished.
- The main patch fixes the sql-ascii handling itself shoud ported
into 9.2 and 9.1. Someone shoud work for this. (me?)
- The remainder of the patch whic fixes the easy fixable leakes
of palloc'ed memory won't be ported into 9.1. This is only for
9.3dev.
- The patch for 9.3dev will be provided with the new regression
test. It will be easily ported into 9.1 and 9.2 and there seems
to be no problem technically, but a bit unsure from the other
points of view...
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
== My e-mail address has been changed since Apr. 1, 2012.