pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

Started by Andrew Dunstanalmost 15 years ago24 messages
#1Andrew Dunstan
andrew@dunslane.net

Force strings passed to and from plperl to be in UTF8 encoding.

String are converted to UTF8 on the way into perl and to the
database encoding on the way back. This avoids a number of
observed anomalies, and ensures Perl a consistent view of the
world.

Some minor code cleanups are also accomplished.

Alex Hunsaker, reviewed by Andy Colson.

Branch
------
master

Details
-------
http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=50d89d422f9c68a52a6964e5468e8eb4f90b1d95

Modified Files
--------------
doc/src/sgml/plperl.sgml | 8 ++
src/pl/plperl/SPI.xs | 52 ++++++---
src/pl/plperl/Util.xs | 66 +++++-----
src/pl/plperl/plperl.c | 260 +++++++++++++++++++++++-----------------
src/pl/plperl/plperl_helpers.h | 69 +++++++++++
5 files changed, 295 insertions(+), 160 deletions(-)

#2Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#1)
Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On 02/06/2011 05:31 PM, Andrew Dunstan wrote:

Force strings passed to and from plperl to be in UTF8 encoding.

String are converted to UTF8 on the way into perl and to the
database encoding on the way back. This avoids a number of
observed anomalies, and ensures Perl a consistent view of the
world.

Some minor code cleanups are also accomplished.

Alex Hunsaker, reviewed by Andy Colson.

This has broken the buildfarm :-(

perl ppport.h reports:

*** WARNING: Uses HeUTF8, which may not be portable below perl
5.11.0, even with 'ppport.h'

Experimentation on a CentOS machine suggests we can cure it with this:

#ifndef HeUTF8
#define HeUTF8(he) ((HeKLEN(he) == HEf_SVKEY) ? \
SvUTF8(HeKEY_sv(he))
: \
(U32)HeKUTF8(he))
#endif

cheers

andrew

#3Alex Hunsaker
badalex@gmail.com
In reply to: Andrew Dunstan (#2)
Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On Sun, Feb 6, 2011 at 18:02, Andrew Dunstan <andrew@dunslane.net> wrote:

On 02/06/2011 05:31 PM, Andrew Dunstan wrote:

Force strings passed to and from plperl to be in UTF8 encoding.

String are converted to UTF8 on the way into perl and to the
database encoding on the way back. This avoids a number of
observed anomalies, and ensures Perl a consistent view of the
world.

Some minor code cleanups are also accomplished.

Alex Hunsaker, reviewed by Andy Colson.

This has broken the buildfarm :-(

Drat.

perl ppport.h reports:

  *** WARNING: Uses HeUTF8, which may not be portable below perl
  5.11.0, even with 'ppport.h'
Experimentation on a CentOS machine suggests we can cure it with this:

  #ifndef HeUTF8
  #define HeUTF8(he)             ((HeKLEN(he) == HEf_SVKEY) ?            \
                                   SvUTF8(HeKEY_sv(he))
  :                 \
                                   (U32)HeKUTF8(he))
  #endif

Yeah, that should work. BTW I would have loved to add some regression
tests for some of this (like the example hek2cstr states). Is there
any way to do that?

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Alex Hunsaker (#3)
Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On 02/06/2011 09:13 PM, Alex Hunsaker wrote:

I would have loved to add some regression
tests for some of this (like the example hek2cstr states). Is there
any way to do that?

I can't think of an obvious way. Anyone else?

cheers

ndrew

#5Alex Hunsaker
badalex@gmail.com
In reply to: Andrew Dunstan (#1)
1 attachment(s)
Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On Sun, Feb 6, 2011 at 15:31, Andrew Dunstan <andrew@dunslane.net> wrote:

Force strings passed to and from plperl to be in UTF8 encoding.

String are converted to UTF8 on the way into perl and to the
database encoding on the way back. This avoids a number of
observed anomalies, and ensures Perl a consistent view of the
world.

So I noticed a problem while playing with this in my discussion with
David Wheeler. pg_do_encoding() does nothing when the src encoding ==
the dest encoding. That means on a UTF-8 database we fail make sure
our strings are valid utf8.

An easy way to see this is to embed a null in the middle of a string:
=> create or replace function zerob() returns text as $$ return
"abcd\0efg"; $$ language plperl;
=> SELECT zerob();
abcd

Also It seems bogus to bogus to do any encoding conversion when we are
SQL_ASCII, and its really trivial to fix.

With the attached:
- when we are on a utf8 database make sure to verify our output string
in sv2cstr (we assume database strings coming in are already valid)

- Do no string conversion when we are SQL_ASCII in or out

- add plperl_helpers.h as a dep to plperl.o in our makefile

- remove some redundant calls to pg_verify_mbstr()

- as utf_e2u only as one caller dont pstrdup() instead have the caller
check (saves some cycles and memory)

Attachments:

plperl_utf8_mbverify.patchtext/x-patch; charset=US-ASCII; name=plperl_utf8_mbverify.patchDownload
*** a/doc/src/sgml/plperl.sgml
--- b/doc/src/sgml/plperl.sgml
***************
*** 127,135 **** $$ LANGUAGE plperl;
  
    <note>
      <para>
!       Arguments will be converted from the database's encoding to UTF-8 
!       for use inside plperl, and then converted from UTF-8 back to the 
!       database encoding upon return. 
      </para>
    </note>
  
--- 127,136 ----
  
    <note>
      <para>
!       Arguments will be converted from the database's encoding to
!       UTF-8 for use inside plperl, and then converted from UTF-8 back
!       to the database encoding upon return.  Unless the database
!       encoding is SQL_ASCII, then no conversion is done.
      </para>
    </note>
  
*** a/src/pl/plperl/GNUmakefile
--- b/src/pl/plperl/GNUmakefile
***************
*** 54,60 **** PSQLDIR = $(bindir)
  
  include $(top_srcdir)/src/Makefile.shlib
  
! plperl.o: perlchunks.h plperl_opmask.h
  
  plperl_opmask.h: plperl_opmask.pl
  	$(PERL) $< $@
--- 54,60 ----
  
  include $(top_srcdir)/src/Makefile.shlib
  
! plperl.o: perlchunks.h plperl_opmask.h plperl_helpers.h
  
  plperl_opmask.h: plperl_opmask.pl
  	$(PERL) $< $@
*** a/src/pl/plperl/plperl.c
--- b/src/pl/plperl/plperl.c
***************
*** 2308,2315 **** plperl_spi_exec(char *query, int limit)
  	{
  		int			spi_rv;
  
- 		pg_verifymbstr(query, strlen(query), false);
- 
  		spi_rv = SPI_execute(query, current_call_data->prodesc->fn_readonly,
  							 limit);
  		ret_hv = plperl_spi_execute_fetch_result(SPI_tuptable, SPI_processed,
--- 2308,2313 ----
***************
*** 2555,2563 **** plperl_spi_query(char *query)
  		void	   *plan;
  		Portal		portal;
  
- 		/* Make sure the query is validly encoded */
- 		pg_verifymbstr(query, strlen(query), false);
- 
  		/* Create a cursor for the query */
  		plan = SPI_prepare(query, 0, NULL);
  		if (plan == NULL)
--- 2553,2558 ----
***************
*** 2767,2775 **** plperl_spi_prepare(char *query, int argc, SV **argv)
  			qdesc->argtypioparams[i] = typIOParam;
  		}
  
- 		/* Make sure the query is validly encoded */
- 		pg_verifymbstr(query, strlen(query), false);
- 
  		/************************************************************
  		 * Prepare the plan and check for errors
  		 ************************************************************/
--- 2762,2767 ----
*** a/src/pl/plperl/plperl_helpers.h
--- b/src/pl/plperl/plperl_helpers.h
***************
*** 20,27 **** static inline char *
  utf_e2u(const char *str)
  {
  	char *ret = (char*)pg_do_encoding_conversion((unsigned char*)str, strlen(str), GetDatabaseEncoding(), PG_UTF8);
- 	if (ret == str)
- 		ret = pstrdup(ret);
  	return ret;
  }
  
--- 20,25 ----
***************
*** 34,46 **** sv2cstr(SV *sv)
  {
  	char *val;
  	STRLEN len;
  
  	/*
! 	 * get a utf8 encoded char * out of perl. *note* it may not be valid utf8!
  	 */
  	val = SvPVutf8(sv, len);
  
  	/*
  	 * we use perls length in the event we had an embedded null byte to ensure
  	 * we error out properly
  	 */
--- 32,59 ----
  {
  	char *val;
  	STRLEN len;
+ 	int enc = GetDatabaseEncoding();
  
  	/*
! 	 * we dont do any conversion when SQL_ASCII
  	 */
+ 	if (enc == PG_SQL_ASCII)
+ 	{
+ 		val = SvPV(sv, len);
+ 		return pstrdup(val);
+ 	}
+ 
  	val = SvPVutf8(sv, len);
  
  	/*
+ 	 * when the src encoding matches the dest encoding pg_do_encoding will not
+ 	 * do any verification. SvPVutf8() may return invalid utf8 so we need to do
+ 	 * that ourselves
+ 	 */
+ 	if (enc == PG_UTF8)
+ 		pg_verifymbstr(val, len, false);
+ 
+ 	/*
  	 * we use perls length in the event we had an embedded null byte to ensure
  	 * we error out properly
  	 */
***************
*** 56,67 **** static inline SV *
  cstr2sv(const char *str)
  {
  	SV *sv;
! 	char *utf8_str = utf_e2u(str);
  
  	sv = newSVpv(utf8_str, 0);
  	SvUTF8_on(sv);
  
! 	pfree(utf8_str);
  
  	return sv;
  }
--- 69,91 ----
  cstr2sv(const char *str)
  {
  	SV *sv;
! 	char *utf8_str;
  
+ 	/*
+ 	 * we dont do any conversion when SQL_ASCII leave it as byte soup
+ 	 */
+ 	if (GetDatabaseEncoding() != PG_SQL_ASCII)
+ 		return newSVpv(str, 0);
+ 
+ 	/*
+ 	 * for all other encodings we convert it to utf8
+ 	 */
+ 	utf8_str = utf_e2u(str);
  	sv = newSVpv(utf8_str, 0);
  	SvUTF8_on(sv);
  
! 	if (utf8_str != str)
! 		pfree(utf8_str);
  
  	return sv;
  }
#6Amit Khandekar
amit.khandekar@enterprisedb.com
In reply to: Alex Hunsaker (#5)
1 attachment(s)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On 12 February 2011 14:48, Alex Hunsaker <badalex@gmail.com> wrote:

On Sun, Feb 6, 2011 at 15:31, Andrew Dunstan <andrew@dunslane.net> wrote:

Force strings passed to and from plperl to be in UTF8 encoding.

String are converted to UTF8 on the way into perl and to the
database encoding on the way back. This avoids a number of
observed anomalies, and ensures Perl a consistent view of the
world.

So I noticed a problem while playing with this in my discussion with
David Wheeler. pg_do_encoding() does nothing when the src encoding ==
the dest encoding. That means on a UTF-8 database we fail make sure
our strings are valid utf8.

An easy way to see this is to embed a null in the middle of a string:
=> create or replace function zerob() returns text as $$ return
"abcd\0efg"; $$ language plperl;
=> SELECT zerob();
abcd

Also It seems bogus to bogus to do any encoding conversion when we are
SQL_ASCII, and its really trivial to fix.

With the attached:
- when we are on a utf8 database make sure to verify our output string
in sv2cstr (we assume database strings coming in are already valid)

- Do no string conversion when we are SQL_ASCII in or out

- add plperl_helpers.h as a dep to plperl.o in our makefile

- remove some redundant calls to pg_verify_mbstr()

- as utf_e2u only as one caller dont pstrdup() instead have the caller
check (saves some cycles and memory)

Is there a plan to commit this issue? I am still seeing this issue on
PG 9.1 STABLE branch. Attached is a small patch that targets only the
specific issue in the described testcase :

create or replace function zerob() returns text as $$ return
"abcd\0efg"; $$ language plperl;
SELECT zerob();

The patch does the perl data validation in the function utf_u2e() itself.

Show quoted text

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachments:

verify_in_utf_u2e.patchtext/x-diff; charset=US-ASCII; name=verify_in_utf_u2e.patchDownload
diff --git a/src/pl/plperl/plperl_helpers.h b/src/pl/plperl/plperl_helpers.h
index 81c177b..3afe2f5 100644
--- a/src/pl/plperl/plperl_helpers.h
+++ b/src/pl/plperl/plperl_helpers.h
@@ -10,7 +10,10 @@ utf_u2e(const char *utf8_str, size_t len)
 	char	   *ret = (char *) pg_do_encoding_conversion((unsigned char *) utf8_str, len, PG_UTF8, GetDatabaseEncoding());
 
 	if (ret == utf8_str)
+	{
+		pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);
 		ret = pstrdup(ret);
+	}
 	return ret;
 }
 
#7Alex Hunsaker
badalex@gmail.com
In reply to: Amit Khandekar (#6)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On Mon, Oct 3, 2011 at 04:20, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

Is there a plan to commit this issue? I am still seeing this issue on
PG 9.1 STABLE branch. Attached is a small patch that targets only the
specific issue in the described testcase :

create or replace function zerob() returns text as $$ return
"abcd\0efg"; $$ language plperl;
SELECT zerob();

The patch does the perl data validation in the function utf_u2e() itself.

I think thats fine, but as coded it will verify the string twice in
the GetDatabaseEncoding() != PG_UTF8 case (once for
pg_do_encoding_conversion() and again with the added
pg_verify_mbstr_len), which seems a bit wasteful.

It might be worth adding a regression test also...

#8Amit Khandekar
amit.khandekar@enterprisedb.com
In reply to: Alex Hunsaker (#7)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On 3 October 2011 22:37, Alex Hunsaker <badalex@gmail.com> wrote:

On Mon, Oct 3, 2011 at 04:20, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

Is there a plan to commit this issue? I am still seeing this issue on
PG 9.1 STABLE branch. Attached is a small patch that targets only the
specific issue in the described testcase :

create or replace function zerob() returns text as $$ return
"abcd\0efg"; $$ language plperl;
SELECT zerob();

The patch does the perl data validation in the function utf_u2e() itself.

I think thats fine, but as coded it will verify the string twice in
the GetDatabaseEncoding() != PG_UTF8 case (once for
pg_do_encoding_conversion() and again with the added
pg_verify_mbstr_len), which seems a bit wasteful.

WHen GetDatabaseEncoding() != PG_UTF8 case, ret will not be equal to
utf8_str, so pg_verify_mbstr_len() will not get called. That's the
reason, pg_verify_mbstr_len() is under the ( ret == utf8_str )
condition.

It might be worth adding a regression test also...

I could not find any basic pl/perl tests in the regression
serial_schedule. I am not sure if we want to add just this scenario
without any basic tests for pl/perl ?

Show quoted text
#9Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Amit Khandekar (#8)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On 04.10.2011 08:35, Amit Khandekar wrote:

On 3 October 2011 22:37, Alex Hunsaker<badalex@gmail.com> wrote:

It might be worth adding a regression test also...

I could not find any basic pl/perl tests in the regression
serial_schedule. I am not sure if we want to add just this scenario
without any basic tests for pl/perl ?

See src/pl/plperl/[sql|expected]

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#10Alex Hunsaker
badalex@gmail.com
In reply to: Amit Khandekar (#8)
1 attachment(s)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On Mon, Oct 3, 2011 at 23:35, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

WHen GetDatabaseEncoding() != PG_UTF8 case, ret will not be equal to
utf8_str, so pg_verify_mbstr_len() will not get called. That's the
reason, pg_verify_mbstr_len() is under the ( ret == utf8_str )
condition.

Consider a latin1 database where utf8_str was a string of ascii
characters. Then no conversion would take place and ret == utf8_str
but the string would be verified by pg_do_encdoing_conversion() and
verified again by your added check :-).

It might be worth adding a regression test also...

I could not find any basic pl/perl tests in the regression
serial_schedule. I am not sure if we want to add just this scenario
without any basic tests for pl/perl ?

I went ahead and added one in the attached based upon your example.

Look ok to you?

BTW thanks for the patch!

[ side note ]
I still think we should not be doing any conversion in the SQL_ASCII
case but this slimmed down patch should be less controversial.

Attachments:

plperl_verify_utf_u2e_v2.patchtext/x-patch; charset=US-ASCII; name=plperl_verify_utf_u2e_v2.patchDownload
*** a/src/pl/plperl/GNUmakefile
--- b/src/pl/plperl/GNUmakefile
***************
*** 57,63 **** PSQLDIR = $(bindir)
  
  include $(top_srcdir)/src/Makefile.shlib
  
! plperl.o: perlchunks.h plperl_opmask.h
  
  plperl_opmask.h: plperl_opmask.pl
  	@if [ x"$(perl_privlibexp)" = x"" ]; then echo "configure switch --with-perl was not specified."; exit 1; fi
--- 57,63 ----
  
  include $(top_srcdir)/src/Makefile.shlib
  
! plperl.o: perlchunks.h plperl_opmask.h plperl_helpers.h
  
  plperl_opmask.h: plperl_opmask.pl
  	@if [ x"$(perl_privlibexp)" = x"" ]; then echo "configure switch --with-perl was not specified."; exit 1; fi
*** a/src/pl/plperl/expected/plperl.out
--- b/src/pl/plperl/expected/plperl.out
***************
*** 639,641 **** CONTEXT:  PL/Perl anonymous code block
--- 639,651 ----
  DO $do$ use warnings FATAL => qw(void) ; my @y; my $x = sort @y; 1; $do$ LANGUAGE plperl;
  ERROR:  Useless use of sort in scalar context at line 1.
  CONTEXT:  PL/Perl anonymous code block
+ --
+ -- Make sure strings are validated -- This code may fail in a non-UTF8 database
+ -- if it allows null bytes in strings.
+ --
+ CREATE OR REPLACE FUNCTION perl_zerob() RETURNS TEXT AS $$
+   return "abcd\0efg";
+ $$ LANGUAGE plperlu;
+ SELECT perl_zerob();
+ ERROR:  invalid byte sequence for encoding "UTF8": 0x00
+ CONTEXT:  PL/Perl function "perl_zerob"
*** a/src/pl/plperl/plperl_helpers.h
--- b/src/pl/plperl/plperl_helpers.h
***************
*** 9,14 **** utf_u2e(const char *utf8_str, size_t len)
--- 9,22 ----
  {
  	char	   *ret = (char *) pg_do_encoding_conversion((unsigned char *) utf8_str, len, PG_UTF8, GetDatabaseEncoding());
  
+ 	/*
+ 	 * when src encoding == dest encoding (PG_UTF8 ==
+ 	 * GetDatabaseEncoding(), pg_do_encoding_conversion() is a noop and
+ 	 * does not verify the string so we need to do it manually
+ 	 */
+ 	if(GetDatabaseEncoding() == PG_UTF8)
+ 		pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);
+ 
  	if (ret == utf8_str)
  		ret = pstrdup(ret);
  	return ret;
*** a/src/pl/plperl/sql/plperl.sql
--- b/src/pl/plperl/sql/plperl.sql
***************
*** 415,417 **** DO $do$ use strict; my $name = "foo"; my $ref = $$name; $do$ LANGUAGE plperl;
--- 415,426 ----
  -- check that we can "use warnings" (in this case to turn a warn into an error)
  -- yields "ERROR:  Useless use of sort in scalar context."
  DO $do$ use warnings FATAL => qw(void) ; my @y; my $x = sort @y; 1; $do$ LANGUAGE plperl;
+ 
+ --
+ -- Make sure strings are validated -- This code may fail in a non-UTF8 database
+ -- if it allows null bytes in strings.
+ --
+ CREATE OR REPLACE FUNCTION perl_zerob() RETURNS TEXT AS $$
+   return "abcd\0efg";
+ $$ LANGUAGE plperlu;
+ SELECT perl_zerob();
#11Amit Khandekar
amit.khandekar@enterprisedb.com
In reply to: Alex Hunsaker (#10)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On 4 October 2011 14:04, Alex Hunsaker <badalex@gmail.com> wrote:

On Mon, Oct 3, 2011 at 23:35, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

WHen GetDatabaseEncoding() != PG_UTF8 case, ret will not be equal to
utf8_str, so pg_verify_mbstr_len() will not get called. That's the
reason, pg_verify_mbstr_len() is under the ( ret == utf8_str )
condition.

Consider a latin1 database where utf8_str was a string of ascii
characters. Then no conversion would take place and ret == utf8_str
but the string would be verified by pg_do_encdoing_conversion() and
verified again by your added check :-).

It might be worth adding a regression test also...

I could not find any basic pl/perl tests in the regression
serial_schedule. I am not sure if we want to add just this scenario
without any basic tests for pl/perl ?

I went ahead and added one in the attached based upon your example.

Look ok to you?

+ 	if(GetDatabaseEncoding() == PG_UTF8)
+ 		pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);

In your patch, the above will again skip mb-validation if the database
encoding is SQL_ASCII. Note that in pg_do_encoding_conversion returns
the un-converted string even if *one* of the src and dest encodings is
SQL_ASCII.

I think :
        if (ret == utf8_str)
+       {
+               pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);
                ret = pstrdup(ret);
+       }

This (ret == utf8_str) condition would be a reliable way for knowing
whether pg_do_encoding_conversion() has done the conversion at all.

I am ok with the new test. Thanks for doing it yourself !

Show quoted text

BTW thanks for the patch!

[ side note ]
I still think we should not be doing any conversion in the SQL_ASCII
case but this slimmed down patch should be less controversial.

#12Alex Hunsaker
badalex@gmail.com
In reply to: Amit Khandekar (#11)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On Tue, Oct 4, 2011 at 03:09, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

On 4 October 2011 14:04, Alex Hunsaker <badalex@gmail.com> wrote:

On Mon, Oct 3, 2011 at 23:35, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

WHen GetDatabaseEncoding() != PG_UTF8 case, ret will not be equal to
utf8_str, so pg_verify_mbstr_len() will not get called. [...]

Consider a latin1 database where utf8_str was a string of ascii
characters. [...]

[Patch] Look ok to you?

+       if(GetDatabaseEncoding() == PG_UTF8)
+               pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);

In your patch, the above will again skip mb-validation if the database
encoding is SQL_ASCII. Note that in pg_do_encoding_conversion returns
the un-converted string even if *one* of the src and dest encodings is
SQL_ASCII.

*scratches head* I thought the point of SQL_ASCII was no encoding
conversion was done and so there would be nothing to verify.

Ahh I see looks like pg_verify_mbstr_len() will make sure there are no
NULL bytes in the string when we are a single byte encoding.

I think :
       if (ret == utf8_str)
+       {
+               pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);
               ret = pstrdup(ret);
+       }

This (ret == utf8_str) condition would be a reliable way for knowing
whether pg_do_encoding_conversion() has done the conversion at all.

Yes. However (and maybe im nitpicking here), I dont see any reason to
verify certain strings twice if we can avoid it.

What do you think about:
+    /*
+    * when we are a PG_UTF8 or SQL_ASCII database pg_do_encoding_conversion()
+    * will not do any conversion or verification. we need to do it
manually instead.
+    */
+       if( GetDatabaseEncoding() == PG_UTF8 ||
              GetDatabaseEncoding() == SQL_ASCII)
+               pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);
#13Amit Khandekar
amit.khandekar@enterprisedb.com
In reply to: Alex Hunsaker (#12)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On 4 October 2011 22:57, Alex Hunsaker <badalex@gmail.com> wrote:

On Tue, Oct 4, 2011 at 03:09, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

On 4 October 2011 14:04, Alex Hunsaker <badalex@gmail.com> wrote:

On Mon, Oct 3, 2011 at 23:35, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

WHen GetDatabaseEncoding() != PG_UTF8 case, ret will not be equal to
utf8_str, so pg_verify_mbstr_len() will not get called. [...]

Consider a latin1 database where utf8_str was a string of ascii
characters. [...]

[Patch] Look ok to you?

+       if(GetDatabaseEncoding() == PG_UTF8)
+               pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);

In your patch, the above will again skip mb-validation if the database
encoding is SQL_ASCII. Note that in pg_do_encoding_conversion returns
the un-converted string even if *one* of the src and dest encodings is
SQL_ASCII.

*scratches head* I thought the point of SQL_ASCII was no encoding
conversion was done and so there would be nothing to verify.

Ahh I see looks like pg_verify_mbstr_len() will make sure there are no
NULL bytes in the string when we are a single byte encoding.

I think :
       if (ret == utf8_str)
+       {
+               pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);
               ret = pstrdup(ret);
+       }

This (ret == utf8_str) condition would be a reliable way for knowing
whether pg_do_encoding_conversion() has done the conversion at all.

Yes. However (and maybe im nitpicking here), I dont see any reason to
verify certain strings twice if we can avoid it.

What do you think about:
+    /*
+    * when we are a PG_UTF8 or SQL_ASCII database pg_do_encoding_conversion()
+    * will not do any conversion or verification. we need to do it
manually instead.
+    */
+       if( GetDatabaseEncoding() == PG_UTF8 ||
             GetDatabaseEncoding() == SQL_ASCII)
+               pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);

You mean the final changes in plperl_helpers.h would look like
something like this right? :

static inline char *
utf_u2e(const char *utf8_str, size_t len)
{
char *ret = (char *) pg_do_encoding_conversion((unsigned
char *) utf8_str, len, PG_UTF8, GetDatabaseEncoding());

        if (ret == utf8_str)
+       {
+               if (GetDatabaseEncoding() == PG_UTF8 ||
+                       GetDatabaseEncoding() == PG_SQL_ASCII)
+               {
+                       pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);
+               }
+
                ret = pstrdup(ret);
+       }
        return ret;
 }

Yeah I am ok with that. It's just an additional check besides (ret ==
utf8_str) to know if we really require validation.

#14Alex Hunsaker
badalex@gmail.com
In reply to: Amit Khandekar (#13)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On Tue, Oct 4, 2011 at 23:46, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

On 4 October 2011 22:57, Alex Hunsaker <badalex@gmail.com> wrote:

On Tue, Oct 4, 2011 at 03:09, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

On 4 October 2011 14:04, Alex Hunsaker <badalex@gmail.com> wrote:

On Mon, Oct 3, 2011 at 23:35, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

WHen GetDatabaseEncoding() != PG_UTF8 case, ret will not be equal to
utf8_str, so pg_verify_mbstr_len() will not get called. [...]

Consider a latin1 database where utf8_str was a string of ascii
characters. [...]

[Patch] Look ok to you?

+       if(GetDatabaseEncoding() == PG_UTF8)
+               pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);

In your patch, the above will again skip mb-validation if the database
encoding is SQL_ASCII. Note that in pg_do_encoding_conversion returns
the un-converted string even if *one* of the src and dest encodings is
SQL_ASCII.

*scratches head* I thought the point of SQL_ASCII was no encoding
conversion was done and so there would be nothing to verify.

Ahh I see looks like pg_verify_mbstr_len() will make sure there are no
NULL bytes in the string when we are a single byte encoding.

I think :
       if (ret == utf8_str)
+       {
+               pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);
               ret = pstrdup(ret);
+       }

This (ret == utf8_str) condition would be a reliable way for knowing
whether pg_do_encoding_conversion() has done the conversion at all.

Yes. However (and maybe im nitpicking here), I dont see any reason to
verify certain strings twice if we can avoid it.

What do you think about:
+    /*
+    * when we are a PG_UTF8 or SQL_ASCII database pg_do_encoding_conversion()
+    * will not do any conversion or verification. we need to do it
manually instead.
+    */
+       if( GetDatabaseEncoding() == PG_UTF8 ||
             GetDatabaseEncoding() == SQL_ASCII)
+               pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);

You mean the final changes in plperl_helpers.h would look like
something like this right? :

 static inline char *
 utf_u2e(const char *utf8_str, size_t len)
 {
       char       *ret = (char *) pg_do_encoding_conversion((unsigned
char *) utf8_str, len, PG_UTF8, GetDatabaseEncoding());

       if (ret == utf8_str)
+       {
+               if (GetDatabaseEncoding() == PG_UTF8 ||
+                       GetDatabaseEncoding() == PG_SQL_ASCII)
+               {
+                       pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);
+               }
+
               ret = pstrdup(ret);
+       }
       return ret;
 }

Yes.

Show quoted text

Yeah I am ok with that. It's just an additional check besides (ret ==
utf8_str) to know if we really require validation.

#15Alex Hunsaker
badalex@gmail.com
In reply to: Alex Hunsaker (#14)
1 attachment(s)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On Wed, Oct 5, 2011 at 00:30, Alex Hunsaker <badalex@gmail.com> wrote:

On Tue, Oct 4, 2011 at 23:46, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

You mean the final changes in plperl_helpers.h would look like
something like this right? :

 static inline char *
 utf_u2e(const char *utf8_str, size_t len)
 {
       char       *ret = (char *) pg_do_encoding_conversion((unsigned
char *) utf8_str, len, PG_UTF8, GetDatabaseEncoding());

       if (ret == utf8_str)
+       {
+               if (GetDatabaseEncoding() == PG_UTF8 ||
+                       GetDatabaseEncoding() == PG_SQL_ASCII)
+               {
+                       pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);
+               }
+
               ret = pstrdup(ret);
+       }
       return ret;
 }

Yeah I am ok with that. It's just an additional check besides (ret ==
utf8_str) to know if we really require validation.

Find it attached. [ Note I didn't put the check inside the if (ret ==
utf8_str) as it seemed a bit cleaner (indentation wise) to have it
outside ]

Attachments:

plperl_verify_utf_u2e_v3.patchtext/x-patch; charset=US-ASCII; name=plperl_verify_utf_u2e_v3.patchDownload
*** a/src/pl/plperl/GNUmakefile
--- b/src/pl/plperl/GNUmakefile
***************
*** 57,63 **** PSQLDIR = $(bindir)
  
  include $(top_srcdir)/src/Makefile.shlib
  
! plperl.o: perlchunks.h plperl_opmask.h
  
  plperl_opmask.h: plperl_opmask.pl
  	@if [ x"$(perl_privlibexp)" = x"" ]; then echo "configure switch --with-perl was not specified."; exit 1; fi
--- 57,63 ----
  
  include $(top_srcdir)/src/Makefile.shlib
  
! plperl.o: perlchunks.h plperl_opmask.h plperl_helpers.h
  
  plperl_opmask.h: plperl_opmask.pl
  	@if [ x"$(perl_privlibexp)" = x"" ]; then echo "configure switch --with-perl was not specified."; exit 1; fi
*** a/src/pl/plperl/expected/plperl.out
--- b/src/pl/plperl/expected/plperl.out
***************
*** 639,641 **** CONTEXT:  PL/Perl anonymous code block
--- 639,651 ----
  DO $do$ use warnings FATAL => qw(void) ; my @y; my $x = sort @y; 1; $do$ LANGUAGE plperl;
  ERROR:  Useless use of sort in scalar context at line 1.
  CONTEXT:  PL/Perl anonymous code block
+ --
+ -- Make sure strings are validated -- This code may fail in a non-UTF8 database
+ -- if it allows null bytes in strings.
+ --
+ CREATE OR REPLACE FUNCTION perl_zerob() RETURNS TEXT AS $$
+   return "abcd\0efg";
+ $$ LANGUAGE plperlu;
+ SELECT perl_zerob();
+ ERROR:  invalid byte sequence for encoding "UTF8": 0x00
+ CONTEXT:  PL/Perl function "perl_zerob"
*** a/src/pl/plperl/plperl_helpers.h
--- b/src/pl/plperl/plperl_helpers.h
***************
*** 7,16 ****
  static inline char *
  utf_u2e(const char *utf8_str, size_t len)
  {
! 	char	   *ret = (char *) pg_do_encoding_conversion((unsigned char *) utf8_str, len, PG_UTF8, GetDatabaseEncoding());
  
  	if (ret == utf8_str)
  		ret = pstrdup(ret);
  	return ret;
  }
  
--- 7,27 ----
  static inline char *
  utf_u2e(const char *utf8_str, size_t len)
  {
! 	int 	    enc = GetDatabaseEncoding();
! 
! 	char	   *ret = (char *) pg_do_encoding_conversion((unsigned char *) utf8_str, len, PG_UTF8, enc);
! 
! 	/*
! 	* when we are a PG_UTF8 or SQL_ASCII database
! 	* pg_do_encoding_conversion() will not do any conversion or
! 	* verification. we need to do it manually instead.
! 	*/
! 	if (enc == PG_UTF8 || enc == PG_SQL_ASCII)
! 		pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);
  
  	if (ret == utf8_str)
  		ret = pstrdup(ret);
+ 
  	return ret;
  }
  
*** a/src/pl/plperl/sql/plperl.sql
--- b/src/pl/plperl/sql/plperl.sql
***************
*** 415,417 **** DO $do$ use strict; my $name = "foo"; my $ref = $$name; $do$ LANGUAGE plperl;
--- 415,426 ----
  -- check that we can "use warnings" (in this case to turn a warn into an error)
  -- yields "ERROR:  Useless use of sort in scalar context."
  DO $do$ use warnings FATAL => qw(void) ; my @y; my $x = sort @y; 1; $do$ LANGUAGE plperl;
+ 
+ --
+ -- Make sure strings are validated -- This code may fail in a non-UTF8 database
+ -- if it allows null bytes in strings.
+ --
+ CREATE OR REPLACE FUNCTION perl_zerob() RETURNS TEXT AS $$
+   return "abcd\0efg";
+ $$ LANGUAGE plperlu;
+ SELECT perl_zerob();
#16Amit Khandekar
amit.khandekar@enterprisedb.com
In reply to: Alex Hunsaker (#15)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On 5 October 2011 12:29, Alex Hunsaker <badalex@gmail.com> wrote:

On Wed, Oct 5, 2011 at 00:30, Alex Hunsaker <badalex@gmail.com> wrote:

On Tue, Oct 4, 2011 at 23:46, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

You mean the final changes in plperl_helpers.h would look like
something like this right? :

 static inline char *
 utf_u2e(const char *utf8_str, size_t len)
 {
       char       *ret = (char *) pg_do_encoding_conversion((unsigned
char *) utf8_str, len, PG_UTF8, GetDatabaseEncoding());

       if (ret == utf8_str)
+       {
+               if (GetDatabaseEncoding() == PG_UTF8 ||
+                       GetDatabaseEncoding() == PG_SQL_ASCII)
+               {
+                       pg_verify_mbstr_len(PG_UTF8, utf8_str, len, false);
+               }
+
               ret = pstrdup(ret);
+       }
       return ret;
 }

Yeah I am ok with that. It's just an additional check besides (ret ==
utf8_str) to know if we really require validation.

Find it attached. [ Note I didn't put the check inside the if (ret ==
utf8_str) as it seemed a bit cleaner (indentation wise) to have it
outside ]

I have no more issues with the patch.
Thanks!
-Amit

Show quoted text
#17Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#16)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On Wed, Oct 5, 2011 at 3:58 AM, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

I have no more issues with the patch.
Thanks!

I think this patch needs to be added to the open CommitFest, with
links to the reviews, and marked Ready for Committer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Alex Hunsaker
badalex@gmail.com
In reply to: Robert Haas (#17)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On Wed, Oct 5, 2011 at 08:18, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Oct 5, 2011 at 3:58 AM, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

I have no more issues with the patch.
Thanks!

I think this patch needs to be added to the open CommitFest, with
links to the reviews, and marked Ready for Committer.

The open commitfest? Even if its an "important" bug fix that should be
backpatched?

#19Robert Haas
robertmhaas@gmail.com
In reply to: Alex Hunsaker (#18)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On Wed, Oct 5, 2011 at 5:03 PM, Alex Hunsaker <badalex@gmail.com> wrote:

On Wed, Oct 5, 2011 at 08:18, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Oct 5, 2011 at 3:58 AM, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

I have no more issues with the patch.
Thanks!

I think this patch needs to be added to the open CommitFest, with
links to the reviews, and marked Ready for Committer.

The open commitfest? Even if its an "important" bug fix that should be
backpatched?

Considering that the issue appears to have been ignored from
mid-February until early October, I don't see why it should now get to
jump to the head of the queue. Other people may have different
opinions, of course.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20David E. Wheeler
david@kineticode.com
In reply to: Robert Haas (#19)
Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On Oct 5, 2011, at 7:36 PM, Robert Haas wrote:

The open commitfest? Even if its an "important" bug fix that should be
backpatched?

Considering that the issue appears to have been ignored from
mid-February until early October, I don't see why it should now get to
jump to the head of the queue. Other people may have different
opinions, of course.

Proper encoding of text data to and from Perl is more important every day, as it's used for more and more uses beyond ASCII. I think it's important to backport such issues modulo behavioral changes.

Best,

David

#21Alex Hunsaker
badalex@gmail.com
In reply to: Robert Haas (#19)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On Wed, Oct 5, 2011 at 20:36, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Oct 5, 2011 at 5:03 PM, Alex Hunsaker <badalex@gmail.com> wrote:

On Wed, Oct 5, 2011 at 08:18, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Oct 5, 2011 at 3:58 AM, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

I have no more issues with the patch.
Thanks!

I think this patch needs to be added to the open CommitFest, with
links to the reviews, and marked Ready for Committer.

The open commitfest? Even if its an "important" bug fix that should be
backpatched?

Considering that the issue appears to have been ignored from
mid-February until early October, I don't see why it should now get to
jump to the head of the queue.  Other people may have different
opinions, of course.

Added. :-)

#22Andrew Dunstan
andrew@dunslane.net
In reply to: Alex Hunsaker (#21)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On 10/07/2011 12:51 PM, Alex Hunsaker wrote:

On Wed, Oct 5, 2011 at 20:36, Robert Haas<robertmhaas@gmail.com> wrote:

On Wed, Oct 5, 2011 at 5:03 PM, Alex Hunsaker<badalex@gmail.com> wrote:

On Wed, Oct 5, 2011 at 08:18, Robert Haas<robertmhaas@gmail.com> wrote:

On Wed, Oct 5, 2011 at 3:58 AM, Amit Khandekar
<amit.khandekar@enterprisedb.com> wrote:

I have no more issues with the patch.
Thanks!

I think this patch needs to be added to the open CommitFest, with
links to the reviews, and marked Ready for Committer.

The open commitfest? Even if its an "important" bug fix that should be
backpatched?

Considering that the issue appears to have been ignored from
mid-February until early October, I don't see why it should now get to
jump to the head of the queue. Other people may have different
opinions, of course.

Added. :-)

I'm just starting to look at this, by way of a break in staring at
pg_dump code ;-). This just needs to be backpatched to 9.1, is that right?

cheers

andrew

#23Alex Hunsaker
badalex@gmail.com
In reply to: Andrew Dunstan (#22)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On Wed, Nov 2, 2011 at 17:12, Andrew Dunstan <andrew@dunslane.net> wrote:

Considering that the issue appears to have been ignored from
mid-February until early October, I don't see why it should now get to
jump to the head of the queue.  Other people may have different
opinions, of course.

Added. :-)

I'm just starting to look at this, by way of a break in staring at pg_dump
code ;-). This just needs to be backpatched to 9.1, is that right?

Yes please.

9.0 did not have this problem (or at least if it did it was a separate issue).

#24Andrew Dunstan
andrew@dunslane.net
In reply to: Amit Khandekar (#16)
Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.

On 10/05/2011 03:58 AM, Amit Khandekar wrote:

On 5 October 2011 12:29, Alex Hunsaker<badalex@gmail.com> wrote:

Find it attached. [ Note I didn't put the check inside the if (ret ==
utf8_str) as it seemed a bit cleaner (indentation wise) to have it
outside ]

I have no more issues with the patch.

Committed.

cheers

andrew