pgsql: Unicode escapes in E'...' strings Author: Marko Kreen

Started by Noname over 16 years ago (12 messages)
#1 Noname
petere@postgresql.org

Log Message:
-----------
Unicode escapes in E'...' strings

Author: Marko Kreen <markokr@gmail.com>

Modified Files:
--------------
pgsql/doc/src/sgml:
syntax.sgml (r1.135 -> r1.136)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/doc/src/sgml/syntax.sgml?r1=1.135&r2=1.136)
pgsql/src/backend/parser:
scan.l (r1.158 -> r1.159)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/parser/scan.l?r1=1.158&r2=1.159)
pgsql/src/include/parser:
gramparse.h (r1.47 -> r1.48)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/include/parser/gramparse.h?r1=1.47&r2=1.48)

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#1)
Re: [COMMITTERS] pgsql: Unicode escapes in E'...' strings Author: Marko Kreen

petere@postgresql.org (Peter Eisentraut) writes:

Log Message:
-----------
Unicode escapes in E'...' strings

Author: Marko Kreen <markokr@gmail.com>

This patch has broken the no-backup property of the scanner, which
is an absolutely unacceptable penalty for such a second-order feature.
Please fix or revert.

Also, it failed to update psql's scanner to match.

regards, tom lane

#3 Marko Kreen
markokr@gmail.com
In reply to: Tom Lane (#2)
Re: [COMMITTERS] pgsql: Unicode escapes in E'...' strings Author: Marko Kreen

On 9/25/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

petere@postgresql.org (Peter Eisentraut) writes:

Log Message:
-----------
Unicode escapes in E'...' strings

Author: Marko Kreen <markokr@gmail.com>

This patch has broken the no-backup property of the scanner, which
is an absolutely unacceptable penalty for such a second-order feature.
Please fix or revert.

How do I find out the state of said property?

Currently I assume it's related to the xeunicodebad pattern?

Will this fix it:

 -xeunicodebad   [\\]([uU])
 +xeunicodebad   [\\](u[0-9A-Fa-f]{0,3}|U[0-9A-Fa-f]{0,7})

?

--
marko
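
As an illustration of the no-backup question above, here is a rough Python model (not flex, and not from the original thread). The regular expressions are transcribed from the patterns quoted in the message; the helper `non_accepting_prefixes` is a hypothetical name. It lists the prefixes of a complete `\uXXXX` escape that no pattern accepts, which is roughly where a scanner would have to back up if the input ended early. Flex's real `-b` analysis works on the DFA of the whole rule set, so this is only a sketch of the idea.

```python
import re

HEX = "[0-9A-Fa-f]"
# Full escape: \uXXXX or \UXXXXXXXX
XEUNICODE = re.compile(r"\\(u" + HEX + r"{4}|U" + HEX + r"{8})")
# Original catch-all from the commit: backslash followed by u or U only
XEUNICODEBAD_OLD = re.compile(r"\\[uU]")
# Proposed replacement: accepts every incomplete escape as well
XEUNICODEFAIL = re.compile(r"\\(u" + HEX + r"{0,3}|U" + HEX + r"{0,7})")


def non_accepting_prefixes(token, patterns):
    """Return proper prefixes of token (length >= 2) that no pattern fully matches.

    Length-1 prefixes (the lone backslash) are skipped as a simplification.
    """
    return [
        token[:n]
        for n in range(2, len(token))
        if not any(p.fullmatch(token[:n]) for p in patterns)
    ]


old = non_accepting_prefixes(r"\u1234", [XEUNICODE, XEUNICODEBAD_OLD])
new = non_accepting_prefixes(r"\u1234", [XEUNICODE, XEUNICODEFAIL])
print(old)  # ['\\u1', '\\u12', '\\u123']
print(new)  # []
```

With the old pattern, the prefixes between `\u` and the full escape are not accepted by anything, so a failed match has to fall back to `\u`. With the `{0,3}` and `{0,7}` form, every intermediate prefix is itself a valid match.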

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Kreen (#3)
Re: [COMMITTERS] pgsql: Unicode escapes in E'...' strings Author: Marko Kreen

Marko Kreen <markokr@gmail.com> writes:

On 9/25/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

This patch has broken the no-backup property of the scanner, which
is an absolutely unacceptable penalty for such a second-order feature.
Please fix or revert.

How do I find out the state of said property?

Per the comment at the head of scan.l, add the -b switch to the flex
call and see what flex says about it.

Currently I assume it's related to the xeunicodebad pattern?

Probably, but I didn't check.

regards, tom lane

#5 Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#2)
Re: [COMMITTERS] pgsql: Unicode escapes in E'...' strings Author: Marko Kreen

On Fri, 2009-09-25 at 15:39 -0400, Tom Lane wrote:

petere@postgresql.org (Peter Eisentraut) writes:

Log Message:
-----------
Unicode escapes in E'...' strings

Author: Marko Kreen <markokr@gmail.com>

This patch has broken the no-backup property of the scanner,

Fixed.

Also, it failed to update psql's scanner to match.

Why does the psql scanner need to know about this? Doesn't it just need
to know the difference between backslash-quote and backslash-something
else?
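
The question above can be made concrete with a small Python sketch (an illustration only, not psql's actual scanner). The function `find_estring_end` is a hypothetical name. It assumes the only job is to find where an `E'...'` literal ends, and it ignores encoding and other details a real lexer must handle. Under that assumption, a backslash just means "skip the next character", so `\u0041` needs no special treatment.

```python
def find_estring_end(text, start):
    """Return the index just past the closing quote of an E'...' literal.

    `start` is the index of the first character after the opening quote.
    Returns -1 if the literal is unterminated.
    """
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # backslash-something: skip both characters
        elif ch == "'":
            if i + 1 < len(text) and text[i + 1] == "'":
                i += 2  # doubled quote inside the literal
            else:
                return i + 1  # closing quote
        else:
            i += 1
    return -1


sql = r"E'a\'b\u0041' ; next"
end = find_estring_end(sql, 2)
print(sql[:end])  # E'a\'b\u0041'
```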

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#5)
Re: [COMMITTERS] pgsql: Unicode escapes in E'...' strings Author: Marko Kreen

Peter Eisentraut <peter_e@gmx.net> writes:

On Fri, 2009-09-25 at 15:39 -0400, Tom Lane wrote:

Also, it failed to update psql's scanner to match.

Why does the psql scanner need to know about this? Doesn't it just need
to know the difference between backslash-quote and backslash-something
else?

Maybe it doesn't "need" to know, but I think it would be disastrous from
a maintenance standpoint to not keep the two sets of flex rules in
strict correspondence. It would soon become unclear whether or how to
apply changes in the backend lexer to psql.

regards, tom lane

#7 Marko Kreen
markokr@gmail.com
In reply to: Tom Lane (#6)
1 attachment(s)
Re: [COMMITTERS] pgsql: Unicode escapes in E'...' strings Author: Marko Kreen

On 9/26/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Maybe it doesn't "need" to know, but I think it would be disastrous from
a maintenance standpoint to not keep the two sets of flex rules in
strict correspondence. It would soon become unclear whether or how to
apply changes in the backend lexer to psql.

Patch attached.

--
marko

Attachments:

psql-unicode.diff (text/x-diff; charset=US-ASCII)
diff --git a/src/bin/psql/psqlscan.l b/src/bin/psql/psqlscan.l
index 7f08da2..cb85658 100644
--- a/src/bin/psql/psqlscan.l
+++ b/src/bin/psql/psqlscan.l
@@ -158,6 +158,7 @@ static void emit(const char *txt, int len);
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
  *  <xus> quoted string with Unicode escapes
+ *  <xeu> Unicode surrogate pair in extended quoted string
  */
 
 %x xb
@@ -169,6 +170,7 @@ static void emit(const char *txt, int len);
 %x xdolq
 %x xui
 %x xus
+%x xeu
 /* Additional exclusive states for psql only: lex backslash commands */
 %x xslashcmd
 %x xslasharg
@@ -253,6 +255,8 @@ xeinside		[^\\']+
 xeescape		[\\][^0-7]
 xeoctesc		[\\][0-7]{1,3}
 xehexesc		[\\]x[0-9A-Fa-f]{1,2}
+xeunicode		[\\](u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})
+xeunicodefail	[\\](u[0-9A-Fa-f]{0,3}|U[0-9A-Fa-f]{0,7})
 
 /* Extended quote
  * xqdouble implements embedded quote, ''''
@@ -511,6 +515,12 @@ other			.
 <xe>{xeinside}  {
 					ECHO;
 				}
+<xe>{xeunicode}	{
+					ECHO;
+				}
+<xe>{xeunicodefail} {
+					ECHO;
+				}
 <xe>{xeescape}  {
 					ECHO;
 				}

#8 Peter Eisentraut
peter_e@gmx.net
In reply to: Marko Kreen (#7)
Re: [COMMITTERS] pgsql: Unicode escapes in E'...' strings Author: Marko Kreen

On Sat, 2009-09-26 at 00:18 +0300, Marko Kreen wrote:

On 9/26/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Maybe it doesn't "need" to know, but I think it would be disastrous from
a maintenance standpoint to not keep the two sets of flex rules in
strict correspondence. It would soon become unclear whether or how to
apply changes in the backend lexer to psql.

Patch attached.

That patch results in the following message from flex:

psqlscan.l:1039: warning, -s option given but default rule can be
matched

#9 Marko Kreen
markokr@gmail.com
In reply to: Peter Eisentraut (#8)
Re: [COMMITTERS] pgsql: Unicode escapes in E'...' strings Author: Marko Kreen

On 9/26/09, Peter Eisentraut <peter_e@gmx.net> wrote:

On Sat, 2009-09-26 at 00:18 +0300, Marko Kreen wrote:

On 9/26/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Maybe it doesn't "need" to know, but I think it would be disastrous from
a maintenance standpoint to not keep the two sets of flex rules in
strict correspondence. It would soon become unclear whether or how to
apply changes in the backend lexer to psql.

Patch attached.

That patch results in the following message from flex:

psqlscan.l:1039: warning, -s option given but default rule can be
matched

Agh. Well, that just means the <xeu> state must be commented out:

 -%x xeu
 +/* %x xeu */

--
marko

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Kreen (#9)
Re: [COMMITTERS] pgsql: Unicode escapes in E'...' strings Author: Marko Kreen

Marko Kreen <markokr@gmail.com> writes:

On 9/26/09, Peter Eisentraut <peter_e@gmx.net> wrote:

That patch results in the following message from flex:

psqlscan.l:1039: warning, -s option given but default rule can be
matched

Agh. Well, that just means the <xeu> state must be commented out:

-%x xeu
+/* %x xeu */

Ick --- that breaks the whole concept of keeping the two sets of
flex rules in sync. And it's quite unclear why it fixes the problem,
too. At the very least, if you do it that way, it needs a comment
explaining exactly why it's different from the backend.

regards, tom lane

#11 Marko Kreen
markokr@gmail.com
In reply to: Tom Lane (#10)
1 attachment(s)
Re: [COMMITTERS] pgsql: Unicode escapes in E'...' strings Author: Marko Kreen

Resend...

On 9/26/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Marko Kreen <markokr@gmail.com> writes:

On 9/26/09, Peter Eisentraut <peter_e@gmx.net> wrote:

That patch results in the following message from flex:

psqlscan.l:1039: warning, -s option given but default rule can be
matched

Agh. Well, that just means the <xeu> state must be commented out:

-%x xeu
+/* %x xeu */

Ick --- that breaks the whole concept of keeping the two sets of
flex rules in sync. And it's quite unclear why it fixes the problem,
too. At the very least, if you do it that way, it needs a comment
explaining exactly why it's different from the backend.

Commenting it out fixes the problem because I copy-pasted the state
declaration without any rules in it.

Anyway, I have now attached a patch in which I filled in the <xeu> section,
but without referring to it from anywhere. The rules themselves are now
equal. Is that OK?

--
marko
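
To illustrate why a start condition declared without rules triggers the "default rule can be matched" warning, here is a rough Python model (an illustration only; flex's own analysis is exact and works differently). The helper `default_rule_reachable` and the 7-bit sample alphabet are hypothetical. The check is conservative: it reports True if some single character is matched by no rule in the state, and it does not model `<<EOF>>`.

```python
import re


def default_rule_reachable(rules, alphabet):
    """Conservative check: True if some single character is matched by no rule."""
    compiled = [re.compile(r) for r in rules]
    return any(
        not any(p.fullmatch(ch) for p in compiled)
        for ch in alphabet
    )


ALPHABET = [chr(c) for c in range(128)]
HEX = "[0-9A-Fa-f]"

# <xeu> as first pasted into psqlscan.l: declared, but with no rules
empty_state = []
# <xeu> as filled in by the second patch (<<EOF>> omitted in this model)
filled_state = [
    r"\\(u" + HEX + r"{4}|U" + HEX + r"{8})",      # xeunicode
    r".",                                          # any character except newline
    r"\n",                                         # newline
    r"\\(u" + HEX + r"{0,3}|U" + HEX + r"{0,7})",  # xeunicodefail
]

print(default_rule_reachable(empty_state, ALPHABET))   # True
print(default_rule_reachable(filled_state, ALPHABET))  # False
```

In this model the empty state lets every character fall through to the default rule, while the `.` and `\n` rules together cover every character in the filled state.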

Attachments:

psql-unicode2.diff (text/x-diff; charset=US-ASCII)
diff --git a/src/bin/psql/psqlscan.l b/src/bin/psql/psqlscan.l
index 7f08da2..8309577 100644
--- a/src/bin/psql/psqlscan.l
+++ b/src/bin/psql/psqlscan.l
@@ -158,6 +158,7 @@ static void emit(const char *txt, int len);
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
  *  <xus> quoted string with Unicode escapes
+ *  <xeu> Unicode surrogate pair in extended quoted string.
  */
 
 %x xb
@@ -169,6 +170,7 @@ static void emit(const char *txt, int len);
 %x xdolq
 %x xui
 %x xus
+%x xeu
 /* Additional exclusive states for psql only: lex backslash commands */
 %x xslashcmd
 %x xslasharg
@@ -253,6 +255,8 @@ xeinside		[^\\']+
 xeescape		[\\][^0-7]
 xeoctesc		[\\][0-7]{1,3}
 xehexesc		[\\]x[0-9A-Fa-f]{1,2}
+xeunicode		[\\](u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})
+xeunicodefail	[\\](u[0-9A-Fa-f]{0,3}|U[0-9A-Fa-f]{0,7})
 
 /* Extended quote
  * xqdouble implements embedded quote, ''''
@@ -511,6 +515,20 @@ other			.
 <xe>{xeinside}  {
 					ECHO;
 				}
+<xe>{xeunicode}	{
+					ECHO;
+				}
+<xeu>{xeunicode} {
+					ECHO;
+				}
+<xeu>.			|
+<xeu>\n			|
+<xeu><<EOF>>	{
+					ECHO;
+				}
+<xe,xeu>{xeunicodefail} {
+					ECHO;
+				}
 <xe>{xeescape}  {
 					ECHO;
 				}

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Kreen (#11)
Re: [COMMITTERS] pgsql: Unicode escapes in E'...' strings Author: Marko Kreen

Marko Kreen <markokr@gmail.com> writes:

Anyway, I have now attached a patch in which I filled in the <xeu> section,
but without referring to it from anywhere. The rules themselves are now
equal. Is that OK?

Well, you also have to track the state changes (BEGIN).

In comparing the scanners I realized I'd forgotten to sync psql myself
when I was fooling around with the plpgsql scanner :-(. So mea culpa
as well ...

Fixed and applied.

regards, tom lane