pgsql: Unicode escapes in E'...' strings Author: Marko Kreen
Log Message:
-----------
Unicode escapes in E'...' strings
Author: Marko Kreen <markokr@gmail.com>
Modified Files:
--------------
pgsql/doc/src/sgml:
syntax.sgml (r1.135 -> r1.136)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/doc/src/sgml/syntax.sgml?r1=1.135&r2=1.136)
pgsql/src/backend/parser:
scan.l (r1.158 -> r1.159)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/parser/scan.l?r1=1.158&r2=1.159)
pgsql/src/include/parser:
gramparse.h (r1.47 -> r1.48)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/include/parser/gramparse.h?r1=1.47&r2=1.48)
petere@postgresql.org (Peter Eisentraut) writes:
Log Message:
-----------
Unicode escapes in E'...' strings
Author: Marko Kreen <markokr@gmail.com>
This patch has broken the no-backup property of the scanner, which
is an absolutely unacceptable penalty for such a second-order feature.
Please fix or revert.
Also, it failed to update psql's scanner to match.
regards, tom lane
On 9/25/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
petere@postgresql.org (Peter Eisentraut) writes:
Log Message:
-----------
Unicode escapes in E'...' strings
Author: Marko Kreen <markokr@gmail.com>
This patch has broken the no-backup property of the scanner, which
is an absolutely unacceptable penalty for such a second-order feature.
Please fix or revert.
How do I find out the state of said property?
Currently I assume it's related to the xeunicodebad pattern?
Will this fix it:
-xeunicodebad [\\]([uU])
+xeunicodebad [\\](u[0-9A-Fa-f]{0,3}|U[0-9A-Fa-f]{0,7})
?
--
marko
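
A rough way to see what the proposed pattern change buys: the scanner only avoids backing up if every partial escape it can stop on is itself accepted by some rule. The sketch below checks that with Python's `re` module instead of flex. It is a simplification (flex reasons about DFA states, not prefixes), and the names `OLD_BAD`, `NEW_BAD` and `uncovered_prefixes` are made up for illustration.

```python
import re

HEX = "[0-9A-Fa-f]"

# Full escape, as in the xeunicode definition used later in this thread.
XEUNICODE = re.compile(r"\\(u" + HEX + r"{4}|U" + HEX + r"{8})")

# The two candidate "bad escape" patterns from the message above.
OLD_BAD = re.compile(r"\\[uU]")
NEW_BAD = re.compile(r"\\(u" + HEX + r"{0,3}|U" + HEX + r"{0,7})")


def uncovered_prefixes(fail_pattern, escape):
    """List the proper prefixes of a valid escape that fail_pattern rejects.

    Prefixes start at length 2 (backslash plus u/U); a lone backslash is
    assumed to be handled by other rules in the scanner.
    """
    if XEUNICODE.fullmatch(escape) is None:
        raise ValueError("not a complete Unicode escape")
    return [
        escape[:n]
        for n in range(2, len(escape))
        if fail_pattern.fullmatch(escape[:n]) is None
    ]


print(uncovered_prefixes(OLD_BAD, r"\u00e9"))  # non-empty: would need backup
print(uncovered_prefixes(NEW_BAD, r"\u00e9"))  # empty: every prefix accepted
```

The authoritative check remains flex itself, as described in the reply that follows.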
Marko Kreen <markokr@gmail.com> writes:
On 9/25/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This patch has broken the no-backup property of the scanner, which
is an absolutely unacceptable penalty for such a second-order feature.
Please fix or revert.
How do I find out the state of said property?
Per the comment at the head of scan.l, add the -b switch to the flex
call and see what flex says about it.
Currently I assume it's related to the xeunicodebad pattern?
Probably, but I didn't check.
regards, tom lane
On Fri, 2009-09-25 at 15:39 -0400, Tom Lane wrote:
petere@postgresql.org (Peter Eisentraut) writes:
Log Message:
-----------
Unicode escapes in E'...' strings
Author: Marko Kreen <markokr@gmail.com>
This patch has broken the no-backup property of the scanner,
Fixed.
Also, it failed to update psql's scanner to match.
Why does the psql scanner need to know about this? Doesn't it just need
to know the difference between backslash-quote and backslash-something
else?
Peter Eisentraut <peter_e@gmx.net> writes:
On Fri, 2009-09-25 at 15:39 -0400, Tom Lane wrote:
Also, it failed to update psql's scanner to match.
Why does the psql scanner need to know about this? Doesn't it just need
to know the difference between backslash-quote and backslash-something
else?
Maybe it doesn't "need" to know, but I think it would be disastrous from
a maintenance standpoint to not keep the two sets of flex rules in
strict correspondence. It would soon become unclear whether or how to
apply changes in the backend lexer to psql.
regards, tom lane
On 9/26/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Maybe it doesn't "need" to know, but I think it would be disastrous from
a maintenance standpoint to not keep the two sets of flex rules in
strict correspondence. It would soon become unclear whether or how to
apply changes in the backend lexer to psql.
Patch attached.
--
marko
Attachments:
psql-unicode.diff (text/x-diff; charset=US-ASCII)
diff --git a/src/bin/psql/psqlscan.l b/src/bin/psql/psqlscan.l
index 7f08da2..cb85658 100644
--- a/src/bin/psql/psqlscan.l
+++ b/src/bin/psql/psqlscan.l
@@ -158,6 +158,7 @@ static void emit(const char *txt, int len);
* <xdolq> $foo$ quoted strings
* <xui> quoted identifier with Unicode escapes
* <xus> quoted string with Unicode escapes
+ * <xeu> Unicode surrogate pair in extended quoted string
*/
%x xb
@@ -169,6 +170,7 @@ static void emit(const char *txt, int len);
%x xdolq
%x xui
%x xus
+%x xeu
/* Additional exclusive states for psql only: lex backslash commands */
%x xslashcmd
%x xslasharg
@@ -253,6 +255,8 @@ xeinside [^\\']+
xeescape [\\][^0-7]
xeoctesc [\\][0-7]{1,3}
xehexesc [\\]x[0-9A-Fa-f]{1,2}
+xeunicode [\\](u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})
+xeunicodefail [\\](u[0-9A-Fa-f]{0,3}|U[0-9A-Fa-f]{0,7})
/* Extended quote
* xqdouble implements embedded quote, ''''
@@ -511,6 +515,12 @@ other .
<xe>{xeinside} {
ECHO;
}
+<xe>{xeunicode} {
+ ECHO;
+ }
+<xe>{xeunicodefail} {
+ ECHO;
+ }
<xe>{xeescape} {
ECHO;
}
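
To make the interaction between the new and existing escape patterns easier to follow, here is a small sketch of flex's selection rule (longest match wins, ties go to the rule listed first), emulated with Python's `re` module. The pattern bodies are copied from the diff above; the relative order of xeoctesc and xehexesc and the helper name `pick_rule` are assumptions for illustration only.

```python
import re

HEX = "[0-9A-Fa-f]"

# Patterns in the order the rules are assumed to appear in the <xe> state.
RULES = [
    ("xeunicode", re.compile(r"\\(u" + HEX + r"{4}|U" + HEX + r"{8})")),
    ("xeunicodefail", re.compile(r"\\(u" + HEX + r"{0,3}|U" + HEX + r"{0,7})")),
    ("xeescape", re.compile(r"\\[^0-7]")),
    ("xeoctesc", re.compile(r"\\[0-7]{1,3}")),
    ("xehexesc", re.compile(r"\\x" + HEX + r"{1,2}")),
]


def pick_rule(text):
    """Return (rule name, matched text) for the start of text, or None."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text)
        # Strictly longer matches replace the current best, so on a tie
        # the earlier rule is kept.
        if m and (best is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))
    return best


for sample in (r"\u00e9'", r"\u00zz", r"\n", r"\x41", r"\101"):
    print(sample, pick_rule(sample))
```

Since every action in the psql patch is just ECHO, the choice of rule does not change psql's output; it matters only for keeping the rule set parallel to the backend's.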
On Sat, 2009-09-26 at 00:18 +0300, Marko Kreen wrote:
On 9/26/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Maybe it doesn't "need" to know, but I think it would be disastrous from
a maintenance standpoint to not keep the two sets of flex rules in
strict correspondence. It would soon become unclear whether or how to
apply changes in the backend lexer to psql.
Patch attached.
That patch results in the following message from flex:
psqlscan.l:1039: warning, -s option given but default rule can be
matched
On 9/26/09, Peter Eisentraut <peter_e@gmx.net> wrote:
On Sat, 2009-09-26 at 00:18 +0300, Marko Kreen wrote:
On 9/26/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Maybe it doesn't "need" to know, but I think it would be disastrous from
a maintenance standpoint to not keep the two sets of flex rules in
strict correspondence. It would soon become unclear whether or how to
apply changes in the backend lexer to psql.
Patch attached.
That patch results in the following message from flex:
psqlscan.l:1039: warning, -s option given but default rule can be
matched
Agh. Well, that just means the <xeu> state must be commented out:
-%x xeu
+/* %x xeu */
--
marko
Marko Kreen <markokr@gmail.com> writes:
On 9/26/09, Peter Eisentraut <peter_e@gmx.net> wrote:
That patch results in the following message from flex:
psqlscan.l:1039: warning, -s option given but default rule can be
matched
Agh. Well, that just means the <xeu> state must be commented out:
-%x xeu
+/* %x xeu */
Ick --- that breaks the whole concept of keeping the two sets of
flex rules in sync. And it's quite unclear why it fixes the problem,
too. At the very least, if you do it that way, it needs a comment
explaining exactly why it's different from the backend.
regards, tom lane
Resend...
On 9/26/09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Marko Kreen <markokr@gmail.com> writes:
On 9/26/09, Peter Eisentraut <peter_e@gmx.net> wrote:
That patch results in the following message from flex:
psqlscan.l:1039: warning, -s option given but default rule can be
matched
Agh. Well, that just means the <xeu> state must be commented out:
-%x xeu
+/* %x xeu */
Ick --- that breaks the whole concept of keeping the two sets of
flex rules in sync. And it's quite unclear why it fixes the problem,
too. At the very least, if you do it that way, it needs a comment
explaining exactly why it's different from the backend.
Commenting it out fixes the problem because I copy-pasted the state
declaration without any rules in it.
Anyway, I have now attached a patch in which I filled in the section, but
without referring to it from anywhere. The rules themselves are now equal.
Is that OK?
--
marko
Attachments:
psql-unicode2.diff (text/x-diff; charset=US-ASCII)
diff --git a/src/bin/psql/psqlscan.l b/src/bin/psql/psqlscan.l
index 7f08da2..8309577 100644
--- a/src/bin/psql/psqlscan.l
+++ b/src/bin/psql/psqlscan.l
@@ -158,6 +158,7 @@ static void emit(const char *txt, int len);
* <xdolq> $foo$ quoted strings
* <xui> quoted identifier with Unicode escapes
* <xus> quoted string with Unicode escapes
+ * <xeu> Unicode surrogate pair in extended quoted string.
*/
%x xb
@@ -169,6 +170,7 @@ static void emit(const char *txt, int len);
%x xdolq
%x xui
%x xus
+%x xeu
/* Additional exclusive states for psql only: lex backslash commands */
%x xslashcmd
%x xslasharg
@@ -253,6 +255,8 @@ xeinside [^\\']+
xeescape [\\][^0-7]
xeoctesc [\\][0-7]{1,3}
xehexesc [\\]x[0-9A-Fa-f]{1,2}
+xeunicode [\\](u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})
+xeunicodefail [\\](u[0-9A-Fa-f]{0,3}|U[0-9A-Fa-f]{0,7})
/* Extended quote
* xqdouble implements embedded quote, ''''
@@ -511,6 +515,20 @@ other .
<xe>{xeinside} {
ECHO;
}
+<xe>{xeunicode} {
+ ECHO;
+ }
+<xeu>{xeunicode} {
+ ECHO;
+ }
+<xeu>. |
+<xeu>\n |
+<xeu><<EOF>> {
+ ECHO;
+ }
+<xe,xeu>{xeunicodefail} {
+ ECHO;
+ }
<xe>{xeescape} {
ECHO;
}
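
The flex warning quoted earlier in the thread ("-s option given but default rule can be matched") comes down to this: an exclusive start condition with no rules leaves every input character to the default rule. The toy model below, which is not how flex computes it, checks each declared state against a small sample alphabet. The function name, the dictionary layout and the alphabet are hypothetical; the <xeu> patterns are taken from the diff above.

```python
import re

HEX = "[0-9A-Fa-f]"
XEUNICODE = r"\\(u" + HEX + r"{4}|U" + HEX + r"{8})"
XEUNICODEFAIL = r"\\(u" + HEX + r"{0,3}|U" + HEX + r"{0,7})"


def states_reaching_default(rules_by_state, alphabet):
    """Return the states in which some character matches no rule at all."""
    exposed = []
    for state, patterns in rules_by_state.items():
        compiled = [re.compile(p) for p in patterns]
        for ch in alphabet:
            if not any(c.match(ch) for c in compiled):
                exposed.append(state)
                break
    return exposed


ALPHABET = ["a", "'", "\\", "\n"]

# First patch: <xeu> declared, but no rules attached to it.
first_patch = {"xeu": []}

# Second patch: <xeu> has xeunicode, ".", "\n" and xeunicodefail rules
# (the <<EOF>> rule is not modelled here).
second_patch = {"xeu": [XEUNICODE, ".", "\n", XEUNICODEFAIL]}

print(states_reaching_default(first_patch, ALPHABET))   # xeu is exposed
print(states_reaching_default(second_patch, ALPHABET))  # nothing exposed
```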
Marko Kreen <markokr@gmail.com> writes:
Anyway, I have now attached a patch in which I filled in the section, but
without referring to it from anywhere. The rules themselves are now equal.
Is that OK?
Well, you also have to track the state changes (BEGIN).
In comparing the scanners I realized I'd forgotten to sync psql myself
when I was fooling around with the plpgsql scanner :-(. So mea culpa
as well ...
Fixed and applied.
regards, tom lane
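
For readers wondering what "tracking the state changes" means for <xeu>: per the state comment in the patch, <xeu> covers a Unicode surrogate pair inside an extended quoted string. A minimal sketch of that kind of state tracking follows, written in Python instead of flex. It assumes the usual UTF-16 surrogate ranges and combination formula; the function name, error messages and control flow are illustrative and are not taken from scan.l. Other backslash escapes are simply copied through.

```python
import re

XEUNICODE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))")


def is_surrogate_first(c):
    return 0xD800 <= c <= 0xDBFF


def is_surrogate_second(c):
    return 0xDC00 <= c <= 0xDFFF


def decode_unicode_escapes(body):
    """Decode \\uXXXX and \\UXXXXXXXX escapes in the body of an E'...' string."""
    state = "xe"          # "xeu" means: first half of a pair seen
    first = 0
    out = []
    pos = 0
    while pos < len(body):
        m = XEUNICODE.match(body, pos)
        if state == "xeu":
            # Only a second surrogate is acceptable here.
            if m is None:
                raise ValueError("invalid Unicode surrogate pair")
            c = int(m.group(1) or m.group(2), 16)
            if not is_surrogate_second(c):
                raise ValueError("invalid Unicode surrogate pair")
            out.append(chr(0x10000 + ((first & 0x3FF) << 10) + (c & 0x3FF)))
            state = "xe"  # analogous to BEGIN(xe)
            pos = m.end()
        elif m is not None:
            c = int(m.group(1) or m.group(2), 16)
            if is_surrogate_first(c):
                first = c
                state = "xeu"  # analogous to BEGIN(xeu)
            elif is_surrogate_second(c):
                raise ValueError("invalid Unicode surrogate pair")
            else:
                out.append(chr(c))
            pos = m.end()
        else:
            out.append(body[pos])
            pos += 1
    if state == "xeu":
        raise ValueError("invalid Unicode surrogate pair")
    return "".join(out)


print(decode_unicode_escapes(r"caf\u00e9 \uD83D\uDE00"))
```

psql only echoes the text, so its scanner would need the state switches but not the decoding.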