benchmarking Flex practices

Started by John Naylor over 6 years ago, 30 messages

john.naylor@2ndquadrant.com

over 6 years ago

1 attachment(s)

I decided to do some experiments with how we use Flex. The main
takeaway is that backtracking, which we removed in 2005, doesn't seem
to matter anymore for the core scanner. Also, state table size is of
marginal importance.

Using the information_schema Flex+Bison microbenchmark from Tom [1], I
tested removing most of the "fail" rules designed to avoid
backtracking ("decimalfail" is needed by PL/pgSQL). Below are the best
times (most runs within 1%), followed by postgres binary size. The
numbers are with Flex 2.5.35 on macOS, no asserts or debugging
symbols.

HEAD:
1.53s
7139132 bytes

HEAD minus "fail" rules (patch attached):
1.53s
6971204 bytes

Surprisingly, it has the same performance and a much smaller binary.
The size difference is because the size of the elements of the
yy_transition array is constrained by the number of elements in the
array. Since there are now fewer than INT16_MAX state transitions, the
struct members go from 32 bit:

struct yy_trans_info
{
flex_int32_t yy_verify;
flex_int32_t yy_nxt;
};
static yyconst struct yy_trans_info yy_transition[37045] = ...

to 16 bit:

struct yy_trans_info
{
flex_int16_t yy_verify;
flex_int16_t yy_nxt;
};
static yyconst struct yy_trans_info yy_transition[31763] = ...
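
As a rough cross-check of the sizes involved, here is a minimal C sketch of the arithmetic. It assumes the member width flips at INT16_MAX elements, as described above, and that the struct has no padding; the function names are made up for illustration and are not part of Flex or scan.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Member width (in bits) this sketch assumes Flex picks for a
 * yy_transition array with the given number of elements.
 * Assumption: 16-bit members are used below INT16_MAX elements.
 */
int
member_bits(size_t nelems)
{
    return nelems < (size_t) INT16_MAX ? 16 : 32;
}

/*
 * Approximate size of the array in bytes: two members per element
 * (yy_verify and yy_nxt), with no struct padding assumed.
 */
size_t
transition_table_bytes(size_t nelems)
{
    assert(nelems > 0);
    return nelems * 2 * (size_t) (member_bits(nelems) / 8);
}
```

Under these assumptions, transition_table_bytes(37045) is 296360 and transition_table_bytes(31763) is 127052, a difference of 169308 bytes, which is in the same ballpark as the 167928-byte difference between the two binaries.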

To test if array size was the deciding factor, I tried bloating it by
essentially undoing commit a5ff502fcea. Doing so produced an array
with 62583 elements and 32-bit members, so nearly quadruple in size,
and it was still not much slower than HEAD:

HEAD minus "fail" rules, minus %xusend/%xuiend:
1.56s
7343932 bytes

While at it, I repeated the benchmark with different Flex flags:

HEAD, plus -Cf:
1.60s
6995788 bytes

HEAD, minus "fail" rules, plus -Cf:
1.59s
6979396 bytes

HEAD, plus -Cfe:
1.65s
6868804 bytes

So the Flex manual's recommendation of -CF still holds true. It's
worth noting that using perfect hashing for keyword lookup (20%
faster) had a much bigger effect than switching from -Cfe to -CF (7%
faster).

It would be nice to have confirmation to make sure I didn't err
somewhere, and to try a more real-world benchmark. (Also for the
moment I only have Linux on a virtual machine.) The regression tests
pass, but some comments are now wrong. If it's confirmed that
backtracking doesn't matter for recent Flex/hardware, disregarding it
would make maintenance of our scanners a bit easier.

[1]: /messages/by-id/14616.1558560331@sss.pgh.pa.us

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

remove-scanner-fail-rules.patch (application/octet-stream)

diff --git a/src/backend/parser/Makefile b/src/backend/parser/Makefile
index f14febdbda..3a2459cb72 100644
--- a/src/backend/parser/Makefile
+++ b/src/backend/parser/Makefile
@@ -40,7 +40,6 @@ gram.c: BISON_CHECK_CMD = $(PERL) $(srcdir)/check_keywords.pl $< $(top_srcdir)/s
 
 
 scan.c: FLEXFLAGS = -CF -p -p
-scan.c: FLEX_NO_BACKUP=yes
 scan.c: FLEX_FIX_WARNING=yes
 
 
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index e1cae859e8..13ec8daf9c 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -241,9 +241,7 @@ whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
  * beyond the quote proper.
  */
 quote			'
-quotestop		{quote}{whitespace}*
 quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -292,7 +290,6 @@ xqinside		[^']+
 dolq_start		[A-Za-z\200-\377_]
 dolq_cont		[A-Za-z\200-\377_0-9]
 dolqdelim		\$({dolq_start}{dolq_cont}*)?\$
-dolqfailed		\${dolq_start}{dolq_cont}*
 dolqinside		[^$]+
 
 /* Double quote
@@ -306,8 +303,6 @@ xdinside		[^"]+
 
 /* Unicode escapes */
 uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
-/* error rule to avoid backup */
-uescapefail		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
 
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
@@ -315,13 +310,6 @@ xuistart		[uU]&{dquote}
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
 
-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1		{uescapefail}?
-xustop2		{uescape}
-
-/* error rule to avoid backup */
-xufailed		[uU]&
-
 
 /* C-style comments
  *
@@ -398,8 +386,6 @@ integer			{digit}+
 decimal			(({digit}*\.{digit}+)|({digit}+\.{digit}*))
 decimalfail		{digit}+\.\.
 real			({integer}|{decimal})[Ee][-+]?{digit}+
-realfail1		({integer}|{decimal})[Ee]
-realfail2		({integer}|{decimal})[Ee][-+]
 
 param			\${integer}
 
@@ -476,9 +462,7 @@ other			.
 					startlit();
 					addlitchar('b', yyscanner);
 				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
+<xb>{quote}		{
 					BEGIN(INITIAL);
 					yylval->str = litbufdup(yyscanner);
 					return BCONST;
@@ -505,9 +489,7 @@ other			.
 					startlit();
 					addlitchar('x', yyscanner);
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
+<xh>{quote}		{
 					BEGIN(INITIAL);
 					yylval->str = litbufdup(yyscanner);
 					return XCONST;
@@ -568,9 +550,7 @@ other			.
 					BEGIN(xus);
 					startlit();
 				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
+<xq,xe>{quote}	{
 					BEGIN(INITIAL);
 					/*
 					 * check that the data remains valid if it might have been
@@ -583,26 +563,21 @@ other			.
 					yylval->str = litbufdup(yyscanner);
 					return SCONST;
 				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
+<xus>{quote}	{
 					/* xusend state looks for possible UESCAPE */
 					BEGIN(xusend);
 				}
 <xusend>{whitespace} {
 					/* stay in xusend state over whitespace */
 				}
-<xusend><<EOF>> |
-<xusend>{other} |
-<xusend>{xustop1} {
+<xusend>{other} {
 					/* no UESCAPE after the quote, throw back everything */
 					yyless(0);
 					BEGIN(INITIAL);
 					yylval->str = litbuf_udeescape('\\', yyscanner);
 					return SCONST;
 				}
-<xusend>{xustop2} {
+<xusend>{uescape} {
 					/* found UESCAPE after the end quote */
 					BEGIN(INITIAL);
 					if (!check_uescapechar(yytext[yyleng - 2]))
@@ -708,13 +683,6 @@ other			.
 					BEGIN(xdolq);
 					startlit();
 				}
-{dolqfailed}	{
-					SET_YYLLOC();
-					/* throw back all but the initial "$" */
-					yyless(1);
-					/* and treat it as {other} */
-					return yytext[0];
-				}
 <xdolq>{dolqdelim} {
 					if (strcmp(yytext, yyextra->dolqstart) == 0)
 					{
@@ -738,9 +706,6 @@ other			.
 <xdolq>{dolqinside} {
 					addlit(yytext, yyleng, yyscanner);
 				}
-<xdolq>{dolqfailed} {
-					addlit(yytext, yyleng, yyscanner);
-				}
 <xdolq>.		{
 					/* This is only needed for $ inside the quoted text */
 					addlitchar(yytext[0], yyscanner);
@@ -770,16 +735,13 @@ other			.
 					return IDENT;
 				}
 <xui>{dquote} {
-					yyless(1);
 					/* xuiend state looks for possible UESCAPE */
 					BEGIN(xuiend);
 				}
 <xuiend>{whitespace} {
 					/* stay in xuiend state over whitespace */
 				}
-<xuiend><<EOF>> |
-<xuiend>{other} |
-<xuiend>{xustop1} {
+<xuiend>{other} {
 					/* no UESCAPE after the quote, throw back everything */
 					char	   *ident;
 					int			identlen;
@@ -796,7 +758,7 @@ other			.
 					yylval->str = ident;
 					return IDENT;
 				}
-<xuiend>{xustop2}	{
+<xuiend>{uescape} {
 					/* found UESCAPE after the end quote */
 					char	   *ident;
 					int			identlen;
@@ -825,18 +787,6 @@ other			.
 				}
 <xd,xui><<EOF>>		{ yyerror("unterminated quoted identifier"); }
 
-{xufailed}	{
-					char	   *ident;
-
-					SET_YYLLOC();
-					/* throw back all but the initial u/U */
-					yyless(1);
-					/* and treat it as {identifier} */
-					ident = downcase_truncate_identifier(yytext, yyleng, true);
-					yylval->str = ident;
-					return IDENT;
-				}
-
 {typecast}		{
 					SET_YYLLOC();
 					return TYPECAST;
@@ -1018,23 +968,6 @@ other			.
 					yylval->str = pstrdup(yytext);
 					return FCONST;
 				}
-{realfail1}		{
-					/*
-					 * throw back the [Ee], and figure out whether what
-					 * remains is an {integer} or {decimal}.
-					 */
-					yyless(yyleng - 1);
-					SET_YYLLOC();
-					return process_integer_literal(yytext, yylval);
-				}
-{realfail2}		{
-					/* throw back the [Ee][+-], and proceed as above */
-					yyless(yyleng - 2);
-					SET_YYLLOC();
-					return process_integer_literal(yytext, yylval);
-				}
-
-
 {identifier}	{
 					int			kwnum;
 					char	   *ident;

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: John Naylor (#1)

Re: benchmarking Flex practices

John Naylor <john.naylor@2ndquadrant.com> writes:

I decided to do some experiments with how we use Flex. The main
takeaway is that backtracking, which we removed in 2005, doesn't seem
to matter anymore for the core scanner. Also, state table size is of
marginal importance.

Huh. That's really interesting, because removing backtracking was a
demonstrable, significant win when we did it [1]. I wonder what has
changed? I'd be prepared to believe that today's machines are more
sensitive to the amount of cache space eaten by the tables --- but that
idea seems contradicted by your result that the table size isn't
important. (I'm wishing I'd documented the test case I used in 2005...)

The size difference is because the size of the elements of the
yy_transition array is constrained by the number of elements in the
array. Since there are now fewer than INT16_MAX state transitions, the
struct members go from 32 bit:
static yyconst struct yy_trans_info yy_transition[37045] = ...
to 16 bit:
static yyconst struct yy_trans_info yy_transition[31763] = ...

Hm. Smaller binary is definitely nice, but 31763 is close enough to
32768 that I'd have little faith in the optimization surviving for long.
Is there any way we could buy back some more transitions?

It would be nice to have confirmation to make sure I didn't err
somewhere, and to try a more real-world benchmark.

I don't see much wrong with using information_schema.sql as a parser/lexer
benchmark case. We should try to confirm the results on other platforms
though.

regards, tom lane

[1]: /messages/by-id/8652.1116865895@sss.pgh.pa.us

Andres Freund

andres@anarazel.de

over 6 years ago

In reply to: Tom Lane (#2)

Re: benchmarking Flex practices

Hi,

On 2019-06-20 10:52:54 -0400, Tom Lane wrote:

John Naylor <john.naylor@2ndquadrant.com> writes:

It would be nice to have confirmation to make sure I didn't err
somewhere, and to try a more real-world benchmark.

I don't see much wrong with using information_schema.sql as a parser/lexer
benchmark case. We should try to confirm the results on other platforms
though.

Might be worth also testing with a more repetitive testcase to measure
both cache locality and branch prediction. I assume that with
information_schema there's enough variability that these effects play a
smaller role. And there are plenty of real-world cases where there's a *lot*
of very similar statements being parsed over and over. I'd probably just
measure the statements pgbench generates or such.

Greetings,

Andres Freund

John Naylor

john.naylor@2ndquadrant.com

over 6 years ago

In reply to: Tom Lane (#2)

Re: benchmarking Flex practices

On Fri, Jun 21, 2019 at 12:02 AM Andres Freund <andres@anarazel.de> wrote:

Might be worth also testing with a more repetitive testcase to measure
both cache locality and branch prediction. I assume that with
information_schema there's enough variability that these effects play a
smaller role. And there are plenty of real-world cases where there's a *lot*
of very similar statements being parsed over and over. I'd probably just
measure the statements pgbench generates or such.

I tried benchmarking with a query string with just

BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = 1;
SELECT abalance FROM pgbench_accounts WHERE aid = 1;
UPDATE pgbench_tellers SET tbalance = tbalance + 1 WHERE tid = 1;
UPDATE pgbench_branches SET bbalance = bbalance + 1 WHERE bid = 1;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (1,
1, 1, 1, CURRENT_TIMESTAMP);
END;

repeated about 500 times. With this, backtracking is about 3% slower:

HEAD:
1.15s

patch:
1.19s

patch + huge array:
1.19s

That's possibly significant enough to be evidence for your assumption,
as well as to persuade us to keep things as they are.
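
For reference, the "about 3%" figure can be reproduced from the two timings with a small helper. This is only a sketch of the arithmetic; percent_slower is a made-up name, not something from the benchmark harness.

```c
#include <assert.h>

/*
 * Relative slowdown of candidate_s versus baseline_s, in percent.
 * Both arguments are wall-clock times in seconds.
 */
double
percent_slower(double baseline_s, double candidate_s)
{
    assert(baseline_s > 0.0);
    return (candidate_s - baseline_s) / baseline_s * 100.0;
}
```

With the numbers above, percent_slower(1.15, 1.19) comes out at roughly 3.5.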

On Thu, Jun 20, 2019 at 10:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Huh. That's really interesting, because removing backtracking was a
demonstrable, significant win when we did it [1]. I wonder what has
changed? I'd be prepared to believe that today's machines are more
sensitive to the amount of cache space eaten by the tables --- but that
idea seems contradicted by your result that the table size isn't
important. (I'm wishing I'd documented the test case I used in 2005...)

It's possible the code used with backtracking is better predicted than
15 years ago, but my uneducated hunch is our Bison grammar has gotten
much worse in cache misses and branch prediction than the scanner has
in 15 years. That, plus the recent keyword lookup optimization might
have caused parsing to be completely dominated by Bison. If that's the
case, the 3% slowdown above could be a significant portion of scanning
in isolation.

Hm. Smaller binary is definitely nice, but 31763 is close enough to
32768 that I'd have little faith in the optimization surviving for long.
Is there any way we could buy back some more transitions?

I tried quickly ripping out the unicode escape support entirely. It
builds with warnings, but the point is to just get the size -- that
produced an array with only 28428 elements, and that's keeping all the
no-backup rules intact. This might be unworkable and/or ugly, but I
wonder if it's possible to pull unicode escape handling into the
parsing stage, with "UESCAPE" being a keyword token that we have to
peek ahead to check for. I'll look for other rules that could be more
easily optimized, but I'm not terribly optimistic.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

John Naylor

john.naylor@2ndquadrant.com

over 6 years ago

In reply to: John Naylor (#4)

1 attachment(s)

Re: benchmarking Flex practices

I wrote:

I'll look for other rules that could be more
easily optimized, but I'm not terribly optimistic.

I found a possible other way to bring the size of the transition table
under 32k entries while keeping the existing no-backup rules in place:
Replace the "quotecontinue" rule with a new state. In the attached
draft patch, when Flex encounters a quote while inside any kind of
quoted string, it saves the current state and enters %xqs (think
'quotestop'). If it then sees {whitespace_with_newline}{quote}, it
reenters the previous state and continues to slurp the string,
otherwise, it throws back everything and returns the string it just
exited. Doing it this way is a bit uglier, but with some extra
commentary it might not be too bad.
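
To make the intended behavior concrete, here is a hand-written C sketch of the decision the new state has to make once a closing quote has been seen. It is not the Flex code from the patch: the names are hypothetical, "--" comments are not treated as whitespace, and saving and restoring the start condition is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome names for this sketch; not identifiers from scan.l. */
typedef enum
{
    QS_CONTINUE,        /* whitespace containing a newline, then a quote */
    QS_SAME_LINE_QUOTE, /* a quote follows with no newline in between */
    QS_STOP             /* anything else, including end of input */
} QuoteStopResult;

/* Simplified whitespace test; real {whitespace} also covers comments. */
static bool
is_space_char(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
}

/*
 * Classify the text that follows a closing quote.  If consumed is not
 * NULL, it receives the number of characters to swallow: everything up
 * to and including the next quote for QS_CONTINUE, and 0 otherwise
 * (the "throw back everything" case).
 */
QuoteStopResult
classify_after_quote(const char *s, size_t *consumed)
{
    size_t      i = 0;
    bool        saw_newline = false;

    assert(s != NULL);

    while (is_space_char(s[i]))
    {
        if (s[i] == '\n' || s[i] == '\r')
            saw_newline = true;
        i++;
    }

    if (s[i] == '\'' && saw_newline)
    {
        /* resume the string that started on a previous line */
        if (consumed != NULL)
            *consumed = i + 1;
        return QS_CONTINUE;
    }

    if (consumed != NULL)
        *consumed = 0;
    return s[i] == '\'' ? QS_SAME_LINE_QUOTE : QS_STOP;
}
```

In this sketch, the text after the first closing quote in 'foo' followed by a newline and 'bar' yields QS_CONTINUE, while a second literal on the same line yields QS_SAME_LINE_QUOTE, which the draft patch reports as a syntax error.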

The array is now 30883 entries. That's still a bit close for comfort,
but shrinks the binary by 171kB on Linux x86-64 with Flex 2.6.4. The
bad news is I have these baffling backup states in my new rules:

State #133 is non-accepting -
associated rule line numbers:
551 554 564
out-transitions: [ \000-\377 ]
jam-transitions: EOF []

State #162 is non-accepting -
associated rule line numbers:
551 554 564
out-transitions: [ \000-\377 ]
jam-transitions: EOF []

2 backing up (non-accepting) states.

I already explicitly handle EOF, so I don't know what it's trying to
tell me. If it can be fixed while keeping the array size, I'll do
performance tests.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v1-lexer-redo-quote-continuation.patch (application/octet-stream)

diff --git a/src/backend/parser/Makefile b/src/backend/parser/Makefile
index f14febdbda..3a2459cb72 100644
--- a/src/backend/parser/Makefile
+++ b/src/backend/parser/Makefile
@@ -40,7 +40,6 @@ gram.c: BISON_CHECK_CMD = $(PERL) $(srcdir)/check_keywords.pl $< $(top_srcdir)/s
 
 
 scan.c: FLEXFLAGS = -CF -p -p
-scan.c: FLEX_NO_BACKUP=yes
 scan.c: FLEX_FIX_WARNING=yes
 
 
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index e1cae859e8..24f351229b 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -56,6 +56,8 @@ fprintf_to_ereport(const char *fmt, const char *msg)
 	ereport(ERROR, (errmsg_internal("%s", msg)));
 }
 
+static int state_before;
+
 /*
  * GUC variables.  This is a DIRECT violation of the warning given at the
  * head of gram.y, ie flex/bison code must not depend on any GUC variables;
@@ -168,6 +170,7 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
@@ -185,6 +188,7 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
@@ -231,19 +235,7 @@ special_whitespace		({space}+|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
-/*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
- */
 quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -476,21 +468,10 @@ other			.
 					startlit();
 					addlitchar('b', yyscanner);
 				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return BCONST;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					addlit(yytext, yyleng, yyscanner);
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					/* ignore */
-				}
 <xb><<EOF>>		{ yyerror("unterminated bit string literal"); }
 
 {xhstart}		{
@@ -505,13 +486,6 @@ other			.
 					startlit();
 					addlitchar('x', yyscanner);
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return XCONST;
-				}
 <xh><<EOF>>		{ yyerror("unterminated hexadecimal string literal"); }
 
 {xnstart}		{
@@ -568,28 +542,65 @@ other			.
 					BEGIN(xus);
 					startlit();
 				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
+
+<xb,xh,xq,xe,xus>{quote} {
+					state_before = YYSTATE;
+					BEGIN(xqs);
+				}
+<xqs>{whitespace_with_newline}{quote} {
+					/* resume scanning string that started on a previous line */
+					BEGIN(state_before);
+				}
+<xqs>{whitespace}*{quote} {
 					/*
-					 * check that the data remains valid if it might have been
-					 * made invalid by unescaping any chars.
+					 * SQL requires at least one newline in the whitespace separating
+					 * string literals that are to be concatenated, so throw an error
+					 * if we see the start of a new string on the same line.
 					 */
-					if (yyextra->saw_non_ascii)
-						pg_verifymbstr(yyextra->literalbuf,
-									   yyextra->literallen,
-									   false);
-					yylval->str = litbufdup(yyscanner);
-					return SCONST;
+					SET_YYLLOC();
+					ADVANCE_YYLLOC(yyleng - 1);
+					yyerror("syntax error");
 				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
-					/* xusend state looks for possible UESCAPE */
-					BEGIN(xusend);
+<xqs><<EOF>> |
+<xqs>{whitespace}*[^'] {
+					/* throw back everything and handle the string we just scanned */
+					yyless(0);
+
+					switch (state_before)
+					{
+						case xb:
+							BEGIN(INITIAL);
+							yylval->str = litbufdup(yyscanner);
+							return BCONST;
+						case xh:
+							BEGIN(INITIAL);
+							yylval->str = litbufdup(yyscanner);
+							return XCONST;
+						case xe:
+							/* fallthrough */
+						case xq:
+							BEGIN(INITIAL);
+
+							/*
+							 * check that the data remains valid if it might have been
+							 * made invalid by unescaping any chars.
+							 */
+							if (yyextra->saw_non_ascii)
+								pg_verifymbstr(yyextra->literalbuf,
+											   yyextra->literallen,
+											   false);
+							yylval->str = litbufdup(yyscanner);
+							return SCONST;
+						case xus:
+							/* xusend state looks for possible UESCAPE */
+							BEGIN(xusend);
+							break;
+						default:
+							yyerror("unhandled previous state in quote continuation");
+					}
+
 				}
+
 <xusend>{whitespace} {
 					/* stay in xusend state over whitespace */
 				}
@@ -693,9 +704,6 @@ other			.
 					if (c == '\0' || IS_HIGHBIT_SET(c))
 						yyextra->saw_non_ascii = true;
 				}
-<xq,xe,xus>{quotecontinue} {
-					/* ignore */
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					addlitchar(yytext[0], yyscanner);

John Naylor

john.naylor@2ndquadrant.com

over 6 years ago

In reply to: John Naylor (#5)

1 attachment(s)

Re: benchmarking Flex practices

I wrote:

I'll look for other rules that could be more
easily optimized, but I'm not terribly optimistic.

I found a possible other way to bring the size of the transition table
under 32k entries while keeping the existing no-backup rules in place:
Replace the "quotecontinue" rule with a new state. In the attached
draft patch, when Flex encounters a quote while inside any kind of
quoted string, it saves the current state and enters %xqs (think
'quotestop'). If it then sees {whitespace_with_newline}{quote}, it
reenters the previous state and continues to slurp the string,
otherwise, it throws back everything and returns the string it just
exited. Doing it this way is a bit uglier, but with some extra
commentary it might not be too bad.

I had an epiphany and managed to get rid of the backup states.
Regression tests pass. The array is down to 30367 entries and the
binary is smaller by 172kB on Linux x86-64. Performance is identical
to master on both tests mentioned upthread. I'll clean this up and add
it to the commitfest.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v2-lexer-redo-quote-continuation.patch (application/x-patch)

diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index e1cae859e8..67ad06da4f 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -56,6 +56,8 @@ fprintf_to_ereport(const char *fmt, const char *msg)
 	ereport(ERROR, (errmsg_internal("%s", msg)));
 }
 
+static int state_before;
+
 /*
  * GUC variables.  This is a DIRECT violation of the warning given at the
  * head of gram.y, ie flex/bison code must not depend on any GUC variables;
@@ -168,6 +170,7 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
@@ -185,6 +188,7 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
@@ -231,19 +235,7 @@ special_whitespace		({space}+|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
-/*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
- */
 quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -476,21 +468,10 @@ other			.
 					startlit();
 					addlitchar('b', yyscanner);
 				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return BCONST;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					addlit(yytext, yyleng, yyscanner);
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					/* ignore */
-				}
 <xb><<EOF>>		{ yyerror("unterminated bit string literal"); }
 
 {xhstart}		{
@@ -505,13 +486,6 @@ other			.
 					startlit();
 					addlitchar('x', yyscanner);
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return XCONST;
-				}
 <xh><<EOF>>		{ yyerror("unterminated hexadecimal string literal"); }
 
 {xnstart}		{
@@ -568,28 +542,65 @@ other			.
 					BEGIN(xus);
 					startlit();
 				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
+
+<xb,xh,xq,xe,xus>{quote} {
+					state_before = YYSTATE;
+					BEGIN(xqs);
+				}
+<xqs>{whitespace_with_newline}{quote} {
+					/* resume scanning string that started on a previous line */
+					BEGIN(state_before);
+				}
+<xqs>{quote} {
 					/*
-					 * check that the data remains valid if it might have been
-					 * made invalid by unescaping any chars.
+					 * SQL requires at least one newline in the whitespace separating
+					 * string literals that are to be concatenated, so throw an error
+					 * if we see the start of a new string on the same line.
 					 */
-					if (yyextra->saw_non_ascii)
-						pg_verifymbstr(yyextra->literalbuf,
-									   yyextra->literallen,
-									   false);
-					yylval->str = litbufdup(yyscanner);
-					return SCONST;
+					SET_YYLLOC();
+					ADVANCE_YYLLOC(yyleng - 1);
+					yyerror("syntax error");
 				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
-					/* xusend state looks for possible UESCAPE */
-					BEGIN(xusend);
+<xqs>{whitespace}*[^']? |
+<xqs><<EOF>> {
+					/* throw back everything and handle the string we just scanned */
+					yyless(0);
+
+					switch (state_before)
+					{
+						case xb:
+							BEGIN(INITIAL);
+							yylval->str = litbufdup(yyscanner);
+							return BCONST;
+						case xh:
+							BEGIN(INITIAL);
+							yylval->str = litbufdup(yyscanner);
+							return XCONST;
+						case xe:
+							/* fallthrough */
+						case xq:
+							BEGIN(INITIAL);
+
+							/*
+							 * check that the data remains valid if it might have been
+							 * made invalid by unescaping any chars.
+							 */
+							if (yyextra->saw_non_ascii)
+								pg_verifymbstr(yyextra->literalbuf,
+											   yyextra->literallen,
+											   false);
+							yylval->str = litbufdup(yyscanner);
+							return SCONST;
+						case xus:
+							/* xusend state looks for possible UESCAPE */
+							BEGIN(xusend);
+							break;
+						default:
+							yyerror("unhandled previous state in quote continuation");
+					}
+
 				}
+
 <xusend>{whitespace} {
 					/* stay in xusend state over whitespace */
 				}
@@ -693,9 +704,6 @@ other			.
 					if (c == '\0' || IS_HIGHBIT_SET(c))
 						yyextra->saw_non_ascii = true;
 				}
-<xq,xe,xus>{quotecontinue} {
-					/* ignore */
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					addlitchar(yytext[0], yyscanner);

John Naylor

john.naylor@2ndquadrant.com

over 6 years ago

In reply to: John Naylor (#6)

2 attachment(s)

Re: benchmarking Flex practices

I wrote:

I found a possible other way to bring the size of the transition table
under 32k entries while keeping the existing no-backup rules in place:
Replace the "quotecontinue" rule with a new state. In the attached
draft patch, when Flex encounters a quote while inside any kind of
quoted string, it saves the current state and enters %xqs (think
'quotestop'). If it then sees {whitespace_with_newline}{quote}, it
reenters the previous state and continues to slurp the string,
otherwise, it throws back everything and returns the string it just
exited. Doing it this way is a bit uglier, but with some extra
commentary it might not be too bad.

I had an epiphany and managed to get rid of the backup states.
Regression tests pass. The array is down to 30367 entries and the
binary is smaller by 172kB on Linux x86-64. Performance is identical
to master on both tests mentioned upthread. I'll clean this up and add
it to the commitfest.

For the commitfest:

0001 is a small patch to remove some unneeded generality from the
current rules. This lowers the number of elements in the yy_transition
array from 37045 to 36201.

0002 is a cleaned-up version of the above, bringing the size down to 29521.

I haven't changed psqlscan.l or pgc.l, in case this approach is
changed or rejected.

With the two together, the binary is about 175kB smaller than on HEAD.

I also couldn't resist playing around with the idea upthread to handle
Unicode escapes in parser.c, which further reduces the number of
yy_transition elements to 21068, leaving some headroom for future additions
without going back to 32-bit types in the transition array. It mostly
works, but it's quite ugly and breaks the token position handling for
unicode escape syntax errors, so it's not in a state to share.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v3-0001-Remove-some-unneeded-generality-from-the-core-Fle.patch (application/octet-stream)

From 5f7e0e4c1955260936e19446c304d7c3bf3c2acd Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Thu, 27 Jun 2019 13:36:58 +0800
Subject: [PATCH v3 1/2] Remove some unneeded generality from the core Flex
 rules

---
 src/backend/parser/scan.l | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index e1cae859e8..90f96c446f 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -218,7 +218,7 @@ non_newline		[^\n\r]
 
 comment			("--"{non_newline}*)
 
-whitespace		({space}+|{comment})
+whitespace		({space}|{comment})
 
 /*
  * SQL requires at least one newline in the whitespace separating
@@ -227,7 +227,7 @@ whitespace		({space}+|{comment})
  * it, whereas {whitespace} should generally have a * after it...
  */
 
-special_whitespace		({space}+|{comment}{newline})
+special_whitespace		({space}|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
@@ -696,8 +696,7 @@ other			.
 <xq,xe,xus>{quotecontinue} {
 					/* ignore */
 				}
-<xe>.			{
-					/* This is only needed for \ just before EOF */
+<xe>\\			{
 					addlitchar(yytext[0], yyscanner);
 				}
 <xq,xe,xus><<EOF>>		{ yyerror("unterminated quoted string"); }
@@ -741,8 +740,7 @@ other			.
 <xdolq>{dolqfailed} {
 					addlit(yytext, yyleng, yyscanner);
 				}
-<xdolq>.		{
-					/* This is only needed for $ inside the quoted text */
+<xdolq>\$		{
 					addlitchar(yytext[0], yyscanner);
 				}
 <xdolq><<EOF>>	{ yyerror("unterminated dollar-quoted string"); }
-- 
2.17.2 (Apple Git-113)

v3-0002-Replace-the-Flex-quotestop-rules-with-a-new-exclu.patch (application/octet-stream)

From 9b9b2882905409b91a26ee8f92961450af6591d7 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Thu, 27 Jun 2019 13:52:58 +0800
Subject: [PATCH v3 2/2] Replace the Flex quotestop rules with a new exclusive
 state

When Flex encounters a quote while inside any kind of quoted string,
it saves the current state and enters a new state in order to
detect string continuations, if any. This brings the number of
scanner states down to 29521, which is small enough to allow Flex to
use 16 bit types in the yy_transition array. This reduces the size
of the postgres binary by 171kB.
---
 src/backend/parser/scan.l | 110 ++++++++++++++++++++------------------
 1 file changed, 59 insertions(+), 51 deletions(-)

diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index 90f96c446f..525cef4b02 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -56,6 +56,8 @@ fprintf_to_ereport(const char *fmt, const char *msg)
 	ereport(ERROR, (errmsg_internal("%s", msg)));
 }
 
+static int state_before;
+
 /*
  * GUC variables.  This is a DIRECT violation of the warning given at the
  * head of gram.y, ie flex/bison code must not depend on any GUC variables;
@@ -168,6 +170,7 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
@@ -185,6 +188,7 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
@@ -231,19 +235,9 @@ special_whitespace		({space}|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
-/*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
- */
 quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+quotecontinue		{whitespace_with_newline}{quote}
+quotecontinuefail	{whitespace}*{other}?
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -476,21 +470,10 @@ other			.
 					startlit();
 					addlitchar('b', yyscanner);
 				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return BCONST;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					addlit(yytext, yyleng, yyscanner);
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					/* ignore */
-				}
 <xb><<EOF>>		{ yyerror("unterminated bit string literal"); }
 
 {xhstart}		{
@@ -505,13 +488,6 @@ other			.
 					startlit();
 					addlitchar('x', yyscanner);
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return XCONST;
-				}
 <xh><<EOF>>		{ yyerror("unterminated hexadecimal string literal"); }
 
 {xnstart}		{
@@ -568,28 +544,63 @@ other			.
 					BEGIN(xus);
 					startlit();
 				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
+
+<xb,xh,xq,xe,xus>{quote} {
 					/*
-					 * check that the data remains valid if it might have been
-					 * made invalid by unescaping any chars.
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the last quote was in
+					 * fact the end of the string.
 					 */
-					if (yyextra->saw_non_ascii)
-						pg_verifymbstr(yyextra->literalbuf,
-									   yyextra->literallen,
-									   false);
-					yylval->str = litbufdup(yyscanner);
-					return SCONST;
+					state_before = YYSTATE;
+					BEGIN(xqs);
 				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
-					/* xusend state looks for possible UESCAPE */
-					BEGIN(xusend);
+<xqs>{quotecontinue} {
+					BEGIN(state_before);
+				}
+<xqs><<EOF>> |
+<xqs>{quotecontinuefail} {
+					/*
+					 * throw back everything and handle the string
+					 * we scanned previously
+					 */
+					yyless(0);
+
+					switch (state_before)
+					{
+						case xb:
+							BEGIN(INITIAL);
+							yylval->str = litbufdup(yyscanner);
+							return BCONST;
+						case xh:
+							BEGIN(INITIAL);
+							yylval->str = litbufdup(yyscanner);
+							return XCONST;
+						case xe:
+							/* fallthrough */
+						case xq:
+							BEGIN(INITIAL);
+
+							/*
+							 * Check that the data remains valid if it
+							 * might have been made invalid by unescaping
+							 * any chars.
+							 */
+							if (yyextra->saw_non_ascii)
+								pg_verifymbstr(yyextra->literalbuf,
+											   yyextra->literallen,
+											   false);
+							yylval->str = litbufdup(yyscanner);
+							return SCONST;
+						case xus:
+							/* xusend state looks for possible UESCAPE */
+							BEGIN(xusend);
+							break;
+						default:
+							yyerror("unhandled previous state after endquote");
+					}
 				}
+
 <xusend>{whitespace} {
 					/* stay in xusend state over whitespace */
 				}
@@ -693,9 +704,6 @@ other			.
 					if (c == '\0' || IS_HIGHBIT_SET(c))
 						yyextra->saw_non_ascii = true;
 				}
-<xq,xe,xus>{quotecontinue} {
-					/* ignore */
-				}
 <xe>\\			{
 					addlitchar(yytext[0], yyscanner);
 				}
-- 
2.17.2 (Apple Git-113)

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: John Naylor (#7)

Re: benchmarking Flex practices

John Naylor <john.naylor@2ndquadrant.com> writes:

> 0001 is a small patch to remove some unneeded generality from the
> current rules. This lowers the number of elements in the yy_transition
> array from 37045 to 36201.

I don't particularly like 0001. The two bits like this

-whitespace		({space}+|{comment})
+whitespace		({space}|{comment})

seem likely to create performance problems for runs of whitespace, in that
the lexer will now have to execute the associated action once per space
character not just once for the whole run. Those actions are empty, but
I don't think flex optimizes for that, and it's really flex's per-action
overhead that I'm worried about. Note the comment in the "Performance"
section of the flex manual:

Another area where the user can increase a scanner's performance (and
one that's easier to implement) arises from the fact that the longer
the tokens matched, the faster the scanner will run. This is because
with long tokens the processing of most input characters takes place
in the (short) inner scanning loop, and does not often have to go
through the additional work of setting up the scanning environment
(e.g., `yytext') for the action.

There are a bunch of higher-order productions that use "{whitespace}*",
which is surely a bit redundant given the contents of {whitespace}.
But maybe we could address that by replacing "{whitespace}*" with
"{opt_whitespace}" defined as

opt_whitespace ({space}*|{comment})

Not sure what impact if any that'd have on table size, but I'm quite sure
that {whitespace} was defined with an eye to avoiding unnecessary
lexer action cycles.

As for the other two bits that are like

-<xe>.			{
-					/* This is only needed for \ just before EOF */
+<xe>\\			{

my recollection is that those productions are defined that way to avoid a
flex warning about not all possible input characters being accounted for
in the <xe> (resp. <xdolq>) state. Maybe that warning is
flex-version-dependent, or maybe this was just a worry and not something
that actually produced a warning ... but I'm hesitant to change it.
If we ever did get to flex's default action, that action is to echo the
current input character to stdout, which would be Very Bad.

As far as I can see, the point of 0002 is to have just one set of
flex rules for the various variants of quotecontinue processing.
That sounds OK, though I'm a bit surprised it makes this much difference
in the table size. I would suggest that "state_before" needs a less
generic name (maybe "state_before_xqs"?) and more than no comment.
Possibly more to the point, it's not okay to have static state variables
in the core scanner, so that variable needs to be kept in yyextra.
(Don't remember offhand whether it's any more acceptable in the other
scanners.)

regards, tom lane

John Naylor

john.naylor@2ndquadrant.com

over 6 years ago

In reply to: Tom Lane (#8)

Re: benchmarking Flex practices

On Wed, Jul 3, 2019 at 5:35 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> John Naylor <john.naylor@2ndquadrant.com> writes:
>> 0001 is a small patch to remove some unneeded generality from the
>> current rules. This lowers the number of elements in the yy_transition
>> array from 37045 to 36201.
>
> I don't particularly like 0001. The two bits like this
> -whitespace             ({space}+|{comment})
> +whitespace             ({space}|{comment})
> seem likely to create performance problems for runs of whitespace, in that
> the lexer will now have to execute the associated action once per space
> character not just once for the whole run.

Okay.

> There are a bunch of higher-order productions that use "{whitespace}*",
> which is surely a bit redundant given the contents of {whitespace}.
> But maybe we could address that by replacing "{whitespace}*" with
> "{opt_whitespace}" defined as
>
> opt_whitespace ({space}*|{comment})
>
> Not sure what impact if any that'd have on table size, but I'm quite sure
> that {whitespace} was defined with an eye to avoiding unnecessary
> lexer action cycles.

It turns out that {opt_whitespace} as defined above is not equivalent
to {whitespace}*, since the former is either a single comment or a
single run of 0 or more whitespace chars (if I understand correctly).
Using {opt_whitespace} for the UESCAPE rules on top of v3-0002, the
regression tests pass, but queries like this fail with a syntax error:

# select U&'d!0061t!+000061' uescape --comment
'!';

There was in fact a substantial size reduction, though, so for
curiosity's sake I tried just replacing {whitespace}* with {space}* in
the UESCAPE rules, and the table shrank from 30367 (that's with 0002
only) to 24661.

> As for the other two bits that are like
> -<xe>.                  {
> -                                       /* This is only needed for \ just before EOF */
> +<xe>\\                 {
> my recollection is that those productions are defined that way to avoid a
> flex warning about not all possible input characters being accounted for
> in the <xe> (resp. <xdolq>) state. Maybe that warning is
> flex-version-dependent, or maybe this was just a worry and not something
> that actually produced a warning ... but I'm hesitant to change it.
> If we ever did get to flex's default action, that action is to echo the
> current input character to stdout, which would be Very Bad.

FWIW, I tried Flex 2.5.35 and 2.6.4 and got no warnings, although I did
get a warning when I deleted either of those two rules. I'll leave this
change out for now, since it was only good for ~500 fewer elements in
the transition array.

> As far as I can see, the point of 0002 is to have just one set of
> flex rules for the various variants of quotecontinue processing.
> That sounds OK, though I'm a bit surprised it makes this much difference
> in the table size. I would suggest that "state_before" needs a less
> generic name (maybe "state_before_xqs"?) and more than no comment.
> Possibly more to the point, it's not okay to have static state variables
> in the core scanner, so that variable needs to be kept in yyextra.
> (Don't remember offhand whether it's any more acceptable in the other
> scanners.)

Ah yes, I got this idea from the ECPG scanner, which is not reentrant. Will fix.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10

John Naylor

john.naylor@2ndquadrant.com

over 6 years ago

In reply to: Tom Lane (#8)

3 attachment(s)

Re: benchmarking Flex practices

On Wed, Jul 3, 2019 at 5:35 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> As far as I can see, the point of 0002 is to have just one set of
> flex rules for the various variants of quotecontinue processing.
> That sounds OK, though I'm a bit surprised it makes this much difference
> in the table size. I would suggest that "state_before" needs a less
> generic name (maybe "state_before_xqs"?) and more than no comment.
> Possibly more to the point, it's not okay to have static state variables
> in the core scanner, so that variable needs to be kept in yyextra.

v4-0001 is basically the same as v3-0002, with the state variable in
yyextra. Since follow-on patches use it as well, I've named it
state_before_quote_stop. I failed to come up with a nicer short name.
With this applied, the transition table is reduced from 37045 to
30367. Since that's uncomfortably close to the 32k limit for 16 bit
members, I hacked away further at UESCAPE bloat.

0002 unifies xusend and xuiend by saving the state of xui as well.
This actually causes a performance regression, but it's more of a
refactoring patch to avoid having to create two additional
start conditions in 0003 (of course it could be done that way if
desired, but the savings won't be as great). In any case, the table is
now down to 26074.

0003 creates a separate start condition so that UESCAPE and the
expected quoted character after it are detected in separate states.
This allows us to use standard whitespace skipping techniques and also
to greatly simplify the uescapefail rule. The final size of the table
is 23696. Removing UESCAPE entirely results in 21860, so this is likely
the most compact size for this feature.

Performance is very similar to HEAD. Parsing the information schema
might be a hair faster and pgbench-like queries with simple strings a
hair slower, but the difference seems within the noise of variation.
Parsing strings with UESCAPE likewise seems about the same.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v4-0001-Replace-the-Flex-quotestop-rules-with-a-new-exclu.patch (application/octet-stream)

From f854b4c50cd93c2149199112923f1ecdd4c66c11 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Fri, 5 Jul 2019 14:04:13 +0700
Subject: [PATCH v4 1/3] Replace the Flex quotestop rules with a new exclusive
 state

When Flex encounters a quote while inside any kind of quoted string,
it saves the current state and enters a new state in order to
detect possible string continuations. This brings the number of
scanner states from 37045 to 30367, which is small enough to allow
Flex to use 16-bit types in the yy_transition array.
---
 src/backend/parser/scan.l    | 108 ++++++++++++++++++-----------------
 src/include/parser/scanner.h |   3 +
 2 files changed, 60 insertions(+), 51 deletions(-)

diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index e1cae859e8..cbf3f6deca 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -168,6 +168,7 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
@@ -185,6 +186,7 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
@@ -231,19 +233,9 @@ special_whitespace		({space}+|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
-/*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
- */
 quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+quotecontinue		{whitespace_with_newline}{quote}
+quotecontinuefail	{whitespace}*{other}?
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -476,21 +468,10 @@ other			.
 					startlit();
 					addlitchar('b', yyscanner);
 				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return BCONST;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					addlit(yytext, yyleng, yyscanner);
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					/* ignore */
-				}
 <xb><<EOF>>		{ yyerror("unterminated bit string literal"); }
 
 {xhstart}		{
@@ -505,13 +486,6 @@ other			.
 					startlit();
 					addlitchar('x', yyscanner);
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return XCONST;
-				}
 <xh><<EOF>>		{ yyerror("unterminated hexadecimal string literal"); }
 
 {xnstart}		{
@@ -568,28 +542,63 @@ other			.
 					BEGIN(xus);
 					startlit();
 				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
+
+<xb,xh,xq,xe,xus>{quote} {
 					/*
-					 * check that the data remains valid if it might have been
-					 * made invalid by unescaping any chars.
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the last quote was in
+					 * fact the end of the string.
 					 */
-					if (yyextra->saw_non_ascii)
-						pg_verifymbstr(yyextra->literalbuf,
-									   yyextra->literallen,
-									   false);
-					yylval->str = litbufdup(yyscanner);
-					return SCONST;
+					yyextra->state_before_quote_stop = YYSTATE;
+					BEGIN(xqs);
 				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
-					/* xusend state looks for possible UESCAPE */
-					BEGIN(xusend);
+<xqs>{quotecontinue} {
+					BEGIN(yyextra->state_before_quote_stop);
+				}
+<xqs><<EOF>> |
+<xqs>{quotecontinuefail} {
+					/*
+					 * throw back everything and handle the string
+					 * we scanned previously
+					 */
+					yyless(0);
+
+					switch (yyextra->state_before_quote_stop)
+					{
+						case xb:
+							BEGIN(INITIAL);
+							yylval->str = litbufdup(yyscanner);
+							return BCONST;
+						case xh:
+							BEGIN(INITIAL);
+							yylval->str = litbufdup(yyscanner);
+							return XCONST;
+						case xe:
+							/* fallthrough */
+						case xq:
+							BEGIN(INITIAL);
+
+							/*
+							 * Check that the data remains valid if it
+							 * might have been made invalid by unescaping
+							 * any chars.
+							 */
+							if (yyextra->saw_non_ascii)
+								pg_verifymbstr(yyextra->literalbuf,
+											   yyextra->literallen,
+											   false);
+							yylval->str = litbufdup(yyscanner);
+							return SCONST;
+						case xus:
+							/* xusend state looks for possible UESCAPE */
+							BEGIN(xusend);
+							break;
+						default:
+							yyerror("unhandled previous state after endquote");
+					}
 				}
+
 <xusend>{whitespace} {
 					/* stay in xusend state over whitespace */
 				}
@@ -693,9 +702,6 @@ other			.
 					if (c == '\0' || IS_HIGHBIT_SET(c))
 						yyextra->saw_non_ascii = true;
 				}
-<xq,xe,xus>{quotecontinue} {
-					/* ignore */
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					addlitchar(yytext[0], yyscanner);
diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h
index 731a2bd264..9b5f5eaad1 100644
--- a/src/include/parser/scanner.h
+++ b/src/include/parser/scanner.h
@@ -99,6 +99,9 @@ typedef struct core_yy_extra_type
 	int			literallen;		/* actual current string length */
 	int			literalalloc;	/* current allocated buffer size */
 
+	/* start condition when end quote is detected */
+	int			state_before_quote_stop;
+
 	int			xcdepth;		/* depth of nesting in slash-star comments */
 	char	   *dolqstart;		/* current $foo$ quote start string */
 
-- 
2.17.2 (Apple Git-113)

v4-0002-Unify-xuiend-and-xusend-into-a-single-start-condi.patch (application/octet-stream)

From 9a5dfd7172aaf588612fe820f26e3134270a6eec Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Fri, 5 Jul 2019 14:22:42 +0700
Subject: [PATCH v4 2/3] Unify xuiend and xusend into a single start condition

Whether scanning a string or an identifier with unicode escapes, we
enter a single state to look for a possible UESCAPE. This shrinks
the transition array to 26074.
---
 src/backend/parser/scan.l | 127 +++++++++++++++++++-------------------
 1 file changed, 63 insertions(+), 64 deletions(-)

diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index cbf3f6deca..c0aa6cd22e 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -172,9 +172,9 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
- *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
  *  <xus> quoted string with Unicode escapes
- *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
+ *  <xuend> end of a quoted string or identifier with Unicode escapes,
+ *    UESCAPE can follow
  *  <xeu> Unicode surrogate pair in extended quoted string
  *
  * Remember to add an <<EOF>> case whenever you add a new exclusive state!
@@ -190,9 +190,8 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xe
 %x xdolq
 %x xui
-%x xuiend
 %x xus
-%x xusend
+%x xuend
 %x xeu
 
 /*
@@ -591,39 +590,14 @@ other			.
 							yylval->str = litbufdup(yyscanner);
 							return SCONST;
 						case xus:
-							/* xusend state looks for possible UESCAPE */
-							BEGIN(xusend);
+							/* xuend state looks for possible UESCAPE */
+							BEGIN(xuend);
 							break;
 						default:
 							yyerror("unhandled previous state after endquote");
 					}
 				}
 
-<xusend>{whitespace} {
-					/* stay in xusend state over whitespace */
-				}
-<xusend><<EOF>> |
-<xusend>{other} |
-<xusend>{xustop1} {
-					/* no UESCAPE after the quote, throw back everything */
-					yyless(0);
-					BEGIN(INITIAL);
-					yylval->str = litbuf_udeescape('\\', yyscanner);
-					return SCONST;
-				}
-<xusend>{xustop2} {
-					/* found UESCAPE after the end quote */
-					BEGIN(INITIAL);
-					if (!check_uescapechar(yytext[yyleng - 2]))
-					{
-						SET_YYLLOC();
-						ADVANCE_YYLLOC(yyleng - 2);
-						yyerror("invalid Unicode escape character");
-					}
-					yylval->str = litbuf_udeescape(yytext[yyleng - 2],
-												   yyscanner);
-					return SCONST;
-				}
 <xq,xe,xus>{xqdouble} {
 					addlitchar('\'', yyscanner);
 				}
@@ -776,52 +750,77 @@ other			.
 					return IDENT;
 				}
 <xui>{dquote} {
-					yyless(1);
-					/* xuiend state looks for possible UESCAPE */
-					BEGIN(xuiend);
+					/* xuend state looks for possible UESCAPE */
+					yyextra->state_before_quote_stop = YYSTATE;
+					BEGIN(xuend);
 				}
-<xuiend>{whitespace} {
-					/* stay in xuiend state over whitespace */
+
+<xuend>{whitespace} {
+					/* stay in xuend state over whitespace */
 				}
-<xuiend><<EOF>> |
-<xuiend>{other} |
-<xuiend>{xustop1} {
+<xuend><<EOF>> |
+<xuend>{other} |
+<xuend>{xustop1} {
 					/* no UESCAPE after the quote, throw back everything */
-					char	   *ident;
-					int			identlen;
-
 					yyless(0);
 
-					BEGIN(INITIAL);
-					if (yyextra->literallen == 0)
-						yyerror("zero-length delimited identifier");
-					ident = litbuf_udeescape('\\', yyscanner);
-					identlen = strlen(ident);
-					if (identlen >= NAMEDATALEN)
-						truncate_identifier(ident, identlen, true);
-					yylval->str = ident;
-					return IDENT;
+					if (yyextra->state_before_quote_stop == xus)
+					{
+						BEGIN(INITIAL);
+						yylval->str = litbuf_udeescape('\\', yyscanner);
+						return SCONST;
+					}
+					else if (yyextra->state_before_quote_stop == xui)
+					{
+						char	   *ident;
+						int			identlen;
+
+						BEGIN(INITIAL);
+						if (yyextra->literallen == 0)
+							yyerror("zero-length delimited identifier");
+						ident = litbuf_udeescape('\\', yyscanner);
+						identlen = strlen(ident);
+						if (identlen >= NAMEDATALEN)
+							truncate_identifier(ident, identlen, true);
+						yylval->str = ident;
+						return IDENT;
+					}
+					else
+						yyerror("unhandled previous state in xuend");
 				}
-<xuiend>{xustop2}	{
+<xuend>{xustop2} {
 					/* found UESCAPE after the end quote */
-					char	   *ident;
-					int			identlen;
-
-					BEGIN(INITIAL);
-					if (yyextra->literallen == 0)
-						yyerror("zero-length delimited identifier");
 					if (!check_uescapechar(yytext[yyleng - 2]))
 					{
 						SET_YYLLOC();
 						ADVANCE_YYLLOC(yyleng - 2);
 						yyerror("invalid Unicode escape character");
 					}
-					ident = litbuf_udeescape(yytext[yyleng - 2], yyscanner);
-					identlen = strlen(ident);
-					if (identlen >= NAMEDATALEN)
-						truncate_identifier(ident, identlen, true);
-					yylval->str = ident;
-					return IDENT;
+
+					if (yyextra->state_before_quote_stop == xus)
+					{
+						BEGIN(INITIAL);
+						yylval->str = litbuf_udeescape(yytext[yyleng - 2],
+													   yyscanner);
+						return SCONST;
+					}
+					else if (yyextra->state_before_quote_stop == xui)
+					{
+						char	   *ident;
+						int			identlen;
+
+						BEGIN(INITIAL);
+						if (yyextra->literallen == 0)
+							yyerror("zero-length delimited identifier");
+						ident = litbuf_udeescape(yytext[yyleng - 2], yyscanner);
+						identlen = strlen(ident);
+						if (identlen >= NAMEDATALEN)
+							truncate_identifier(ident, identlen, true);
+						yylval->str = ident;
+						return IDENT;
+					}
+					else
+						yyerror("unhandled previous state in xuend");
 				}
 <xd,xui>{xddouble}	{
 					addlitchar('"', yyscanner);
-- 
2.17.2 (Apple Git-113)

v4-0003-Use-separate-start-conditions-for-both-UESCAPE-an.patch (application/octet-stream)

From 8295efb9994e28c8b0c9b0e4992c1ed3cf891791 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Fri, 5 Jul 2019 14:26:00 +0700
Subject: [PATCH v4 3/3] Use separate start conditions for both UESCAPE and the
 following character.

This shrinks the transition array to 23696 elements and simplifies the
uescape/uescapefail rules.
---
 src/backend/parser/scan.l | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index c0aa6cd22e..1837636273 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -175,6 +175,7 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xus> quoted string with Unicode escapes
  *  <xuend> end of a quoted string or identifier with Unicode escapes,
  *    UESCAPE can follow
+ *  <xuchar> escape character for Unicode escapes
  *  <xeu> Unicode surrogate pair in extended quoted string
  *
  * Remember to add an <<EOF>> case whenever you add a new exclusive state!
@@ -192,6 +193,7 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xui
 %x xus
 %x xuend
+%x xuchar
 %x xeu
 
 /*
@@ -295,10 +297,14 @@ xdstop			{dquote}
 xddouble		{dquote}{dquote}
 xdinside		[^"]+
 
-/* Unicode escapes */
-uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
+/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
+uescape			[uU][eE][sS][cC][aA][pP][eE]
 /* error rule to avoid backup */
-uescapefail		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
+uescapefail		[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
+
+/* escape character */
+uescchar		{quote}[^']{quote}
+uesccharfail	{quote}[^']|{other}
 
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
@@ -306,9 +312,8 @@ xuistart		[uU]&{dquote}
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
 
-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1		{uescapefail}?
-xustop2		{uescape}
+/* End of string or identifier with Unicode escapes but no UESCAPE */
+xustop			{uescapefail}?
 
 /* error rule to avoid backup */
 xufailed		[uU]&
@@ -755,12 +760,12 @@ other			.
 					BEGIN(xuend);
 				}
 
-<xuend>{whitespace} {
-					/* stay in xuend state over whitespace */
+<xuend,xuchar>{whitespace} {
+					/* stay in xuend/xuchar state over whitespace */
 				}
 <xuend><<EOF>> |
 <xuend>{other} |
-<xuend>{xustop1} {
+<xuend>{xustop} {
 					/* no UESCAPE after the quote, throw back everything */
 					yyless(0);
 
@@ -788,8 +793,11 @@ other			.
 					else
 						yyerror("unhandled previous state in xuend");
 				}
-<xuend>{xustop2} {
+<xuend>{uescape} {
 					/* found UESCAPE after the end quote */
+					BEGIN(xuchar);
+				}
+<xuchar>{uescchar} {
 					if (!check_uescapechar(yytext[yyleng - 2]))
 					{
 						SET_YYLLOC();
@@ -820,8 +828,14 @@ other			.
 						return IDENT;
 					}
 					else
-						yyerror("unhandled previous state in xuend");
+						yyerror("unhandled previous state in xuchar");
+				}
+<xuchar><<EOF>> |
+<xuchar>{uesccharfail} {
+					SET_YYLLOC();
+					yyerror("missing or invalid Unicode escape character");
 				}
+
 <xd,xui>{xddouble}	{
 					addlitchar('"', yyscanner);
 				}
-- 
2.17.2 (Apple Git-113)

#11

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: John Naylor (#10)

1 attachment(s)

Re: benchmarking Flex practices

John Naylor <john.naylor@2ndquadrant.com> writes:

> [ v4 patches for trimming lexer table size ]

I reviewed this and it looks pretty solid. One gripe I have is
that I think it's best to limit backup-prevention tokens such as
quotecontinuefail so that they match only exact prefixes of their
"success" tokens. This seems clearer to me, and in at least some cases
it can save a few flex states. The attached v5 patch does it like that
and gets us down to 22331 states (from 23696). In some places it looks
like you did that to avoid writing an explicit "{other}" match rule for
an exclusive state, but I think it's better for readability and
separation of concerns to go ahead and have those explicit rules
(and it seems to make no difference table-size-wise).
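
As an illustration only (this is not code from the patch), the "exact prefixes" rule can be sketched in C. With the v5 definition, {uescapefail} matches the non-empty proper prefixes of UESCAPE ("U" through "UESCAP", in any letter case) and nothing else. The helper name is_proper_prefix_ci is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of scan.l: returns true if the first "len"
 * bytes of "s" form a non-empty proper prefix of "keyword", compared
 * case-insensitively.  For keyword "UESCAPE" this is the set of strings
 * that the v5 {uescapefail} pattern matches.
 */
static bool
is_proper_prefix_ci(const char *s, size_t len, const char *keyword)
{
	size_t		i;

	assert(s != NULL && keyword != NULL);

	if (len == 0)
		return false;

	for (i = 0; i < len; i++)
	{
		/* ran off the end of keyword: s is too long to be a prefix */
		if (keyword[i] == '\0')
			return false;
		if (toupper((unsigned char) s[i]) !=
			toupper((unsigned char) keyword[i]))
			return false;
	}

	/* a proper prefix leaves at least one keyword character unmatched */
	return keyword[len] != '\0';
}
```

The full keyword is deliberately rejected: that case belongs to the "success" token {uescape}, so the two patterns never overlap.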

I also made some cosmetic changes (mostly improving comments) and
smashed the patch series down to 1 patch, because I preferred to
review it that way and we're not really going to commit these
separately.

I did a little bit of portability testing, to the extent of verifying
that the oldest and newest Flex versions I have handy (2.5.33 and 2.6.4)
agree on the table size change and get through regression tests. So
I think we should be good from that end.
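
For reference, the table size change can be estimated with a little arithmetic. The sketch below is not from the patch. It assumes, following the first message in this thread, that Flex emits 16-bit yy_trans_info members when the yy_transition array has fewer than INT16_MAX elements and 32-bit members otherwise. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the two yy_trans_info layouts shown earlier in the thread. */
struct trans_info32
{
	int32_t		verify;
	int32_t		nxt;
};

struct trans_info16
{
	int16_t		verify;
	int16_t		nxt;
};

/*
 * Hypothetical helper: estimated size in bytes of a transition table with
 * "nelems" entries.  Assumes 16-bit members are used when nelems is below
 * INT16_MAX and 32-bit members otherwise.
 */
static size_t
transition_table_bytes(size_t nelems)
{
	assert(nelems > 0);

	if (nelems < (size_t) INT16_MAX)
		return nelems * sizeof(struct trans_info16);
	return nelems * sizeof(struct trans_info32);
}
```

Under those assumptions, 37045 elements take 296360 bytes and 22331 elements take 89324 bytes, a difference of 207036 bytes, which is in line with the "about 200kB" figure quoted for the binary size.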

We still need to propagate these changes into the psql and ecpg lexers,
but I assume you were waiting to agree on the core patch before touching
those. If you're good with the changes I made here, have at it.

regards, tom lane

Attachments:

v5-smaller-scanner-tables.patch (text/x-diff; charset=us-ascii)

diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index e1cae85..899da09 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -168,12 +168,14 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
- *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
  *  <xus> quoted string with Unicode escapes
- *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
+ *  <xuend> end of a quoted string or identifier with Unicode escapes,
+ *    UESCAPE can follow
+ *  <xuchar> expecting escape character literal after UESCAPE
  *  <xeu> Unicode surrogate pair in extended quoted string
  *
  * Remember to add an <<EOF>> case whenever you add a new exclusive state!
@@ -185,12 +187,13 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
-%x xuiend
 %x xus
-%x xusend
+%x xuend
+%x xuchar
 %x xeu
 
 /*
@@ -231,19 +234,18 @@ special_whitespace		({space}+|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
+quote			'
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue	{whitespace_with_newline}{quote}
+
 /*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
  */
-quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+quotecontinuefail	{whitespace}*"-"?
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -304,10 +306,15 @@ xdstop			{dquote}
 xddouble		{dquote}{dquote}
 xdinside		[^"]+
 
-/* Unicode escapes */
-uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
+/* Optional UESCAPE after a quoted string or identifier with Unicode escapes */
+uescape			[uU][eE][sS][cC][aA][pP][eE]
+/* error rule to avoid backup */
+uescapefail		[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
+
+/* escape character literal */
+uescchar		{quote}[^']{quote}
 /* error rule to avoid backup */
-uescapefail		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
+uesccharfail	{quote}[^']|{quote}
 
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
@@ -315,10 +322,6 @@ xuistart		[uU]&{dquote}
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
 
-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1		{uescapefail}?
-xustop2		{uescape}
-
 /* error rule to avoid backup */
 xufailed		[uU]&
 
@@ -476,21 +479,10 @@ other			.
 					startlit();
 					addlitchar('b', yyscanner);
 				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return BCONST;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					addlit(yytext, yyleng, yyscanner);
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					/* ignore */
-				}
 <xb><<EOF>>		{ yyerror("unterminated bit string literal"); }
 
 {xhstart}		{
@@ -505,13 +497,6 @@ other			.
 					startlit();
 					addlitchar('x', yyscanner);
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return XCONST;
-				}
 <xh><<EOF>>		{ yyerror("unterminated hexadecimal string literal"); }
 
 {xnstart}		{
@@ -568,53 +553,71 @@ other			.
 					BEGIN(xus);
 					startlit();
 				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
+
+<xb,xh,xq,xe,xus>{quote} {
 					/*
-					 * check that the data remains valid if it might have been
-					 * made invalid by unescaping any chars.
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the end quote was in fact
+					 * the end of the string.  To reduce the lexer table size,
+					 * we use a single "xqs" state to do the lookahead for all
+					 * types of strings.
 					 */
-					if (yyextra->saw_non_ascii)
-						pg_verifymbstr(yyextra->literalbuf,
-									   yyextra->literallen,
-									   false);
-					yylval->str = litbufdup(yyscanner);
-					return SCONST;
-				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
-					/* xusend state looks for possible UESCAPE */
-					BEGIN(xusend);
+					yyextra->state_before_quote_stop = YYSTATE;
+					BEGIN(xqs);
 				}
-<xusend>{whitespace} {
-					/* stay in xusend state over whitespace */
+<xqs>{quotecontinue} {
+					/*
+					 * Found a quote continuation, so return to the in-quote
+					 * state and continue scanning the literal.
+					 */
+					BEGIN(yyextra->state_before_quote_stop);
 				}
-<xusend><<EOF>> |
-<xusend>{other} |
-<xusend>{xustop1} {
-					/* no UESCAPE after the quote, throw back everything */
+<xqs>{quotecontinuefail} |
+<xqs>{other} |
+<xqs><<EOF>> {
+					/*
+					 * Failed to see a quote continuation.  Throw back
+					 * everything after the end quote, and handle the string
+					 * according to the state we were in previously.
+					 */
 					yyless(0);
-					BEGIN(INITIAL);
-					yylval->str = litbuf_udeescape('\\', yyscanner);
-					return SCONST;
-				}
-<xusend>{xustop2} {
-					/* found UESCAPE after the end quote */
-					BEGIN(INITIAL);
-					if (!check_uescapechar(yytext[yyleng - 2]))
+
+					switch (yyextra->state_before_quote_stop)
 					{
-						SET_YYLLOC();
-						ADVANCE_YYLLOC(yyleng - 2);
-						yyerror("invalid Unicode escape character");
+						case xb:
+							BEGIN(INITIAL);
+							yylval->str = litbufdup(yyscanner);
+							return BCONST;
+						case xh:
+							BEGIN(INITIAL);
+							yylval->str = litbufdup(yyscanner);
+							return XCONST;
+						case xe:
+							/* fallthrough */
+						case xq:
+							BEGIN(INITIAL);
+
+							/*
+							 * Check that the data remains valid if it
+							 * might have been made invalid by unescaping
+							 * any chars.
+							 */
+							if (yyextra->saw_non_ascii)
+								pg_verifymbstr(yyextra->literalbuf,
+											   yyextra->literallen,
+											   false);
+							yylval->str = litbufdup(yyscanner);
+							return SCONST;
+						case xus:
+							/* xuend state looks for possible UESCAPE */
+							BEGIN(xuend);
+							break;
+						default:
+							yyerror("unhandled previous state in xqs");
 					}
-					yylval->str = litbuf_udeescape(yytext[yyleng - 2],
-												   yyscanner);
-					return SCONST;
 				}
+
 <xq,xe,xus>{xqdouble} {
 					addlitchar('\'', yyscanner);
 				}
@@ -693,9 +696,6 @@ other			.
 					if (c == '\0' || IS_HIGHBIT_SET(c))
 						yyextra->saw_non_ascii = true;
 				}
-<xq,xe,xus>{quotecontinue} {
-					/* ignore */
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					addlitchar(yytext[0], yyscanner);
@@ -770,53 +770,89 @@ other			.
 					return IDENT;
 				}
 <xui>{dquote} {
-					yyless(1);
-					/* xuiend state looks for possible UESCAPE */
-					BEGIN(xuiend);
+					/* xuend state looks for possible UESCAPE */
+					yyextra->state_before_quote_stop = YYSTATE;
+					BEGIN(xuend);
 				}
-<xuiend>{whitespace} {
-					/* stay in xuiend state over whitespace */
+
+<xuend,xuchar>{whitespace} {
+					/* stay in xuend or xuchar state over whitespace */
 				}
-<xuiend><<EOF>> |
-<xuiend>{other} |
-<xuiend>{xustop1} {
+<xuend>{uescapefail} |
+<xuend>{other} |
+<xuend><<EOF>> {
 					/* no UESCAPE after the quote, throw back everything */
-					char	   *ident;
-					int			identlen;
-
 					yyless(0);
 
-					BEGIN(INITIAL);
-					if (yyextra->literallen == 0)
-						yyerror("zero-length delimited identifier");
-					ident = litbuf_udeescape('\\', yyscanner);
-					identlen = strlen(ident);
-					if (identlen >= NAMEDATALEN)
-						truncate_identifier(ident, identlen, true);
-					yylval->str = ident;
-					return IDENT;
+					if (yyextra->state_before_quote_stop == xus)
+					{
+						BEGIN(INITIAL);
+						yylval->str = litbuf_udeescape('\\', yyscanner);
+						return SCONST;
+					}
+					else if (yyextra->state_before_quote_stop == xui)
+					{
+						char	   *ident;
+						int			identlen;
+
+						BEGIN(INITIAL);
+						if (yyextra->literallen == 0)
+							yyerror("zero-length delimited identifier");
+						ident = litbuf_udeescape('\\', yyscanner);
+						identlen = strlen(ident);
+						if (identlen >= NAMEDATALEN)
+							truncate_identifier(ident, identlen, true);
+						yylval->str = ident;
+						return IDENT;
+					}
+					else
+						yyerror("unhandled previous state in xuend");
 				}
-<xuiend>{xustop2}	{
+<xuend>{uescape} {
 					/* found UESCAPE after the end quote */
-					char	   *ident;
-					int			identlen;
-
-					BEGIN(INITIAL);
-					if (yyextra->literallen == 0)
-						yyerror("zero-length delimited identifier");
+					BEGIN(xuchar);
+				}
+<xuchar>{uescchar} {
+					/* found escape character literal after UESCAPE */
 					if (!check_uescapechar(yytext[yyleng - 2]))
 					{
 						SET_YYLLOC();
 						ADVANCE_YYLLOC(yyleng - 2);
 						yyerror("invalid Unicode escape character");
 					}
-					ident = litbuf_udeescape(yytext[yyleng - 2], yyscanner);
-					identlen = strlen(ident);
-					if (identlen >= NAMEDATALEN)
-						truncate_identifier(ident, identlen, true);
-					yylval->str = ident;
-					return IDENT;
+
+					if (yyextra->state_before_quote_stop == xus)
+					{
+						BEGIN(INITIAL);
+						yylval->str = litbuf_udeescape(yytext[yyleng - 2],
+													   yyscanner);
+						return SCONST;
+					}
+					else if (yyextra->state_before_quote_stop == xui)
+					{
+						char	   *ident;
+						int			identlen;
+
+						BEGIN(INITIAL);
+						if (yyextra->literallen == 0)
+							yyerror("zero-length delimited identifier");
+						ident = litbuf_udeescape(yytext[yyleng - 2], yyscanner);
+						identlen = strlen(ident);
+						if (identlen >= NAMEDATALEN)
+							truncate_identifier(ident, identlen, true);
+						yylval->str = ident;
+						return IDENT;
+					}
+					else
+						yyerror("unhandled previous state in xuchar");
+				}
+<xuchar>{uesccharfail} |
+<xuchar>{other} |
+<xuchar><<EOF>> {
+					SET_YYLLOC();
+					yyerror("missing or invalid Unicode escape character");
 				}
+
 <xd,xui>{xddouble}	{
 					addlitchar('"', yyscanner);
 				}
diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h
index 731a2bd..72c2a28 100644
--- a/src/include/parser/scanner.h
+++ b/src/include/parser/scanner.h
@@ -99,6 +99,7 @@ typedef struct core_yy_extra_type
 	int			literallen;		/* actual current string length */
 	int			literalalloc;	/* current allocated buffer size */
 
+	int			state_before_quote_stop;	/* start cond. before end quote */
 	int			xcdepth;		/* depth of nesting in slash-star comments */
 	char	   *dolqstart;		/* current $foo$ quote start string */

#12

John Naylor

john.naylor@2ndquadrant.com

over 6 years ago

In reply to: Tom Lane (#11)

2 attachment(s)

Re: benchmarking Flex practices

On Wed, Jul 10, 2019 at 3:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> John Naylor <john.naylor@2ndquadrant.com> writes:
> > [ v4 patches for trimming lexer table size ]
>
> I reviewed this and it looks pretty solid. One gripe I have is
> that I think it's best to limit backup-prevention tokens such as
> quotecontinuefail so that they match only exact prefixes of their
> "success" tokens. This seems clearer to me, and in at least some cases
> it can save a few flex states. The attached v5 patch does it like that
> and gets us down to 22331 states (from 23696). In some places it looks
> like you did that to avoid writing an explicit "{other}" match rule for
> an exclusive state, but I think it's better for readability and
> separation of concerns to go ahead and have those explicit rules
> (and it seems to make no difference table-size-wise).

Looks good to me.

> We still need to propagate these changes into the psql and ecpg lexers,
> but I assume you were waiting to agree on the core patch before touching
> those. If you're good with the changes I made here, have at it.

I just made a couple of additional cosmetic adjustments that made sense
when diffing against the other scanners. "make check-world" passes. Some
notes:

The pre-existing ecpg var "state_before" was a bit confusing when
combined with the new var "state_before_quote_stop", and the former is
also used with C-comments, so I decided to go with
"state_before_lit_start" and "state_before_lit_stop". Even though
comments aren't literals, it's less of a stretch than referring to
quotes. To keep things consistent, I went with the latter var in psql
and core.

To get the regression tests to pass, I had to add this:

 psql_scan_in_quote(PsqlScanState state)
 {
-	return state->start_state != INITIAL;
+	return state->start_state != INITIAL &&
+			state->start_state != xqs;
 }

...otherwise with parens we sometimes don't get the right prompt and
we get empty lines echoed. Adding xuend and xuchar here didn't seem to
make a difference. There might be something subtle I'm missing, so I
thought I'd mention it.
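
To make the effect of that change concrete, here is a stand-alone sketch. It is not the psql code: the enum and its values are made up, and the two functions only mirror the before and after versions of the return expression. My reading is that a lexer sitting in xqs has already consumed the closing quote, so the literal is complete unless a continuation follows.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Stand-ins for the Flex start conditions.  The names follow the patch,
 * but this enum and its numeric values are hypothetical.
 */
enum start_state
{
	ST_INITIAL,
	ST_XQ,						/* inside a standard quoted string */
	ST_XQS,						/* just past an end quote, looking ahead */
	ST_XUEND,
	ST_XUCHAR
};

/* Old test: any non-INITIAL state counts as "inside a quote". */
static bool
in_quote_old(enum start_state s)
{
	return s != ST_INITIAL;
}

/* New test: xqs is excluded, since the end quote has already been seen. */
static bool
in_quote_new(enum start_state s)
{
	assert(s >= ST_INITIAL && s <= ST_XUCHAR);

	return s != ST_INITIAL && s != ST_XQS;
}
```

With the old test, a line that ends right after a closing quote would still be reported as being inside a quote, which would fit the wrong prompt I saw. Note that xuend and xuchar still count as "in quote" in this sketch, matching what I tested.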

With the Unicode escape rules brought over, the diff to the ecpg
scanner is much cleaner now. The diffs for the C-comment rules were still
pretty messy in comparison, so I made an attempt to clean that up in
0002. It's a bit off-topic, but I thought I should offer it while it was
fresh in my head.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v6-0001-Reduce-the-number-of-states-in-the-core-scanner-t.patch (application/octet-stream)

From 5ea5886fb44e8bc85753400ea4b1375daf8b2d2d Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Fri, 12 Jul 2019 13:16:44 +0700
Subject: [PATCH v6 1/2] Reduce the number of states in the core scanner table

Previously, the core scanner had 37045 states, which required Flex
to use 32-bit types in the yy_transition array. Refactor the Flex
rules to reduce the number of states to 22331. With 16-bit types,
this shrinks the backend binary by about 200kB.

1. When Flex encounters a quote while inside any kind of quoted
string, it saves the current start condition and enters a new one in
order to detect possible string continuations.

2. Unify xusend and xuiend into a single start condition to detect
a possible UESCAPE. If one is found, enter a new start condition to
scan the escape character.

Sync psql and ECPG scanners to match.
---
 src/backend/parser/scan.l           | 265 ++++++++++++++++------------
 src/fe_utils/psqlscan.l             | 169 ++++++++++--------
 src/include/fe_utils/psqlscan_int.h |   1 +
 src/include/parser/scanner.h        |   1 +
 src/interfaces/ecpg/preproc/pgc.l   | 263 +++++++++++++++++++--------
 5 files changed, 436 insertions(+), 263 deletions(-)

diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index e1cae859e8..d2ccb438f6 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -168,12 +168,14 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
- *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
  *  <xus> quoted string with Unicode escapes
- *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
+ *  <xuend> end of a quoted string or identifier with Unicode escapes,
+ *    UESCAPE can follow
+ *  <xuchar> expecting escape character literal after UESCAPE
  *  <xeu> Unicode surrogate pair in extended quoted string
  *
  * Remember to add an <<EOF>> case whenever you add a new exclusive state!
@@ -185,12 +187,13 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
-%x xuiend
 %x xus
-%x xusend
+%x xuend
+%x xuchar
 %x xeu
 
 /*
@@ -231,19 +234,18 @@ special_whitespace		({space}+|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
+quote			'
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue	{whitespace_with_newline}{quote}
+
 /*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
  */
-quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+quotecontinuefail	{whitespace}*"-"?
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -304,10 +306,15 @@ xdstop			{dquote}
 xddouble		{dquote}{dquote}
 xdinside		[^"]+
 
-/* Unicode escapes */
-uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
+/* Optional UESCAPE after a quoted string or identifier with Unicode escapes */
+uescape			[uU][eE][sS][cC][aA][pP][eE]
 /* error rule to avoid backup */
-uescapefail		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
+uescapefail		[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
+
+/* escape character literal */
+uescchar		{quote}[^']{quote}
+/* error rule to avoid backup */
+uesccharfail	{quote}[^']|{quote}
 
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
@@ -315,10 +322,6 @@ xuistart		[uU]&{dquote}
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
 
-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1		{uescapefail}?
-xustop2		{uescape}
-
 /* error rule to avoid backup */
 xufailed		[uU]&
 
@@ -476,21 +479,10 @@ other			.
 					startlit();
 					addlitchar('b', yyscanner);
 				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return BCONST;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					addlit(yytext, yyleng, yyscanner);
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					/* ignore */
-				}
 <xb><<EOF>>		{ yyerror("unterminated bit string literal"); }
 
 {xhstart}		{
@@ -505,13 +497,6 @@ other			.
 					startlit();
 					addlitchar('x', yyscanner);
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return XCONST;
-				}
 <xh><<EOF>>		{ yyerror("unterminated hexadecimal string literal"); }
 
 {xnstart}		{
@@ -568,53 +553,71 @@ other			.
 					BEGIN(xus);
 					startlit();
 				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
+
+<xb,xh,xq,xe,xus>{quote} {
 					/*
-					 * check that the data remains valid if it might have been
-					 * made invalid by unescaping any chars.
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the end quote was in fact
+					 * the end of the string.  To reduce the lexer table size,
+					 * we use a single "xqs" state to do the lookahead for all
+					 * types of strings.
 					 */
-					if (yyextra->saw_non_ascii)
-						pg_verifymbstr(yyextra->literalbuf,
-									   yyextra->literallen,
-									   false);
-					yylval->str = litbufdup(yyscanner);
-					return SCONST;
-				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
-					/* xusend state looks for possible UESCAPE */
-					BEGIN(xusend);
+					yyextra->state_before_lit_stop = YYSTATE;
+					BEGIN(xqs);
 				}
-<xusend>{whitespace} {
-					/* stay in xusend state over whitespace */
+<xqs>{quotecontinue} {
+					/*
+					 * Found a quote continuation, so return to the in-quote
+					 * state and continue scanning the literal.
+					 */
+					BEGIN(yyextra->state_before_lit_stop);
 				}
-<xusend><<EOF>> |
-<xusend>{other} |
-<xusend>{xustop1} {
-					/* no UESCAPE after the quote, throw back everything */
+<xqs>{quotecontinuefail} |
+<xqs><<EOF>> |
+<xqs>{other}	{
+					/*
+					 * Failed to see a quote continuation.  Throw back
+					 * everything after the end quote, and handle the string
+					 * according to the state we were in previously.
+					 */
 					yyless(0);
-					BEGIN(INITIAL);
-					yylval->str = litbuf_udeescape('\\', yyscanner);
-					return SCONST;
-				}
-<xusend>{xustop2} {
-					/* found UESCAPE after the end quote */
-					BEGIN(INITIAL);
-					if (!check_uescapechar(yytext[yyleng - 2]))
+
+					switch (yyextra->state_before_lit_stop)
 					{
-						SET_YYLLOC();
-						ADVANCE_YYLLOC(yyleng - 2);
-						yyerror("invalid Unicode escape character");
+						case xb:
+							BEGIN(INITIAL);
+							yylval->str = litbufdup(yyscanner);
+							return BCONST;
+						case xh:
+							BEGIN(INITIAL);
+							yylval->str = litbufdup(yyscanner);
+							return XCONST;
+						case xq:
+							/* fallthrough */
+						case xe:
+							BEGIN(INITIAL);
+
+							/*
+							 * Check that the data remains valid if it
+							 * might have been made invalid by unescaping
+							 * any chars.
+							 */
+							if (yyextra->saw_non_ascii)
+								pg_verifymbstr(yyextra->literalbuf,
+											   yyextra->literallen,
+											   false);
+							yylval->str = litbufdup(yyscanner);
+							return SCONST;
+						case xus:
+							/* xuend state looks for possible UESCAPE */
+							BEGIN(xuend);
+							break;
+						default:
+							yyerror("unhandled previous state in xqs");
 					}
-					yylval->str = litbuf_udeescape(yytext[yyleng - 2],
-												   yyscanner);
-					return SCONST;
 				}
+
 <xq,xe,xus>{xqdouble} {
 					addlitchar('\'', yyscanner);
 				}
@@ -693,9 +696,6 @@ other			.
 					if (c == '\0' || IS_HIGHBIT_SET(c))
 						yyextra->saw_non_ascii = true;
 				}
-<xq,xe,xus>{quotecontinue} {
-					/* ignore */
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					addlitchar(yytext[0], yyscanner);
@@ -770,53 +770,88 @@ other			.
 					return IDENT;
 				}
 <xui>{dquote} {
-					yyless(1);
-					/* xuiend state looks for possible UESCAPE */
-					BEGIN(xuiend);
+					if (yyextra->literallen == 0)
+						yyerror("zero-length delimited identifier");
+
+					/* xuend state looks for possible UESCAPE */
+					yyextra->state_before_lit_stop = YYSTATE;
+					BEGIN(xuend);
 				}
-<xuiend>{whitespace} {
-					/* stay in xuiend state over whitespace */
+
+<xuend,xuchar>{whitespace} {
+					/* stay in xuend or xuchar state over whitespace */
 				}
-<xuiend><<EOF>> |
-<xuiend>{other} |
-<xuiend>{xustop1} {
+<xuend>{uescapefail} |
+<xuend><<EOF>> |
+<xuend>{other}	{
 					/* no UESCAPE after the quote, throw back everything */
-					char	   *ident;
-					int			identlen;
-
 					yyless(0);
 
-					BEGIN(INITIAL);
-					if (yyextra->literallen == 0)
-						yyerror("zero-length delimited identifier");
-					ident = litbuf_udeescape('\\', yyscanner);
-					identlen = strlen(ident);
-					if (identlen >= NAMEDATALEN)
-						truncate_identifier(ident, identlen, true);
-					yylval->str = ident;
-					return IDENT;
+					if (yyextra->state_before_lit_stop == xus)
+					{
+						BEGIN(INITIAL);
+						yylval->str = litbuf_udeescape('\\', yyscanner);
+						return SCONST;
+					}
+					else if (yyextra->state_before_lit_stop == xui)
+					{
+						char	   *ident;
+						int			identlen;
+
+						BEGIN(INITIAL);
+						ident = litbuf_udeescape('\\', yyscanner);
+						identlen = strlen(ident);
+						if (identlen >= NAMEDATALEN)
+							truncate_identifier(ident, identlen, true);
+						yylval->str = ident;
+						return IDENT;
+					}
+					else
+						yyerror("unhandled previous state in xuend");
 				}
-<xuiend>{xustop2}	{
+<xuend>{uescape} {
 					/* found UESCAPE after the end quote */
-					char	   *ident;
-					int			identlen;
-
-					BEGIN(INITIAL);
-					if (yyextra->literallen == 0)
-						yyerror("zero-length delimited identifier");
+					BEGIN(xuchar);
+				}
+<xuchar>{uescchar} {
+					/* found escape character literal after UESCAPE */
 					if (!check_uescapechar(yytext[yyleng - 2]))
 					{
 						SET_YYLLOC();
 						ADVANCE_YYLLOC(yyleng - 2);
 						yyerror("invalid Unicode escape character");
 					}
-					ident = litbuf_udeescape(yytext[yyleng - 2], yyscanner);
-					identlen = strlen(ident);
-					if (identlen >= NAMEDATALEN)
-						truncate_identifier(ident, identlen, true);
-					yylval->str = ident;
-					return IDENT;
+
+					if (yyextra->state_before_lit_stop == xus)
+					{
+						BEGIN(INITIAL);
+						yylval->str = litbuf_udeescape(yytext[yyleng - 2],
+													   yyscanner);
+						return SCONST;
+					}
+					else if (yyextra->state_before_lit_stop == xui)
+					{
+						char	   *ident;
+						int			identlen;
+
+						BEGIN(INITIAL);
+						ident = litbuf_udeescape(yytext[yyleng - 2], yyscanner);
+						identlen = strlen(ident);
+						if (identlen >= NAMEDATALEN)
+							truncate_identifier(ident, identlen, true);
+						yylval->str = ident;
+						return IDENT;
+					}
+					else
+						yyerror("unhandled previous state in xuchar");
 				}
+<xuchar>{uesccharfail} |
+<xuchar><<EOF>> |
+<xuchar>{other} {
+					SET_YYLLOC();
+					yyerror("missing or invalid Unicode escape character");
+				}
+
 <xd,xui>{xddouble}	{
 					addlitchar('"', yyscanner);
 				}
diff --git a/src/fe_utils/psqlscan.l b/src/fe_utils/psqlscan.l
index ce20936339..a66c0f4c6e 100644
--- a/src/fe_utils/psqlscan.l
+++ b/src/fe_utils/psqlscan.l
@@ -114,12 +114,14 @@ extern void psql_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
- *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
  *  <xus> quoted string with Unicode escapes
- *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
+ *  <xuend> end of a quoted string or identifier with Unicode escapes,
+ *    UESCAPE can follow
+ *  <xuchar> expecting escape character literal after UESCAPE
  *
  * Note: we intentionally don't mimic the backend's <xeu> state; we have
  * no need to distinguish it from <xe> state, and no good way to get out
@@ -132,12 +134,13 @@ extern void psql_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
-%x xuiend
 %x xus
-%x xusend
+%x xuend
+%x xuchar
 
 /*
  * In order to make the world safe for Windows and Mac clients as well as
@@ -177,19 +180,18 @@ special_whitespace		({space}+|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
+quote			'
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue	{whitespace_with_newline}{quote}
+
 /*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
  */
-quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+quotecontinuefail	{whitespace}*"-"?
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -250,10 +252,15 @@ xdstop			{dquote}
 xddouble		{dquote}{dquote}
 xdinside		[^"]+
 
-/* Unicode escapes */
-uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
+/* Optional UESCAPE after a quoted string or identifier with Unicode escapes */
+uescape			[uU][eE][sS][cC][aA][pP][eE]
 /* error rule to avoid backup */
-uescapefail		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
+uescapefail		[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
+
+/* escape character literal */
+uescchar		{quote}[^']{quote}
+/* error rule to avoid backup */
+uesccharfail	{quote}[^']|{quote}
 
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
@@ -261,10 +268,6 @@ xuistart		[uU]&{dquote}
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
 
-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1		{uescapefail}?
-xustop2		{uescape}
-
 /* error rule to avoid backup */
 xufailed		[uU]&
 
@@ -438,20 +441,10 @@ other			.
 					BEGIN(xb);
 					ECHO;
 				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					ECHO;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					ECHO;
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					ECHO;
-				}
 
 {xhstart}		{
 					/* Hexadecimal bit type.
@@ -463,12 +456,6 @@ other			.
 					BEGIN(xh);
 					ECHO;
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					ECHO;
-				}
 
 {xnstart}		{
 					yyless(1);	/* eat only 'n' this time */
@@ -490,32 +477,59 @@ other			.
 					BEGIN(xus);
 					ECHO;
 				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					ECHO;
-				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
-					BEGIN(xusend);
+
+<xb,xh,xq,xe,xus>{quote} {
+					/*
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the end quote was in fact
+					 * the end of the string.  To reduce the lexer table size,
+					 * we use a single "xqs" state to do the lookahead for all
+					 * types of strings.
+					 */
+					cur_state->state_before_lit_stop = YYSTATE;
+					BEGIN(xqs);
 					ECHO;
 				}
-<xusend>{whitespace} {
+<xqs>{quotecontinue} {
+					/*
+					 * Found a quote continuation, so return to the in-quote
+					 * state and continue scanning the literal.
+					 */
+					BEGIN(cur_state->state_before_lit_stop);
 					ECHO;
 				}
-<xusend>{other} |
-<xusend>{xustop1} {
+<xqs>{quotecontinuefail} |
+<xqs>{other}	{
+					/*
+					 * Failed to see a quote continuation.  Throw back
+					 * everything after the end quote, and enter start condition
+					 * according to the state we were in previously.
+					 */
 					yyless(0);
-					BEGIN(INITIAL);
-					ECHO;
-				}
-<xusend>{xustop2} {
-					BEGIN(INITIAL);
-					ECHO;
+
+					switch (cur_state->state_before_lit_stop)
+					{
+						case xb:
+							BEGIN(INITIAL);
+							break;
+						case xh:
+							BEGIN(INITIAL);
+							break;
+						case xq:
+							/* fallthrough */
+						case xe:
+							BEGIN(INITIAL);
+							break;
+						case xus:
+							/* xuend state looks for possible UESCAPE */
+							BEGIN(xuend);
+							break;
+						default:
+							fprintf(stderr, "unhandled previous state in xqs\n");
+					}
 				}
+
 <xq,xe,xus>{xqdouble} {
 					ECHO;
 				}
@@ -540,9 +554,6 @@ other			.
 <xe>{xehexesc}  {
 					ECHO;
 				}
-<xq,xe,xus>{quotecontinue} {
-					ECHO;
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					ECHO;
@@ -600,23 +611,39 @@ other			.
 					ECHO;
 				}
 <xui>{dquote} {
-					yyless(1);
-					BEGIN(xuiend);
+					/* xuend state looks for possible UESCAPE */
+					yyextra->state_before_lit_stop = YYSTATE;
+					BEGIN(xuend);
 					ECHO;
 				}
-<xuiend>{whitespace} {
+
+<xuend,xuchar>{whitespace} {
+					/* stay in xuend or xuchar state over whitespace */
 					ECHO;
 				}
-<xuiend>{other} |
-<xuiend>{xustop1} {
+<xuend>{uescapefail} |
+<xuend>{other}	{
+					/* no UESCAPE after the quote, throw back everything */
 					yyless(0);
 					BEGIN(INITIAL);
 					ECHO;
 				}
-<xuiend>{xustop2}	{
+<xuend>{uescape} {
+					/* found UESCAPE after the end quote */
+					BEGIN(xuchar);
+					ECHO;
+				}
+<xuchar>{uescchar} {
+					/* found escape character literal after UESCAPE */
 					BEGIN(INITIAL);
 					ECHO;
 				}
+<xuchar>{uesccharfail} |
+<xuchar>{other} {
+					BEGIN(INITIAL);
+					ECHO;
+				}
+
 <xd,xui>{xddouble}	{
 					ECHO;
 				}
@@ -1084,8 +1111,9 @@ psql_scan(PsqlScanState state,
 			switch (state->start_state)
 			{
 				case INITIAL:
-				case xuiend:	/* we treat these like INITIAL */
-				case xusend:
+				case xqs:		/* we treat these like INITIAL */
+				case xuend:
+				case xuchar:
 					if (state->paren_depth > 0)
 					{
 						result = PSCAN_INCOMPLETE;
@@ -1240,7 +1268,8 @@ psql_scan_reselect_sql_lexer(PsqlScanState state)
 bool
 psql_scan_in_quote(PsqlScanState state)
 {
-	return state->start_state != INITIAL;
+	return state->start_state != INITIAL &&
+			state->start_state != xqs;
 }
 
 /*
diff --git a/src/include/fe_utils/psqlscan_int.h b/src/include/fe_utils/psqlscan_int.h
index 2acb380078..00567c1b1e 100644
--- a/src/include/fe_utils/psqlscan_int.h
+++ b/src/include/fe_utils/psqlscan_int.h
@@ -110,6 +110,7 @@ typedef struct PsqlScanStateData
 	 * and updated with its finishing state on exit.
 	 */
 	int			start_state;	/* yylex's starting/finishing state */
+	int			state_before_lit_stop;	/* start cond. before end quote */
 	int			paren_depth;	/* depth of nesting in parentheses */
 	int			xcdepth;		/* depth of nesting in slash-star comments */
 	char	   *dolqstart;		/* current $foo$ quote start string */
diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h
index 731a2bd264..256c1570bf 100644
--- a/src/include/parser/scanner.h
+++ b/src/include/parser/scanner.h
@@ -99,6 +99,7 @@ typedef struct core_yy_extra_type
 	int			literallen;		/* actual current string length */
 	int			literalalloc;	/* current allocated buffer size */
 
+	int			state_before_lit_stop;	/* start cond. before end quote */
 	int			xcdepth;		/* depth of nesting in slash-star comments */
 	char	   *dolqstart;		/* current $foo$ quote start string */
 
diff --git a/src/interfaces/ecpg/preproc/pgc.l b/src/interfaces/ecpg/preproc/pgc.l
index 488c89b7f4..1eefbc05f6 100644
--- a/src/interfaces/ecpg/preproc/pgc.l
+++ b/src/interfaces/ecpg/preproc/pgc.l
@@ -6,6 +6,9 @@
  *
  * This is a modified version of src/backend/parser/scan.l
  *
+ * The ecpg scanner is not backup-free, so the fail rules are
+ * only here to simplify syncing this file with scan.l.
+ *
  *
  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -61,7 +64,10 @@ static bool isdefine(void);
 static bool isinformixdefine(void);
 
 char *token_start;
-static int state_before;
+
+/* vars to keep track of start conditions when scanning literals */
+static int state_before_lit_start;
+static int state_before_lit_stop;
 
 struct _yy_buffer
 {
@@ -112,14 +118,21 @@ static struct _if_value
  *  <xh> hexadecimal numeric string
  *  <xn> national character quoted strings
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xqc> single-quoted strings in C
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
  *  <xus> quoted string with Unicode escapes
+ *  <xuend> end of a quoted string or identifier with Unicode escapes,
+ *    UESCAPE can follow
+ *  <xuchar> expecting escape character literal after UESCAPE
  *  <xcond> condition of an EXEC SQL IFDEF construct
  *  <xskip> skipping the inactive part of an EXEC SQL IFDEF construct
  *
+ * Note: we intentionally don't mimic the backend's <xeu> state; we have
+ * no need to distinguish it from <xe> state.
+ *
  * Remember to add an <<EOF>> case whenever you add a new exclusive state!
  * The default one is probably not the right thing.
  */
@@ -132,11 +145,14 @@ static struct _if_value
 %x xh
 %x xn
 %x xq
+%x xqs
 %x xe
 %x xqc
 %x xdolq
 %x xui
 %x xus
+%x xuend
+%x xuchar
 %x xcond
 %x xskip
 
@@ -181,9 +197,17 @@ horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{whitespace}*)
 
 quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue	{whitespace_with_newline}{quote}
+
+/*
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
+ */
+quotecontinuefail	{whitespace}*"-"?
 
 /* Bit string
  */
@@ -237,19 +261,21 @@ xdstop			{dquote}
 xddouble		{dquote}{dquote}
 xdinside		[^"]+
 
-/* Unicode escapes */
-/* (The ecpg scanner is not backup-free, so the fail rules in scan.l are
- * not needed here, but could be added if desired.)
- */
-uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
+/* Optional UESCAPE after a quoted string or identifier with Unicode escapes */
+uescape			[uU][eE][sS][cC][aA][pP][eE]
+/* error rule to avoid backup */
+uescapefail		[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
+
+/* escape character literal */
+uescchar		{quote}[^']{quote}
+/* error rule to avoid backup */
+uesccharfail	{quote}[^']|{quote}
 
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
-xuistop			{dquote}({whitespace}*{uescape})?
 
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
-xusstop			{quote}({whitespace}*{uescape})?
 
 /* special stuff for C strings */
 xdcqq			\\\\
@@ -411,7 +437,7 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 
 {xcstart}		{
 					token_start = yytext;
-					state_before = YYSTATE;
+					state_before_lit_start = YYSTATE;
 					xcdepth = 0;
 					BEGIN(xcsql);
 					/* Put back any characters past slash-star; see above */
@@ -422,7 +448,7 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 
 <C>{xcstart}	{
 					token_start = yytext;
-					state_before = YYSTATE;
+					state_before_lit_start = YYSTATE;
 					xcdepth = 0;
 					BEGIN(xcc);
 					/* Put back any characters past slash-star; see above */
@@ -440,7 +466,7 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 					if (xcdepth <= 0)
 					{
 						ECHO;
-						BEGIN(state_before);
+						BEGIN(state_before_lit_start);
 						token_start = NULL;
 					}
 					else
@@ -451,7 +477,7 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 				}
 <xcc>{xcstop}	{
 					ECHO;
-					BEGIN(state_before);
+					BEGIN(state_before_lit_start);
 					token_start = NULL;
 				}
 
@@ -482,23 +508,10 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 				}
 } /* <SQL> */
 
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(SQL);
-					if (literalbuf[strspn(literalbuf, "01") + 1] != '\0')
-						mmerror(PARSE_ERROR, ET_ERROR, "invalid bit string literal");
-					base_yylval.str = mm_strdup(literalbuf);
-					return BCONST;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					addlit(yytext, yyleng);
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					/* ignore */
-				}
 <xb><<EOF>>		{ mmfatal(PARSE_ERROR, "unterminated bit string literal"); }
 
 <SQL>{xhstart}	{
@@ -507,19 +520,11 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 					startlit();
 					addlitchar('x');
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(SQL);
-					base_yylval.str = mm_strdup(literalbuf);
-					return XCONST;
-				}
-
 <xh><<EOF>>		{ mmfatal(PARSE_ERROR, "unterminated hexadecimal string literal"); }
 
 <C>{xqstart}	{
 					token_start = yytext;
-					state_before = YYSTATE;
+					state_before_lit_start = YYSTATE;
 					BEGIN(xqc);
 					startlit();
 				}
@@ -530,59 +535,98 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 					 * Transfer it as-is to the backend.
 					 */
 					token_start = yytext;
-					state_before = YYSTATE;
+					state_before_lit_start = YYSTATE;
 					BEGIN(xn);
 					startlit();
 				}
 
 {xqstart}		{
 					token_start = yytext;
-					state_before = YYSTATE;
+					state_before_lit_start = YYSTATE;
 					BEGIN(xq);
 					startlit();
 				}
 {xestart}		{
 					token_start = yytext;
-					state_before = YYSTATE;
+					state_before_lit_start = YYSTATE;
 					BEGIN(xe);
 					startlit();
 				}
 {xusstart}		{
 					token_start = yytext;
-					state_before = YYSTATE;
+					state_before_lit_start = YYSTATE;
 					BEGIN(xus);
 					startlit();
 					addlit(yytext, yyleng);
 				}
 } /* <SQL> */
 
-<xq,xqc>{quotestop} |
-<xq,xqc>{quotefail} {
-					yyless(1);
-					BEGIN(state_before);
-					base_yylval.str = mm_strdup(literalbuf);
-					return SCONST;
-				}
-<xe>{quotestop} |
-<xe>{quotefail} {
-					yyless(1);
-					BEGIN(state_before);
-					base_yylval.str = mm_strdup(literalbuf);
-					return ECONST;
+<xb,xh,xq,xqc,xe,xn,xus>{quote} {
+					/*
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the end quote was in fact
+					 * the end of the string.  To reduce the lexer table size,
+					 * we use a single "xqs" state to do the lookahead for all
+					 * types of strings.
+					 */
+					state_before_lit_stop = YYSTATE;
+					BEGIN(xqs);
 				}
-<xn>{quotestop} |
-<xn>{quotefail} {
-					yyless(1);
-					BEGIN(state_before);
-					base_yylval.str = mm_strdup(literalbuf);
-					return NCONST;
+<xqs>{quotecontinue} {
+					/*
+					 * Found a quote continuation, so return to the in-quote
+					 * state and continue scanning the literal.
+					 */
+					BEGIN(state_before_lit_stop);
 				}
-<xus>{xusstop} {
-					addlit(yytext, yyleng);
-					BEGIN(state_before);
-					base_yylval.str = mm_strdup(literalbuf);
-					return UCONST;
+<xqs>{quotecontinuefail} |
+<xqs><<EOF>> |
+<xqs>{other}	{
+					/*
+					 * Failed to see a quote continuation.  Throw back
+					 * everything after the end quote, and handle the string
+					 * according to the state we were in previously.
+					 */
+					yyless(0);
+
+					switch (state_before_lit_stop)
+					{
+						case xb:
+							BEGIN(state_before_lit_start);
+							if (literalbuf[strspn(literalbuf, "01") + 1] != '\0')
+								mmerror(PARSE_ERROR, ET_ERROR, "invalid bit string literal");
+							base_yylval.str = mm_strdup(literalbuf);
+							return BCONST;
+						case xh:
+							BEGIN(state_before_lit_start);
+							base_yylval.str = mm_strdup(literalbuf);
+							return XCONST;
+						case xq:
+							/* fallthrough */
+						case xqc:
+							BEGIN(state_before_lit_start);
+							base_yylval.str = mm_strdup(literalbuf);
+							return SCONST;
+						case xe:
+							BEGIN(state_before_lit_start);
+							base_yylval.str = mm_strdup(literalbuf);
+							return ECONST;
+						case xn:
+							BEGIN(state_before_lit_start);
+							base_yylval.str = mm_strdup(literalbuf);
+							return NCONST;
+						case xus:
+							/* xuend state looks for possible UESCAPE */
+							BEGIN(xuend);
+							/* add end quote for the backend */
+							addlitchar('\'');
+							break;
+						default:
+							mmfatal(PARSE_ERROR, "unhandled previous state in xqs\n");
+					}
 				}
+
 <xq,xe,xn,xus>{xqdouble}	{ addlitchar('\''); }
 <xqc>{xqcquote}	{
 					addlitchar('\\');
@@ -604,9 +648,6 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 <xe>{xehexesc}  {
 					addlit(yytext, yyleng);
 				}
-<xq,xqc,xe,xn,xus>{quotecontinue}	{
-					/* ignore */
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					addlitchar(yytext[0]);
@@ -666,12 +707,12 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 
 <SQL>{
 {xdstart}		{
-					state_before = YYSTATE;
+					state_before_lit_start = YYSTATE;
 					BEGIN(xd);
 					startlit();
 				}
 {xuistart}		{
-					state_before = YYSTATE;
+					state_before_lit_start = YYSTATE;
 					BEGIN(xui);
 					startlit();
 					addlit(yytext, yyleng);
@@ -679,7 +720,7 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 } /* <SQL> */
 
 <xd>{xdstop}	{
-					BEGIN(state_before);
+					BEGIN(state_before_lit_start);
 					if (literallen == 0)
 						mmerror(PARSE_ERROR, ET_ERROR, "zero-length delimited identifier");
 					/* The backend will truncate the identifier here. We do not as it does not change the result. */
@@ -687,19 +728,85 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 					return CSTRING;
 				}
 <xdc>{xdstop}	{
-					BEGIN(state_before);
+					BEGIN(state_before_lit_start);
 					base_yylval.str = mm_strdup(literalbuf);
 					return CSTRING;
 				}
-<xui>{xuistop}	{
-					BEGIN(state_before);
+
+<xui>{dquote}	{
 					if (literallen == 2) /* "U&" */
 						mmerror(PARSE_ERROR, ET_ERROR, "zero-length delimited identifier");
-					/* The backend will truncate the identifier here. We do not as it does not change the result. */
+					/* xuend state looks for possible UESCAPE */
+					state_before_lit_stop = YYSTATE;
+					BEGIN(xuend);
 					addlit(yytext, yyleng);
-					base_yylval.str = mm_strdup(literalbuf);
-					return UIDENT;
 				}
+
+<xuend,xuchar>{whitespace} {
+					/* stay in xuend or xuchar state over whitespace */
+				}
+<xuend>{uescapefail} |
+<xuend><<EOF>> |
+<xuend>{other}	{
+					/* no UESCAPE after the quote, throw back everything */
+					yyless(0);
+					BEGIN(state_before_lit_start);
+
+					if (state_before_lit_stop == xus)
+					{
+						base_yylval.str = mm_strdup(literalbuf);
+						return UCONST;
+					}
+					else if (state_before_lit_stop == xui)
+					{
+						/*
+						 * The backend will truncate the identifier here.
+						 * We do not as it does not change the result.
+						 */
+						base_yylval.str = mm_strdup(literalbuf);
+						return UIDENT;
+					}
+					else
+						mmfatal(PARSE_ERROR, "unhandled previous state in xuend");
+				}
+<xuend>{uescape} {
+					/* found UESCAPE after the end quote */
+					BEGIN(xuchar);
+					/* normalize whitespace */
+					addlitchar(' ');
+					addlit(yytext, yyleng);
+				}
+<xuchar>{uescchar} {
+					/* found escape character literal after UESCAPE */
+					BEGIN(state_before_lit_start);
+					/* normalize whitespace */
+					addlitchar(' ');
+					addlit(yytext, yyleng);
+
+					if (state_before_lit_stop == xus)
+					{
+						base_yylval.str = mm_strdup(literalbuf);
+						return UCONST;
+					}
+					else if (state_before_lit_stop == xui)
+					{
+						/*
+						 * The backend will truncate the identifier here.
+						 * We do not as it does not change the result.
+						 */
+						base_yylval.str = mm_strdup(literalbuf);
+						return UIDENT;
+					}
+					else
+						mmfatal(PARSE_ERROR, "unhandled previous state in xuchar");
+				}
+<xuchar>{uesccharfail} |
+<xuchar><<EOF>> |
+<xuchar>{other} {
+					BEGIN(state_before_lit_start);
+					mmerror(PARSE_ERROR, ET_ERROR, "missing or invalid Unicode escape character");
+				}
+
 <xd,xui>{xddouble}	{
 					addlitchar('"');
 				}
@@ -708,7 +815,7 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 				}
 <xd,xui><<EOF>>	{ mmfatal(PARSE_ERROR, "unterminated quoted identifier"); }
 <C>{xdstart}	{
-					state_before = YYSTATE;
+					state_before_lit_start = YYSTATE;
 					BEGIN(xdc);
 					startlit();
 				}
-- 
2.17.2 (Apple Git-113)

v6-0002-Merge-ECPG-scanner-states-for-C-style-comments.patch (application/octet-stream)

From 8b0e1fd8c797917de0b445d2497fe3d12e2ba03c Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Fri, 12 Jul 2019 14:05:17 +0700
Subject: [PATCH v6 2/2] Merge ECPG scanner states for C-style comments

This makes the ECPG scanner more similar to the backend scanner.
In passing, make some cosmetic adjustments to reduce the diffs
between the three core scanners.
---
 src/backend/parser/scan.l         |  2 +-
 src/fe_utils/psqlscan.l           |  2 +-
 src/interfaces/ecpg/preproc/pgc.l | 75 ++++++++++++++++---------------
 3 files changed, 40 insertions(+), 39 deletions(-)

diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index d2ccb438f6..4df96267cc 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -769,7 +769,7 @@ other			.
 					yylval->str = ident;
 					return IDENT;
 				}
-<xui>{dquote} {
+<xui>{dquote}	{
 					if (yyextra->literallen == 0)
 						yyerror("zero-length delimited identifier");
 
diff --git a/src/fe_utils/psqlscan.l b/src/fe_utils/psqlscan.l
index a66c0f4c6e..85d179c421 100644
--- a/src/fe_utils/psqlscan.l
+++ b/src/fe_utils/psqlscan.l
@@ -610,7 +610,7 @@ other			.
 					BEGIN(INITIAL);
 					ECHO;
 				}
-<xui>{dquote} {
+<xui>{dquote}	{
 					/* xuend state looks for possible UESCAPE */
 					yyextra->state_before_lit_stop = YYSTATE;
 					BEGIN(xuend);
diff --git a/src/interfaces/ecpg/preproc/pgc.l b/src/interfaces/ecpg/preproc/pgc.l
index 1eefbc05f6..a9a170df5b 100644
--- a/src/interfaces/ecpg/preproc/pgc.l
+++ b/src/interfaces/ecpg/preproc/pgc.l
@@ -111,8 +111,7 @@ static struct _if_value
  * and to eliminate parsing troubles for numeric strings.
  * Exclusive states:
  *  <xb> bit string literal
- *  <xcc> extended C-style comments in C
- *  <xcsql> extended C-style comments in SQL
+ *  <xc> extended C-style comments
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xdc> double-quoted strings in C
  *  <xh> hexadecimal numeric string
@@ -138,8 +137,7 @@ static struct _if_value
  */
 
 %x xb
-%x xcc
-%x xcsql
+%x xc
 %x xd
 %x xdc
 %x xh
@@ -434,54 +432,58 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 {whitespace}	{
 					/* ignore */
 				}
+} /* <SQL> */
 
+<C,SQL>{
 {xcstart}		{
 					token_start = yytext;
 					state_before_lit_start = YYSTATE;
 					xcdepth = 0;
-					BEGIN(xcsql);
+					BEGIN(xc);
 					/* Put back any characters past slash-star; see above */
 					yyless(2);
 					fputs("/*", yyout);
 				}
-} /* <SQL> */
+} /* <C,SQL> */
 
-<C>{xcstart}	{
-					token_start = yytext;
-					state_before_lit_start = YYSTATE;
-					xcdepth = 0;
-					BEGIN(xcc);
-					/* Put back any characters past slash-star; see above */
-					yyless(2);
-					fputs("/*", yyout);
-				}
-<xcc>{xcstart}	{ ECHO; }
-<xcsql>{xcstart}	{
-					xcdepth++;
-					/* Put back any characters past slash-star; see above */
-					yyless(2);
-					fputs("/_*", yyout);
-				}
-<xcsql>{xcstop}	{
-					if (xcdepth <= 0)
+<xc>{
+{xcstart}		{
+					if (state_before_lit_start == SQL)
 					{
-						ECHO;
-						BEGIN(state_before_lit_start);
-						token_start = NULL;
+						xcdepth++;
+						/* Put back any characters past slash-star; see above */
+						yyless(2);
+						fputs("/_*", yyout);
 					}
-					else
+					else if (state_before_lit_start == C)
 					{
-						xcdepth--;
-						fputs("*_/", yyout);
+						ECHO;
 					}
 				}
-<xcc>{xcstop}	{
-					ECHO;
-					BEGIN(state_before_lit_start);
-					token_start = NULL;
+
+{xcstop}		{
+					if (state_before_lit_start == SQL)
+					{
+						if (xcdepth <= 0)
+						{
+							ECHO;
+							BEGIN(SQL);
+							token_start = NULL;
+						}
+						else
+						{
+							xcdepth--;
+							fputs("*_/", yyout);
+						}
+					}
+					else if (state_before_lit_start == C)
+					{
+						ECHO;
+						BEGIN(C);
+						token_start = NULL;
+					}
 				}
 
-<xcc,xcsql>{
 {xcinside}		{
 					ECHO;
 				}
@@ -497,7 +499,7 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 <<EOF>>			{
 					mmfatal(PARSE_ERROR, "unterminated /* comment");
 				}
-} /* <xcc,xcsql> */
+} /* <xc> */
 
 <SQL>{
 {xbstart}		{
@@ -732,7 +734,6 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 					base_yylval.str = mm_strdup(literalbuf);
 					return CSTRING;
 				}
-
 <xui>{dquote}	{
 					if (literallen == 2) /* "U&" */
 						mmerror(PARSE_ERROR, ET_ERROR, "zero-length delimited identifier");
-- 
2.17.2 (Apple Git-113)

#13

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: John Naylor (#12)

Re: benchmarking Flex practices

John Naylor <john.naylor@2ndquadrant.com> writes:

> The pre-existing ecpg var "state_before" was a bit confusing when
> combined with the new var "state_before_quote_stop", and the former is
> also used with C-comments, so I decided to go with
> "state_before_lit_start" and "state_before_lit_stop". Even though
> comments aren't literals, it's less of a stretch than referring to
> quotes. To keep things consistent, I went with the latter var in psql
> and core.

Hm, what do you think of "state_before_str_stop" instead? It seems
to me that both "quote" and "lit" are pretty specific terms, so
maybe we need something a bit vaguer.

> To get the regression tests to pass, I had to add this:
> psql_scan_in_quote(PsqlScanState state)
> {
> -	return state->start_state != INITIAL;
> +	return state->start_state != INITIAL &&
> +			state->start_state != xqs;
> }
> ...otherwise with parens we sometimes don't get the right prompt and
> we get empty lines echoed. Adding xuend and xuchar here didn't seem to
> make a difference. There might be something subtle I'm missing, so I
> thought I'd mention it.

I think you would see a difference if the regression tests had any cases
with blank lines between a Unicode string/ident and the associated
UESCAPE and escape-character literal.
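
To make that concrete, here is a minimal sketch of what treating the two UESCAPE lookahead states the same way as xqs might look like. The enum constants and the struct are stand-ins invented for this example (the real start-condition values are generated by Flex), and whether psql should do this at all is still an open question in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the Flex-generated start conditions. */
enum sketch_state
{
	ST_INITIAL, ST_XQ, ST_XUS, ST_XUI, ST_XQS, ST_XUEND, ST_XUCHAR
};

/* Hypothetical cut-down version of the scan state struct. */
struct sketch_scan_state
{
	int			start_state;	/* lexer state when the last scan stopped */
};

/*
 * Report whether the scanner stopped inside a quoted construct.  The
 * lookahead states entered after a closing quote (xqs, xuend, xuchar)
 * are deliberately not counted as "inside a quote".
 */
static bool
sketch_scan_in_quote(const struct sketch_scan_state *state)
{
	assert(state != NULL);
	return state->start_state != ST_INITIAL &&
		state->start_state != ST_XQS &&
		state->start_state != ST_XUEND &&
		state->start_state != ST_XUCHAR;
}
```

With this variant, a scan that stops in the whitespace between the closing quote and UESCAPE would report "not in quote", which is the case the current regression tests do not exercise.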

While poking at that, I also came across this unhappiness:

regression=# select u&'foo' uescape 'bogus';
regression'#

that is, psql thinks we're still in a literal at this point. That's
because the uesccharfail rule eats "'b" and then we go to INITIAL
state, so that consuming the last "'" puts us back in a string state.
The backend would have thrown an error before parsing as far as the
incomplete literal, so it doesn't care (or probably not, anyway),
but that's not an option for psql.
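
The failure can be reproduced with a toy model of the relevant rules. The sketch below is hand-written C, not Flex output; it only models the xuchar rules from the psql patch plus a crude "quote toggles string state" rule for everything else, so the state names and the function are assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical, heavily reduced set of scanner states. */
enum toy_state { TOY_INITIAL, TOY_XQ, TOY_XUCHAR };

/*
 * Scan the text that follows UESCAPE, starting in the xuchar state, and
 * return the state the scanner is left in at end of input.
 */
static enum toy_state
toy_scan_after_uescape(const char *s)
{
	enum toy_state st = TOY_XUCHAR;
	size_t		i = 0;

	assert(s != NULL);
	while (s[i] != '\0')
	{
		switch (st)
		{
			case TOY_XUCHAR:
				if (isspace((unsigned char) s[i]))
					i++;		/* stay in xuchar over whitespace */
				else if (s[i] == '\'' && s[i + 1] != '\0' &&
						 s[i + 1] != '\'' && s[i + 2] == '\'')
				{
					i += 3;		/* uescchar: quote, one char, quote */
					st = TOY_INITIAL;
				}
				else if (s[i] == '\'' && s[i + 1] != '\0' &&
						 s[i + 1] != '\'')
				{
					i += 2;		/* uesccharfail: quote plus one char */
					st = TOY_INITIAL;
				}
				else
				{
					i += 1;		/* lone quote, or any other character */
					st = TOY_INITIAL;
				}
				break;
			case TOY_INITIAL:
				if (s[i] == '\'')
					st = TOY_XQ;	/* a quote opens a new string */
				i++;
				break;
			case TOY_XQ:
				if (s[i] == '\'')
					st = TOY_INITIAL;	/* simplified: ignores xqs lookahead */
				i++;
				break;
		}
	}
	return st;
}
```

Feeding it " 'bogus'" leaves the model in TOY_XQ: the fail rule consumes the opening quote and the "b", "ogus" is scanned in the initial state, and the trailing quote then opens a string that never closes. Feeding it " '!'" ends in TOY_INITIAL as expected.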

My first reaction as to how to fix this was to rip the xuend and
xuchar states out of psql, and let it just lex UESCAPE as an
identifier and the escape-character literal like any other literal.
psql doesn't need to account for the escape character's effect on
the meaning of the Unicode literal, so it doesn't have any need to
lex the sequence as one big token. I think the same is true of ecpg
though I've not looked really closely.

However, my second reaction was that maybe you were on to something
upthread when you speculated about postponing de-escaping of
Unicode literals into the grammar. If we did it like that then
we would not need to have this difference between the backend and
frontend lexers, and we'd not have to worry about what
psql_scan_in_quote should do about the whitespace before and after
UESCAPE, either.

So I'm feeling like maybe we should experiment to see what that
solution looks like, before we commit to going in this direction.
What do you think?

> With the Unicode escape rules brought over, the diff to the ecpg
> scanner is much cleaner now. The diff for the C-comment rules was still
> pretty messy in comparison, so I made an attempt to clean that up in
> 0002. A bit off-topic, but I thought I should offer that while it was
> fresh in my head.

I didn't really review this, but it looked like a fairly plausible
change of the same ilk, ie combine rules by adding memory of the
previous start state.
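
For readers skimming the 0002 diff, the idea of "one state plus memory of where we came from" can be sketched outside of Flex roughly as follows. This is a plain C model with invented names (sk_lexer, sk_comment_open, sk_comment_close); it only mirrors the branching in the patch, where SQL comments nest and C comments do not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states: the two host states and one shared comment state. */
enum sk_state { SK_C, SK_SQL, SK_XC };

struct sk_lexer
{
	enum sk_state state;		/* current start condition */
	enum sk_state state_before; /* state that was active at comment start */
	int			xcdepth;		/* nesting depth, used for SQL only */
};

/* Called when a slash-star is seen. */
static void
sk_comment_open(struct sk_lexer *lx)
{
	assert(lx != NULL);
	if (lx->state != SK_XC)
	{
		/* entering a comment: remember where to go back to */
		lx->state_before = lx->state;
		lx->xcdepth = 0;
		lx->state = SK_XC;
	}
	else if (lx->state_before == SK_SQL)
		lx->xcdepth++;			/* SQL comments nest */
	/* in C, a nested slash-star is just comment text */
}

/* Called when a star-slash is seen inside a comment. */
static void
sk_comment_close(struct sk_lexer *lx)
{
	assert(lx != NULL && lx->state == SK_XC);
	if (lx->state_before == SK_SQL && lx->xcdepth > 0)
		lx->xcdepth--;			/* closes an inner SQL comment */
	else
		lx->state = lx->state_before;	/* comment is finished */
}
```

The trade-off is the same one as with xqs: fewer start conditions and a smaller table, at the cost of a branch inside the rule actions.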

regards, tom lane

#14

John Naylor

john.naylor@2ndquadrant.com

over 6 years ago

In reply to: Tom Lane (#13)

1 attachment(s)

Re: benchmarking Flex practices

On Sun, Jul 21, 2019 at 3:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> John Naylor <john.naylor@2ndquadrant.com> writes:

>> The pre-existing ecpg var "state_before" was a bit confusing when
>> combined with the new var "state_before_quote_stop", and the former is
>> also used with C-comments, so I decided to go with
>> "state_before_lit_start" and "state_before_lit_stop". Even though
>> comments aren't literals, it's less of a stretch than referring to
>> quotes. To keep things consistent, I went with the latter var in psql
>> and core.

> Hm, what do you think of "state_before_str_stop" instead? It seems
> to me that both "quote" and "lit" are pretty specific terms, so
> maybe we need something a bit vaguer.

Sounds fine to me.

> While poking at that, I also came across this unhappiness:

> regression=# select u&'foo' uescape 'bogus';
> regression'#

> that is, psql thinks we're still in a literal at this point. That's
> because the uesccharfail rule eats "'b" and then we go to INITIAL
> state, so that consuming the last "'" puts us back in a string state.
> The backend would have thrown an error before parsing as far as the
> incomplete literal, so it doesn't care (or probably not, anyway),
> but that's not an option for psql.

> My first reaction as to how to fix this was to rip the xuend and
> xuchar states out of psql, and let it just lex UESCAPE as an
> identifier and the escape-character literal like any other literal.
> psql doesn't need to account for the escape character's effect on
> the meaning of the Unicode literal, so it doesn't have any need to
> lex the sequence as one big token. I think the same is true of ecpg
> though I've not looked really closely.

> However, my second reaction was that maybe you were on to something
> upthread when you speculated about postponing de-escaping of
> Unicode literals into the grammar. If we did it like that then
> we would not need to have this difference between the backend and
> frontend lexers, and we'd not have to worry about what
> psql_scan_in_quote should do about the whitespace before and after
> UESCAPE, either.

> So I'm feeling like maybe we should experiment to see what that
> solution looks like, before we commit to going in this direction.
> What do you think?

Given the above wrinkles, I thought it was worth trying. Attached is a
rough patch (don't mind the #include mess yet :-) ) that works like
this:

The lexer returns UCONST from xus and UIDENT from xui. The grammar has
rules that are effectively:

SCONST { do nothing}
| UCONST { esc char is backslash }
| UCONST UESCAPE SCONST { esc char is from $3 }

...where UESCAPE is now an unreserved keyword. To prevent shift-reduce
conflicts, I added UIDENT to the %nonassoc precedence list to match
IDENT, and for UESCAPE I added a %left precedence declaration. Maybe
there's a more principled way. I also added an unsigned char type to
the %union, but it worked fine on my compiler without it.
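
To illustrate what the three alternatives have to decide, here is a rough C sketch of picking the escape character. It is not the code in the attached patch: the function names are hypothetical, a NULL argument stands for "no UESCAPE clause", and the set of rejected characters follows the documented rule (no hex digit, plus sign, single or double quote, or whitespace), with isspace() standing in for the scanner's own whitespace test.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Is c acceptable as a Unicode escape character? */
static bool
sketch_valid_uescape_char(unsigned char c)
{
	return !(isxdigit(c) || c == '+' || c == '\'' || c == '"' ||
			 isspace(c));
}

/*
 * Choose the escape character for a Unicode literal.
 *
 * uescape_arg is the string following UESCAPE, or NULL when the clause is
 * absent.  On success, store the character in *esc and return true; return
 * false when the argument is not exactly one valid character.
 */
static bool
sketch_pick_escape(const char *uescape_arg, unsigned char *esc)
{
	assert(esc != NULL);

	if (uescape_arg == NULL)
	{
		*esc = '\\';			/* bare UCONST: default escape character */
		return true;
	}
	if (strlen(uescape_arg) != 1 ||
		!sketch_valid_uescape_char((unsigned char) uescape_arg[0]))
		return false;			/* e.g. 'bogus', 'a' or '+' */

	*esc = (unsigned char) uescape_arg[0];
	return true;
}
```

In the grammar, the false return would presumably turn into an error report at the position of the third token; the sketch leaves that part out.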

litbuf_udeescape() and check_uescapechar() were moved to gram.y. The
former had to be massaged to give error messages similar to HEAD. They're
not quite identical, but the position info is preserved. Some of the
functions I moved around don't seem to have any test coverage, so I
should eventually do some work in that regard.
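
For anyone unfamiliar with what the de-escaping step does, a much simplified sketch follows. It is not litbuf_udeescape(): the names are made up, the output is an array of code points rather than an encoded string, and surrogate pairs, code point range checks and error positions are all left out. It handles only the three documented forms: escape plus four hex digits, escape plus '+' plus six hex digits, and a doubled escape character.

```c
#include <assert.h>
#include <stddef.h>

/* Return the value of one hex digit, or -1 if c is not a hex digit. */
static int
sketch_hexval(unsigned char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Parse exactly n hex digits at s; return the value, or -1 on failure. */
static long
sketch_parse_hex(const char *s, int n)
{
	long		v = 0;

	for (int i = 0; i < n; i++)
	{
		int			d = sketch_hexval((unsigned char) s[i]);

		if (d < 0)
			return -1;			/* also stops at a premature end of string */
		v = v * 16 + d;
	}
	return v;
}

/*
 * Decode "in" using escape character "esc" into out[0..outcap-1].
 * Returns the number of code points written, or -1 on a malformed escape
 * or insufficient space.  Unescaped bytes are copied through as-is.
 */
static long
sketch_udeescape(const char *in, unsigned char esc,
				 unsigned long *out, size_t outcap)
{
	size_t		n = 0;

	assert(in != NULL && out != NULL && esc != '\0');
	while (*in != '\0')
	{
		long		v;

		if ((unsigned char) *in != esc)
			v = (unsigned char) *in++;	/* ordinary character */
		else if ((unsigned char) in[1] == esc)
		{
			v = esc;			/* doubled escape character */
			in += 2;
		}
		else if (in[1] == '+')
		{
			v = sketch_parse_hex(in + 2, 6);
			in += 8;
		}
		else
		{
			v = sketch_parse_hex(in + 1, 4);
			in += 5;
		}
		if (v < 0 || n >= outcap)
			return -1;
		out[n++] = (unsigned long) v;
	}
	return (long) n;
}
```

Decoding "!0041" with '!' as the escape character yields the single code point 0x41, which matches the "A" returned by the test query further down.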

Notes:

- Binary size is very close to v6. That is to say the grammar tables
grew by about the same amount the scanner table shrank, so the binary
is still about 200kB smaller than HEAD.
- Performance is very close to v6 with the information_schema and
pgbench-like queries with standard strings, which is to say also very
close to HEAD. When the latter was changed to use Unicode escapes,
however, it was about 15% slower than HEAD. That's a big regression
and I haven't tried to pinpoint why.
- psql was changed to follow suit. It doesn't think it's inside a
string with your too-long escape char above, and it removes all blank
lines from this query output:

$ cat >> test-uesc-lit.sql
SELECT

u&'!0041'

uescape

'!'

as col
;

On HEAD and v6 I get this:

$ ./inst/bin/psql -a -f test-uesc-lit.sql

SELECT
u&'!0041'

uescape
'!'
as col
;
 col
-----
 A
(1 row)

-The ecpg changes here are only the bare minimum from HEAD to get it
to compile, since I'm borrowing its additional token names (although
they mean slightly different things). After a bit of experimentation,
it's clear there's a bit more work needed to get it functional, and
it's not easy to debug, so I'm putting that off until we decide
whether this is the way forward.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v7-draft-handle-uescapes-in-parser.patch (application/octet-stream)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c97bb367f8..adc96a31ae 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -55,11 +55,14 @@
 #include "catalog/pg_trigger.h"
 #include "commands/defrem.h"
 #include "commands/trigger.h"
+#include "common/string.h"
+#include "mb/pg_wchar.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "parser/gramparse.h"
 #include "parser/parser.h"
 #include "parser/parse_expr.h"
+#include "parser/scansup.h"
 #include "storage/lmgr.h"
 #include "utils/date.h"
 #include "utils/datetime.h"
@@ -188,6 +191,8 @@ static void processCASbits(int cas_bits, int location, const char *constrType,
 			   bool *deferrable, bool *initdeferred, bool *not_valid,
 			   bool *no_inherit, core_yyscan_t yyscanner);
 static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
+static char * str_udeescape(unsigned char escape, char *str, int position, core_yyscan_t yyscanner);
+static bool check_uescapechar(unsigned char escape);
 
 %}
 
@@ -208,6 +213,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	const char			*keyword;
 
 	char				chr;
+	unsigned char		uchr;
 	bool				boolean;
 	JoinType			jtype;
 	DropBehavior		dbehavior;
@@ -528,6 +534,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %type <ival>	Iconst SignedIconst
 %type <str>		Sconst comment_text notify_payload
+%type <str>		Ident
+%type <uchr>	Uescape
 %type <str>		RoleId opt_boolean_or_string
 %type <list>	var_list
 %type <str>		ColId ColLabel var_name type_function_name param_name
@@ -599,7 +607,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
  * DOT_DOT is unused in the core SQL grammar, and so will always provoke
  * parse errors.  It is needed by PL/pgSQL.
  */
-%token <str>	IDENT FCONST SCONST BCONST XCONST Op
+%token <str>	IDENT UIDENT FCONST SCONST UCONST BCONST XCONST Op
 %token <ival>	ICONST PARAM
 %token			TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
 %token			LESS_EQUALS GREATER_EQUALS NOT_EQUALS
@@ -689,7 +697,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	TREAT TRIGGER TRIM TRUE_P
 	TRUNCATE TRUSTED TYPE_P TYPES_P
 
-	UNBOUNDED UNCOMMITTED UNENCRYPTED UNION UNIQUE UNKNOWN UNLISTEN UNLOGGED
+	UESCAPE UNBOUNDED UNCOMMITTED UNENCRYPTED UNION UNIQUE UNKNOWN UNLISTEN UNLOGGED
 	UNTIL UPDATE USER USING
 
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
@@ -757,7 +765,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
  * blame any funny behavior of UNBOUNDED on the SQL standard, though.
  */
 %nonassoc	UNBOUNDED		/* ideally should have same precedence as IDENT */
-%nonassoc	IDENT GENERATED NULL_P PARTITION RANGE ROWS GROUPS PRECEDING FOLLOWING CUBE ROLLUP
+%nonassoc	IDENT UIDENT GENERATED NULL_P PARTITION RANGE ROWS GROUPS PRECEDING FOLLOWING CUBE ROLLUP
+%left		UESCAPE
 %left		Op OPERATOR		/* multi-character ops and user-defined operators */
 %left		'+' '-'
 %left		'*' '/' '%'
@@ -1048,7 +1057,7 @@ AlterOptRoleElem:
 				{
 					$$ = makeDefElem("rolemembers", (Node *)$2, @1);
 				}
-			| IDENT
+			| Ident
 				{
 					/*
 					 * We handle identifiers that aren't parser keywords with
@@ -1602,14 +1611,14 @@ opt_boolean_or_string:
  * - an integer or floating point number
  * - a time interval per SQL99
  * ColId gives reduce/reduce errors against ConstInterval and LOCAL,
- * so use IDENT (meaning we reject anything that is a key word).
+ * so use Ident (meaning we reject anything that is a key word).
  */
 zone_value:
 			Sconst
 				{
 					$$ = makeStringConst($1, @1);
 				}
-			| IDENT
+			| Ident
 				{
 					$$ = makeStringConst($1, @1);
 				}
@@ -3871,7 +3880,7 @@ PartitionSpec: PARTITION BY part_strategy '(' part_params ')'
 				}
 		;
 
-part_strategy:	IDENT					{ $$ = $1; }
+part_strategy:	Ident					{ $$ = $1; }
 				| unreserved_keyword	{ $$ = pstrdup($1); }
 		;
 
@@ -5262,7 +5271,7 @@ RowSecurityOptionalToRole:
 		;
 
 RowSecurityDefaultPermissive:
-			AS IDENT
+			AS Ident
 				{
 					if (strcmp($2, "permissive") == 0)
 						$$ = true;
@@ -5831,11 +5840,11 @@ old_aggr_list: old_aggr_elem						{ $$ = list_make1($1); }
 		;
 
 /*
- * Must use IDENT here to avoid reduce/reduce conflicts; fortunately none of
+ * Must use Ident here to avoid reduce/reduce conflicts; fortunately none of
  * the item names needed in old aggregate definitions are likely to become
  * SQL keywords.
  */
-old_aggr_elem:  IDENT '=' def_arg
+old_aggr_elem:  Ident '=' def_arg
 				{
 					$$ = makeDefElem($1, (Node *)$3, @1);
 				}
@@ -10113,7 +10122,7 @@ createdb_opt_item:
 /*
  * Ideally we'd use ColId here, but that causes shift/reduce conflicts against
  * the ALTER DATABASE SET/RESET syntaxes.  Instead call out specific keywords
- * we need, and allow IDENT so that database option names don't have to be
+ * we need, and allow Ident so that database option names don't have to be
  * parser keywords unless they are already keywords for other reasons.
  *
  * XXX this coding technique is fragile since if someone makes a formerly
@@ -10122,7 +10131,7 @@ createdb_opt_item:
  * exercising every such option, at least at the syntax level.
  */
 createdb_opt_name:
-			IDENT							{ $$ = $1; }
+			Ident							{ $$ = $1; }
 			| CONNECTION LIMIT				{ $$ = pstrdup("connection_limit"); }
 			| ENCODING						{ $$ = pstrdup($1); }
 			| LOCATION						{ $$ = pstrdup($1); }
@@ -12424,7 +12433,7 @@ xmltable_column_option_list:
 		;
 
 xmltable_column_option_el:
-			IDENT b_expr
+			Ident b_expr
 				{ $$ = makeDefElem($1, $2, @1); }
 			| DEFAULT b_expr
 				{ $$ = makeDefElem("default", $2, @1); }
@@ -14412,7 +14421,7 @@ extract_list:
  * - thomas 2001-04-12
  */
 extract_arg:
-			IDENT									{ $$ = $1; }
+			Ident									{ $$ = $1; }
 			| YEAR_P								{ $$ = "year"; }
 			| MONTH_P								{ $$ = "month"; }
 			| DAY_P									{ $$ = "day"; }
@@ -14655,7 +14664,7 @@ target_el:	a_expr AS ColLabel
 			 * as an infix expression, which we accomplish by assigning
 			 * IDENT a precedence higher than POSTFIXOP.
 			 */
-			| a_expr IDENT
+			| a_expr Ident
 				{
 					$$ = makeNode(ResTarget);
 					$$->name = $2;
@@ -14874,13 +14883,69 @@ AexprConst: Iconst
 		;
 
 Iconst:		ICONST									{ $$ = $1; };
-Sconst:		SCONST									{ $$ = $1; };
+Sconst:		SCONST
+				{
+					$$ = $1;
+				}
+			| UCONST
+				{
+					$$ = str_udeescape('\\', $1, @1, yyscanner);
+				}
+			| UCONST Uescape
+				{
+					$$ = str_udeescape($2, $1, @1, yyscanner);
+				}
+		;
 
 SignedIconst: Iconst								{ $$ = $1; }
 			| '+' Iconst							{ $$ = + $2; }
 			| '-' Iconst							{ $$ = - $2; }
 		;
 
+Ident:		IDENT
+				{
+					$$ = $1;
+				}
+			| UIDENT
+				{
+					char 	   *ident;
+					int			identlen;
+
+					ident = str_udeescape('\\', $1, @1, yyscanner);
+					identlen = strlen(ident);
+					if (identlen >= NAMEDATALEN)
+						truncate_identifier(ident, identlen, true);
+					$$ = ident;
+				}
+			| UIDENT Uescape
+				{
+					char 	   *ident;
+					int			identlen;
+
+					ident = str_udeescape($2, $1, @1, yyscanner);
+					identlen = strlen(ident);
+					if (identlen >= NAMEDATALEN)
+						truncate_identifier(ident, identlen, true);
+					$$ = ident;
+				}
+		;
+
+Uescape:	UESCAPE SCONST
+				{
+					int esc_length = strlen($2);
+					unsigned char escape = $2[0];
+
+					if (esc_length != 1 ||
+						!check_uescapechar(escape))
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("invalid Unicode escape character \"%s\"", $2),
+								 parser_errposition(@2 + 1)));
+
+					$$ = escape;
+				}
+		;
+
 /* Role specifications */
 RoleId:		RoleSpec
 				{
@@ -14971,21 +15036,21 @@ role_list:	RoleSpec
 
 /* Column identifier --- names that can be column, table, etc names.
  */
-ColId:		IDENT									{ $$ = $1; }
+ColId:		Ident									{ $$ = $1; }
 			| unreserved_keyword					{ $$ = pstrdup($1); }
 			| col_name_keyword						{ $$ = pstrdup($1); }
 		;
 
 /* Type/function identifier --- names that can be type or function names.
  */
-type_function_name:	IDENT							{ $$ = $1; }
+type_function_name:	Ident							{ $$ = $1; }
 			| unreserved_keyword					{ $$ = pstrdup($1); }
 			| type_func_name_keyword				{ $$ = pstrdup($1); }
 		;
 
 /* Any not-fully-reserved word --- these names can be, eg, role names.
  */
-NonReservedWord:	IDENT							{ $$ = $1; }
+NonReservedWord:	Ident							{ $$ = $1; }
 			| unreserved_keyword					{ $$ = pstrdup($1); }
 			| col_name_keyword						{ $$ = pstrdup($1); }
 			| type_func_name_keyword				{ $$ = pstrdup($1); }
@@ -14994,7 +15059,7 @@ NonReservedWord:	IDENT							{ $$ = $1; }
 /* Column label --- allowed labels in "AS" clauses.
  * This presently includes *all* Postgres keywords.
  */
-ColLabel:	IDENT									{ $$ = $1; }
+ColLabel:	Ident									{ $$ = $1; }
 			| unreserved_keyword					{ $$ = pstrdup($1); }
 			| col_name_keyword						{ $$ = pstrdup($1); }
 			| type_func_name_keyword				{ $$ = pstrdup($1); }
@@ -15282,6 +15347,7 @@ unreserved_keyword:
 			| TRUSTED
 			| TYPE_P
 			| TYPES_P
+			| UESCAPE
 			| UNBOUNDED
 			| UNCOMMITTED
 			| UNENCRYPTED
@@ -16351,3 +16417,161 @@ parser_init(base_yy_extra_type *yyext)
 {
 	yyext->parsetree = NIL;		/* in case grammar forgets to set it */
 }
+
+/* handle unicode escapes */
+static char *
+str_udeescape(unsigned char escape, char *str, int position,
+				core_yyscan_t yyscanner)
+{
+	char	   *new,
+			   *in,
+			   *out;
+	int			str_length;
+	pg_wchar	pair_first = 0;
+
+	str_length = strlen(str);
+
+	/*
+	 * This relies on the subtle assumption that a UTF-8 expansion cannot be
+	 * longer than its escaped representation.
+	 */
+	new = palloc(str_length + 1);
+
+	in = str;
+	out = new;
+	while (*in)
+	{
+		if (in[0] == escape)
+		{
+			if (in[1] == escape)
+			{
+				if (pair_first)
+					goto invalid_pair;
+				*out++ = escape;
+				in += 2;
+			}
+			else if (isxdigit((unsigned char) in[1]) &&
+					 isxdigit((unsigned char) in[2]) &&
+					 isxdigit((unsigned char) in[3]) &&
+					 isxdigit((unsigned char) in[4]))
+			{
+				pg_wchar	unicode;
+
+				unicode = (hexval(in[1]) << 12) +
+					(hexval(in[2]) << 8) +
+					(hexval(in[3]) << 4) +
+					hexval(in[4]);
+				check_unicode_value(unicode, in, yyscanner);
+				if (pair_first)
+				{
+					if (is_utf16_surrogate_second(unicode))
+					{
+						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+						pair_first = 0;
+					}
+					else
+						goto invalid_pair;
+				}
+				else if (is_utf16_surrogate_second(unicode))
+					goto invalid_pair;
+
+				if (is_utf16_surrogate_first(unicode))
+					pair_first = unicode;
+				else
+				{
+					unicode_to_utf8(unicode, (unsigned char *) out);
+					out += pg_mblen(out);
+				}
+				in += 5;
+			}
+			else if (in[1] == '+' &&
+					 isxdigit((unsigned char) in[2]) &&
+					 isxdigit((unsigned char) in[3]) &&
+					 isxdigit((unsigned char) in[4]) &&
+					 isxdigit((unsigned char) in[5]) &&
+					 isxdigit((unsigned char) in[6]) &&
+					 isxdigit((unsigned char) in[7]))
+			{
+				pg_wchar	unicode;
+
+				unicode = (hexval(in[2]) << 20) +
+					(hexval(in[3]) << 16) +
+					(hexval(in[4]) << 12) +
+					(hexval(in[5]) << 8) +
+					(hexval(in[6]) << 4) +
+					hexval(in[7]);
+				check_unicode_value(unicode, in, yyscanner);
+				if (pair_first)
+				{
+					if (is_utf16_surrogate_second(unicode))
+					{
+						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+						pair_first = 0;
+					}
+					else
+						goto invalid_pair;
+				}
+				else if (is_utf16_surrogate_second(unicode))
+					goto invalid_pair;
+
+				if (is_utf16_surrogate_first(unicode))
+					pair_first = unicode;
+				else
+				{
+					unicode_to_utf8(unicode, (unsigned char *) out);
+					out += pg_mblen(out);
+				}
+				in += 8;
+			}
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("invalid Unicode escape value"),
+						 parser_errposition(position + in - str + 3))); /* 3 for U&" */
+		}
+		else
+		{
+			if (pair_first)
+				goto invalid_pair;
+
+			*out++ = *in++;
+		}
+	}
+
+	/* unfinished surrogate pair? */
+	if (pair_first)
+		goto invalid_pair;
+
+	*out = '\0';
+
+	/*
+	 * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII
+	 * codes; but it's probably not worth the trouble, since this isn't likely
+	 * to be a performance-critical path.
+	 */
+	pg_verifymbstr(new, out - new, false);
+	return new;
+
+invalid_pair:
+	ereport(ERROR,
+			(errcode(ERRCODE_SYNTAX_ERROR),
+			 errmsg("invalid Unicode surrogate pair"),
+			 parser_errposition(position + in - str + 3))); /* 3 for U&" */
+
+}
+
+/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
+static bool
+check_uescapechar(unsigned char escape)
+{
+	if (isxdigit(escape)
+		|| escape == '+'
+		|| escape == '\''
+		|| escape == '"'
+		|| scanner_isspace(escape))
+	{
+		return false;
+	}
+	else
+		return true;
+}
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index e1cae859e8..5d6996739f 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -110,14 +110,9 @@ const uint16 ScanKeywordTokens[] = {
 static void addlit(char *ytext, int yleng, core_yyscan_t yyscanner);
 static void addlitchar(unsigned char ychar, core_yyscan_t yyscanner);
 static char *litbufdup(core_yyscan_t yyscanner);
-static char *litbuf_udeescape(unsigned char escape, core_yyscan_t yyscanner);
 static unsigned char unescape_single_char(unsigned char c, core_yyscan_t yyscanner);
 static int	process_integer_literal(const char *token, YYSTYPE *lval);
-static bool is_utf16_surrogate_first(pg_wchar c);
-static bool is_utf16_surrogate_second(pg_wchar c);
-static pg_wchar surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second);
 static void addunicode(pg_wchar c, yyscan_t yyscanner);
-static bool check_uescapechar(unsigned char escape);
 
 #define yyerror(msg)  scanner_yyerror(msg, yyscanner)
 
@@ -168,12 +163,11 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
- *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
  *  <xus> quoted string with Unicode escapes
- *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
  *  <xeu> Unicode surrogate pair in extended quoted string
  *
  * Remember to add an <<EOF>> case whenever you add a new exclusive state!
@@ -185,12 +179,11 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
-%x xuiend
 %x xus
-%x xusend
 %x xeu
 
 /*
@@ -231,19 +224,18 @@ special_whitespace		({space}+|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
+quote			'
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue	{whitespace_with_newline}{quote}
+
 /*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
  */
-quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+quotecontinuefail	{whitespace}*"-"?
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -304,21 +296,12 @@ xdstop			{dquote}
 xddouble		{dquote}{dquote}
 xdinside		[^"]+
 
-/* Unicode escapes */
-uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
-/* error rule to avoid backup */
-uescapefail		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
-
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
 
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
 
-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1		{uescapefail}?
-xustop2		{uescape}
-
 /* error rule to avoid backup */
 xufailed		[uU]&
 
@@ -476,21 +459,10 @@ other			.
 					startlit();
 					addlitchar('b', yyscanner);
 				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return BCONST;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					addlit(yytext, yyleng, yyscanner);
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					/* ignore */
-				}
 <xb><<EOF>>		{ yyerror("unterminated bit string literal"); }
 
 {xhstart}		{
@@ -505,13 +477,6 @@ other			.
 					startlit();
 					addlitchar('x', yyscanner);
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return XCONST;
-				}
 <xh><<EOF>>		{ yyerror("unterminated hexadecimal string literal"); }
 
 {xnstart}		{
@@ -568,53 +533,67 @@ other			.
 					BEGIN(xus);
 					startlit();
 				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
+
+<xb,xh,xq,xe,xus>{quote} {
 					/*
-					 * check that the data remains valid if it might have been
-					 * made invalid by unescaping any chars.
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the end quote was in fact
+					 * the end of the string.  To reduce the lexer table size,
+					 * we use a single "xqs" state to do the lookahead for all
+					 * types of strings.
 					 */
-					if (yyextra->saw_non_ascii)
-						pg_verifymbstr(yyextra->literalbuf,
-									   yyextra->literallen,
-									   false);
-					yylval->str = litbufdup(yyscanner);
-					return SCONST;
-				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
-					/* xusend state looks for possible UESCAPE */
-					BEGIN(xusend);
+					yyextra->state_before_str_stop = YYSTATE;
+					BEGIN(xqs);
 				}
-<xusend>{whitespace} {
-					/* stay in xusend state over whitespace */
+<xqs>{quotecontinue} {
+					/*
+					 * Found a quote continuation, so return to the in-quote
+					 * state and continue scanning the literal.
+					 */
+					BEGIN(yyextra->state_before_str_stop);
 				}
-<xusend><<EOF>> |
-<xusend>{other} |
-<xusend>{xustop1} {
-					/* no UESCAPE after the quote, throw back everything */
+<xqs>{quotecontinuefail} |
+<xqs><<EOF>> |
+<xqs>{other}	{
+					/*
+					 * Failed to see a quote continuation.  Throw back
+					 * everything after the end quote, and handle the string
+					 * according to the state we were in previously.
+					 */
 					yyless(0);
 					BEGIN(INITIAL);
-					yylval->str = litbuf_udeescape('\\', yyscanner);
-					return SCONST;
-				}
-<xusend>{xustop2} {
-					/* found UESCAPE after the end quote */
-					BEGIN(INITIAL);
-					if (!check_uescapechar(yytext[yyleng - 2]))
+
+					switch (yyextra->state_before_str_stop)
 					{
-						SET_YYLLOC();
-						ADVANCE_YYLLOC(yyleng - 2);
-						yyerror("invalid Unicode escape character");
+						case xb:
+							yylval->str = litbufdup(yyscanner);
+							return BCONST;
+						case xh:
+							yylval->str = litbufdup(yyscanner);
+							return XCONST;
+						case xq:
+							/* fallthrough */
+						case xe:
+							/*
+							 * Check that the data remains valid if it
+							 * might have been made invalid by unescaping
+							 * any chars.
+							 */
+							if (yyextra->saw_non_ascii)
+								pg_verifymbstr(yyextra->literalbuf,
+											   yyextra->literallen,
+											   false);
+							yylval->str = litbufdup(yyscanner);
+							return SCONST;
+						case xus:
+							yylval->str = litbufdup(yyscanner);
+							return UCONST;
+						default:
+							yyerror("unhandled previous state in xqs");
 					}
-					yylval->str = litbuf_udeescape(yytext[yyleng - 2],
-												   yyscanner);
-					return SCONST;
 				}
+
 <xq,xe,xus>{xqdouble} {
 					addlitchar('\'', yyscanner);
 				}
@@ -693,9 +672,6 @@ other			.
 					if (c == '\0' || IS_HIGHBIT_SET(c))
 						yyextra->saw_non_ascii = true;
 				}
-<xq,xe,xus>{quotecontinue} {
-					/* ignore */
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					addlitchar(yytext[0], yyscanner);
@@ -770,53 +746,14 @@ other			.
 					return IDENT;
 				}
 <xui>{dquote} {
-					yyless(1);
-					/* xuiend state looks for possible UESCAPE */
-					BEGIN(xuiend);
-				}
-<xuiend>{whitespace} {
-					/* stay in xuiend state over whitespace */
-				}
-<xuiend><<EOF>> |
-<xuiend>{other} |
-<xuiend>{xustop1} {
-					/* no UESCAPE after the quote, throw back everything */
-					char	   *ident;
-					int			identlen;
-
-					yyless(0);
-
-					BEGIN(INITIAL);
 					if (yyextra->literallen == 0)
 						yyerror("zero-length delimited identifier");
-					ident = litbuf_udeescape('\\', yyscanner);
-					identlen = strlen(ident);
-					if (identlen >= NAMEDATALEN)
-						truncate_identifier(ident, identlen, true);
-					yylval->str = ident;
-					return IDENT;
-				}
-<xuiend>{xustop2}	{
-					/* found UESCAPE after the end quote */
-					char	   *ident;
-					int			identlen;
 
 					BEGIN(INITIAL);
-					if (yyextra->literallen == 0)
-						yyerror("zero-length delimited identifier");
-					if (!check_uescapechar(yytext[yyleng - 2]))
-					{
-						SET_YYLLOC();
-						ADVANCE_YYLLOC(yyleng - 2);
-						yyerror("invalid Unicode escape character");
-					}
-					ident = litbuf_udeescape(yytext[yyleng - 2], yyscanner);
-					identlen = strlen(ident);
-					if (identlen >= NAMEDATALEN)
-						truncate_identifier(ident, identlen, true);
-					yylval->str = ident;
-					return IDENT;
+					yylval->str = litbufdup(yyscanner);
+					return UIDENT;
 				}
+
 <xd,xui>{xddouble}	{
 					addlitchar('"', yyscanner);
 				}
@@ -1288,7 +1225,7 @@ process_integer_literal(const char *token, YYSTYPE *lval)
 	return ICONST;
 }
 
-static unsigned int
+extern unsigned int
 hexval(unsigned char c)
 {
 	if (c >= '0' && c <= '9')
@@ -1301,7 +1238,7 @@ hexval(unsigned char c)
 	return 0;					/* not reached */
 }
 
-static void
+extern void
 check_unicode_value(pg_wchar c, char *loc, core_yyscan_t yyscanner)
 {
 	if (GetDatabaseEncoding() == PG_UTF8)
@@ -1314,19 +1251,19 @@ check_unicode_value(pg_wchar c, char *loc, core_yyscan_t yyscanner)
 	}
 }
 
-static bool
+extern bool
 is_utf16_surrogate_first(pg_wchar c)
 {
 	return (c >= 0xD800 && c <= 0xDBFF);
 }
 
-static bool
+extern bool
 is_utf16_surrogate_second(pg_wchar c)
 {
 	return (c >= 0xDC00 && c <= 0xDFFF);
 }
 
-static pg_wchar
+extern pg_wchar
 surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second)
 {
 	return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
@@ -1349,172 +1286,6 @@ addunicode(pg_wchar c, core_yyscan_t yyscanner)
 	addlit(buf, pg_mblen(buf), yyscanner);
 }
 
-/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
-static bool
-check_uescapechar(unsigned char escape)
-{
-	if (isxdigit(escape)
-		|| escape == '+'
-		|| escape == '\''
-		|| escape == '"'
-		|| scanner_isspace(escape))
-	{
-		return false;
-	}
-	else
-		return true;
-}
-
-/* like litbufdup, but handle unicode escapes */
-static char *
-litbuf_udeescape(unsigned char escape, core_yyscan_t yyscanner)
-{
-	char	   *new;
-	char	   *litbuf,
-			   *in,
-			   *out;
-	pg_wchar	pair_first = 0;
-
-	/* Make literalbuf null-terminated to simplify the scanning loop */
-	litbuf = yyextra->literalbuf;
-	litbuf[yyextra->literallen] = '\0';
-
-	/*
-	 * This relies on the subtle assumption that a UTF-8 expansion cannot be
-	 * longer than its escaped representation.
-	 */
-	new = palloc(yyextra->literallen + 1);
-
-	in = litbuf;
-	out = new;
-	while (*in)
-	{
-		if (in[0] == escape)
-		{
-			if (in[1] == escape)
-			{
-				if (pair_first)
-				{
-					ADVANCE_YYLLOC(in - litbuf + 3);	/* 3 for U&" */
-					yyerror("invalid Unicode surrogate pair");
-				}
-				*out++ = escape;
-				in += 2;
-			}
-			else if (isxdigit((unsigned char) in[1]) &&
-					 isxdigit((unsigned char) in[2]) &&
-					 isxdigit((unsigned char) in[3]) &&
-					 isxdigit((unsigned char) in[4]))
-			{
-				pg_wchar	unicode;
-
-				unicode = (hexval(in[1]) << 12) +
-					(hexval(in[2]) << 8) +
-					(hexval(in[3]) << 4) +
-					hexval(in[4]);
-				check_unicode_value(unicode, in, yyscanner);
-				if (pair_first)
-				{
-					if (is_utf16_surrogate_second(unicode))
-					{
-						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
-						pair_first = 0;
-					}
-					else
-					{
-						ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-						yyerror("invalid Unicode surrogate pair");
-					}
-				}
-				else if (is_utf16_surrogate_second(unicode))
-					yyerror("invalid Unicode surrogate pair");
-
-				if (is_utf16_surrogate_first(unicode))
-					pair_first = unicode;
-				else
-				{
-					unicode_to_utf8(unicode, (unsigned char *) out);
-					out += pg_mblen(out);
-				}
-				in += 5;
-			}
-			else if (in[1] == '+' &&
-					 isxdigit((unsigned char) in[2]) &&
-					 isxdigit((unsigned char) in[3]) &&
-					 isxdigit((unsigned char) in[4]) &&
-					 isxdigit((unsigned char) in[5]) &&
-					 isxdigit((unsigned char) in[6]) &&
-					 isxdigit((unsigned char) in[7]))
-			{
-				pg_wchar	unicode;
-
-				unicode = (hexval(in[2]) << 20) +
-					(hexval(in[3]) << 16) +
-					(hexval(in[4]) << 12) +
-					(hexval(in[5]) << 8) +
-					(hexval(in[6]) << 4) +
-					hexval(in[7]);
-				check_unicode_value(unicode, in, yyscanner);
-				if (pair_first)
-				{
-					if (is_utf16_surrogate_second(unicode))
-					{
-						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
-						pair_first = 0;
-					}
-					else
-					{
-						ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-						yyerror("invalid Unicode surrogate pair");
-					}
-				}
-				else if (is_utf16_surrogate_second(unicode))
-					yyerror("invalid Unicode surrogate pair");
-
-				if (is_utf16_surrogate_first(unicode))
-					pair_first = unicode;
-				else
-				{
-					unicode_to_utf8(unicode, (unsigned char *) out);
-					out += pg_mblen(out);
-				}
-				in += 8;
-			}
-			else
-			{
-				ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-				yyerror("invalid Unicode escape value");
-			}
-		}
-		else
-		{
-			if (pair_first)
-			{
-				ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-				yyerror("invalid Unicode surrogate pair");
-			}
-			*out++ = *in++;
-		}
-	}
-
-	/* unfinished surrogate pair? */
-	if (pair_first)
-	{
-		ADVANCE_YYLLOC(in - litbuf + 3);				/* 3 for U&" */
-		yyerror("invalid Unicode surrogate pair");
-	}
-
-	*out = '\0';
-
-	/*
-	 * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII
-	 * codes; but it's probably not worth the trouble, since this isn't likely
-	 * to be a performance-critical path.
-	 */
-	pg_verifymbstr(new, out - new, false);
-	return new;
-}
-
 static unsigned char
 unescape_single_char(unsigned char c, core_yyscan_t yyscanner)
 {
diff --git a/src/fe_utils/psqlscan.l b/src/fe_utils/psqlscan.l
index ce20936339..eba7490078 100644
--- a/src/fe_utils/psqlscan.l
+++ b/src/fe_utils/psqlscan.l
@@ -114,12 +114,11 @@ extern void psql_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
- *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
  *  <xus> quoted string with Unicode escapes
- *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
  *
  * Note: we intentionally don't mimic the backend's <xeu> state; we have
  * no need to distinguish it from <xe> state, and no good way to get out
@@ -132,12 +131,11 @@ extern void psql_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
-%x xuiend
 %x xus
-%x xusend
 
 /*
  * In order to make the world safe for Windows and Mac clients as well as
@@ -177,19 +175,18 @@ special_whitespace		({space}+|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
+quote			'
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue	{whitespace_with_newline}{quote}
+
 /*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
  */
-quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+quotecontinuefail	{whitespace}*"-"?
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -250,21 +247,12 @@ xdstop			{dquote}
 xddouble		{dquote}{dquote}
 xdinside		[^"]+
 
-/* Unicode escapes */
-uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
-/* error rule to avoid backup */
-uescapefail		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
-
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
 
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
 
-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1		{uescapefail}?
-xustop2		{uescape}
-
 /* error rule to avoid backup */
 xufailed		[uU]&
 
@@ -438,20 +426,10 @@ other			.
 					BEGIN(xb);
 					ECHO;
 				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					ECHO;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					ECHO;
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					ECHO;
-				}
 
 {xhstart}		{
 					/* Hexadecimal bit type.
@@ -463,12 +441,6 @@ other			.
 					BEGIN(xh);
 					ECHO;
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					ECHO;
-				}
 
 {xnstart}		{
 					yyless(1);	/* eat only 'n' this time */
@@ -490,32 +462,38 @@ other			.
 					BEGIN(xus);
 					ECHO;
 				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					ECHO;
-				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
-					BEGIN(xusend);
+
+<xb,xh,xq,xe,xus>{quote} {
+					/*
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the end quote was in fact
+					 * the end of the string.  To reduce the lexer table size,
+					 * we use a single "xqs" state to do the lookahead for all
+					 * types of strings.
+					 */
+					cur_state->state_before_str_stop = YYSTATE;
+					BEGIN(xqs);
 					ECHO;
 				}
-<xusend>{whitespace} {
+<xqs>{quotecontinue} {
+					/*
+					 * Found a quote continuation, so return to the in-quote
+					 * state and continue scanning the literal.
+					 */
+					BEGIN(cur_state->state_before_str_stop);
 					ECHO;
 				}
-<xusend>{other} |
-<xusend>{xustop1} {
+<xqs>{quotecontinuefail} |
+<xqs>{other}	{
+					/*
+					 * Failed to see a quote continuation.  Throw back
+					 * everything after the end quote.
+					 */
 					yyless(0);
 					BEGIN(INITIAL);
-					ECHO;
-				}
-<xusend>{xustop2} {
-					BEGIN(INITIAL);
-					ECHO;
 				}
+
 <xq,xe,xus>{xqdouble} {
 					ECHO;
 				}
@@ -540,9 +518,6 @@ other			.
 <xe>{xehexesc}  {
 					ECHO;
 				}
-<xq,xe,xus>{quotecontinue} {
-					ECHO;
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					ECHO;
@@ -600,23 +575,10 @@ other			.
 					ECHO;
 				}
 <xui>{dquote} {
-					yyless(1);
-					BEGIN(xuiend);
-					ECHO;
-				}
-<xuiend>{whitespace} {
-					ECHO;
-				}
-<xuiend>{other} |
-<xuiend>{xustop1} {
-					yyless(0);
-					BEGIN(INITIAL);
-					ECHO;
-				}
-<xuiend>{xustop2}	{
 					BEGIN(INITIAL);
 					ECHO;
 				}
+
 <xd,xui>{xddouble}	{
 					ECHO;
 				}
@@ -1084,8 +1046,7 @@ psql_scan(PsqlScanState state,
 			switch (state->start_state)
 			{
 				case INITIAL:
-				case xuiend:	/* we treat these like INITIAL */
-				case xusend:
+				case xqs:		/* we treat this like INITIAL */
 					if (state->paren_depth > 0)
 					{
 						result = PSCAN_INCOMPLETE;
@@ -1240,7 +1201,8 @@ psql_scan_reselect_sql_lexer(PsqlScanState state)
 bool
 psql_scan_in_quote(PsqlScanState state)
 {
-	return state->start_state != INITIAL;
+	return state->start_state != INITIAL &&
+			state->start_state != xqs;
 }
 
 /*
diff --git a/src/include/fe_utils/psqlscan_int.h b/src/include/fe_utils/psqlscan_int.h
index 2acb380078..f53ccbf82e 100644
--- a/src/include/fe_utils/psqlscan_int.h
+++ b/src/include/fe_utils/psqlscan_int.h
@@ -110,6 +110,7 @@ typedef struct PsqlScanStateData
 	 * and updated with its finishing state on exit.
 	 */
 	int			start_state;	/* yylex's starting/finishing state */
+	int			state_before_str_stop;	/* start cond. before end quote */
 	int			paren_depth;	/* depth of nesting in parentheses */
 	int			xcdepth;		/* depth of nesting in slash-star comments */
 	char	   *dolqstart;		/* current $foo$ quote start string */
diff --git a/src/include/parser/gramparse.h b/src/include/parser/gramparse.h
index add64bc170..cf7c966362 100644
--- a/src/include/parser/gramparse.h
+++ b/src/include/parser/gramparse.h
@@ -21,6 +21,7 @@
 
 #include "nodes/parsenodes.h"
 #include "parser/scanner.h"
+#include "mb/pg_wchar.h"
 
 /*
  * NB: include gram.h only AFTER including scanner.h, because scanner.h
@@ -72,4 +73,12 @@ extern int	base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp,
 extern void parser_init(base_yy_extra_type *yyext);
 extern int	base_yyparse(core_yyscan_t yyscanner);
 
+/* from scan.l */
+extern void check_unicode_value(pg_wchar c, char *loc, core_yyscan_t yyscanner);
+extern unsigned int hexval(unsigned char c);
+extern bool is_utf16_surrogate_first(pg_wchar c);
+extern bool is_utf16_surrogate_second(pg_wchar c);
+extern pg_wchar surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second);
+
+
 #endif							/* GRAMPARSE_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 00ace8425e..5893d317d8 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -416,6 +416,7 @@ PG_KEYWORD("truncate", TRUNCATE, UNRESERVED_KEYWORD)
 PG_KEYWORD("trusted", TRUSTED, UNRESERVED_KEYWORD)
 PG_KEYWORD("type", TYPE_P, UNRESERVED_KEYWORD)
 PG_KEYWORD("types", TYPES_P, UNRESERVED_KEYWORD)
+PG_KEYWORD("uescape", UESCAPE, UNRESERVED_KEYWORD)
 PG_KEYWORD("unbounded", UNBOUNDED, UNRESERVED_KEYWORD)
 PG_KEYWORD("uncommitted", UNCOMMITTED, UNRESERVED_KEYWORD)
 PG_KEYWORD("unencrypted", UNENCRYPTED, UNRESERVED_KEYWORD)
diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h
index 731a2bd264..571d5e273f 100644
--- a/src/include/parser/scanner.h
+++ b/src/include/parser/scanner.h
@@ -48,7 +48,7 @@ typedef union core_YYSTYPE
  * However, those are not defined in this file, because bison insists on
  * defining them for itself.  The token codes used by the core scanner are
  * the ASCII characters plus these:
- *	%token <str>	IDENT FCONST SCONST BCONST XCONST Op
+ *	%token <str>	IDENT UIDENT FCONST SCONST UCONST BCONST XCONST Op
  *	%token <ival>	ICONST PARAM
  *	%token			TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
  *	%token			LESS_EQUALS GREATER_EQUALS NOT_EQUALS
@@ -99,6 +99,7 @@ typedef struct core_yy_extra_type
 	int			literallen;		/* actual current string length */
 	int			literalalloc;	/* current allocated buffer size */
 
+	int			state_before_str_stop;	/* start cond. before end quote */
 	int			xcdepth;		/* depth of nesting in slash-star comments */
 	char	   *dolqstart;		/* current $foo$ quote start string */
 
diff --git a/src/interfaces/ecpg/preproc/ecpg.tokens b/src/interfaces/ecpg/preproc/ecpg.tokens
index 1d613af02f..749a9146ba 100644
--- a/src/interfaces/ecpg/preproc/ecpg.tokens
+++ b/src/interfaces/ecpg/preproc/ecpg.tokens
@@ -24,4 +24,4 @@
                 S_TYPEDEF
 
 %token CSTRING CVARIABLE CPP_LINE IP
-%token DOLCONST ECONST NCONST UCONST UIDENT
+%token DOLCONST ECONST NCONST
diff --git a/src/interfaces/ecpg/preproc/ecpg.trailer b/src/interfaces/ecpg/preproc/ecpg.trailer
index b303a9cbd0..dbf1abb5fb 100644
--- a/src/interfaces/ecpg/preproc/ecpg.trailer
+++ b/src/interfaces/ecpg/preproc/ecpg.trailer
@@ -1812,7 +1812,6 @@ ecpg_sconst:
 			$$[strlen($1)+3]='\0';
 			free($1);
 		}
-		| UCONST	{ $$ = $1; }
 		| DOLCONST	{ $$ = $1; }
 		;
 
@@ -1820,7 +1819,6 @@ ecpg_xconst:	XCONST		{ $$ = make_name(); } ;
 
 ecpg_ident:	IDENT		{ $$ = make_name(); }
 		| CSTRING	{ $$ = make3_str(mm_strdup("\""), $1, mm_strdup("\"")); }
-		| UIDENT	{ $$ = $1; }
 		;
 
 quoted_ident_stringvar: name
diff --git a/src/interfaces/ecpg/preproc/parse.pl b/src/interfaces/ecpg/preproc/parse.pl
index 3619706cdc..dc40b2974c 100644
--- a/src/interfaces/ecpg/preproc/parse.pl
+++ b/src/interfaces/ecpg/preproc/parse.pl
@@ -218,8 +218,8 @@ sub main
 				if ($a eq 'IDENT' && $prior eq '%nonassoc')
 				{
 
-					# add two more tokens to the list
-					$str = $str . "\n%nonassoc CSTRING\n%nonassoc UIDENT";
+					# add one more tokens to the list
+					$str = $str . "\n%nonassoc CSTRING";
 				}
 				$prior = $a;
 			}
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index dea95f4230..f0533d8407 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -232,7 +232,7 @@ static	void			check_raise_parameters(PLpgSQL_stmt_raise *stmt);
  * Some of these are not directly referenced in this file, but they must be
  * here anyway.
  */
-%token <str>	IDENT FCONST SCONST BCONST XCONST Op
+%token <str>	IDENT UIDENT FCONST SCONST UCONST BCONST XCONST Op
 %token <ival>	ICONST PARAM
 %token			TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
 %token			LESS_EQUALS GREATER_EQUALS NOT_EQUALS
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 486c00b3b3..ea9697a736 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -48,15 +48,15 @@ SELECT 'tricky' AS U&"\" UESCAPE '!';
 (1 row)
 
 SELECT U&'wrong: \061';
-ERROR:  invalid Unicode escape value at or near "\061'"
+ERROR:  invalid Unicode escape value
 LINE 1: SELECT U&'wrong: \061';
                          ^
 SELECT U&'wrong: \+0061';
-ERROR:  invalid Unicode escape value at or near "\+0061'"
+ERROR:  invalid Unicode escape value
 LINE 1: SELECT U&'wrong: \+0061';
                          ^
 SELECT U&'wrong: +0061' UESCAPE '+';
-ERROR:  invalid Unicode escape character at or near "+'"
+ERROR:  invalid Unicode escape character "+"
 LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
                                          ^
 SET standard_conforming_strings TO off;

#15

Chapman Flack

chap@anastigmatix.net

over 6 years ago

In reply to: John Naylor (#14)

Re: benchmarking Flex practices

On 07/24/19 03:45, John Naylor wrote:

On Sun, Jul 21, 2019 at 3:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

However, my second reaction was that maybe you were on to something
upthread when you speculated about postponing de-escaping of
Unicode literals into the grammar. If we did it like that then

Wow, yay. I hadn't been following this thread, but I had just recently
looked over my own earlier musings [1] and started thinking "no, it would
be outlandish to ask the lexer to return utf-8 always ... but what about
postponing the de-escaping of Unicode literals into the grammar?" and
had started to think about when I might have a chance to try making a
patch.

With the de-escaping postponed, I think we'd be able to move beyond the
current odd situation where Unicode escapes can't describe non-ascii
characters, in exactly and only the cases where you need them to.

-Chap

[1]: /messages/by-id/6688474e-7c28-b352-bcec-ea0ef59d7a1a@anastigmatix.net

#16

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: Chapman Flack (#15)

Re: benchmarking Flex practices

Chapman Flack <chap@anastigmatix.net> writes:

On 07/24/19 03:45, John Naylor wrote:

On Sun, Jul 21, 2019 at 3:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

However, my second reaction was that maybe you were on to something
upthread when you speculated about postponing de-escaping of
Unicode literals into the grammar. If we did it like that then

With the de-escaping postponed, I think we'd be able to move beyond the
current odd situation where Unicode escapes can't describe non-ascii
characters, in exactly and only the cases where you need them to.

How so? The grammar doesn't really have any more context information
than the lexer does. (In both cases, it would be ugly but not really
invalid for the transformation to depend on the database encoding,
I think.)

regards, tom lane

#17

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: John Naylor (#14)

Re: benchmarking Flex practices

John Naylor <john.naylor@2ndquadrant.com> writes:

On Sun, Jul 21, 2019 at 3:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

So I'm feeling like maybe we should experiment to see what that
solution looks like, before we commit to going in this direction.
What do you think?

Given the above wrinkles, I thought it was worth trying. Attached is a
rough patch (don't mind the #include mess yet :-) ) that works like
this:

The lexer returns UCONST from xus and UIDENT from xui. The grammar has
rules that are effectively:

SCONST { do nothing}
| UCONST { esc char is backslash }
| UCONST UESCAPE SCONST { esc char is from $3 }

...where UESCAPE is now an unreserved keyword. To prevent shift-reduce
conflicts, I added UIDENT to the %nonassoc precedence list to match
IDENT, and for UESCAPE I added a %left precedence declaration. Maybe
there's a more principled way. I also added an unsigned char type to
the %union, but it worked fine on my compiler without it.

I think it might be better to drop the separate "Uescape" production and
just inline that into the calling rules, exactly per your sketch above.
You could avoid duplicating the escape-checking logic by moving that into
the str_udeescape support function. This would avoid the need for the
"uchr" union variant, but more importantly it seems likely to be more
future-proof: IME, any time you can avoid or postpone shift/reduce
decisions, it's better to do so.

I didn't try, but I think this might allow dropping the %left for
UESCAPE. That bothers me because I don't understand why it's
needed or what precedence level it ought to have.

litbuf_udeescape() and check_uescapechar() were moved to gram.y. The
former had to be massaged to give error messages similar to HEAD. They're
not quite identical, but the position info is preserved. Some of the
functions I moved around don't seem to have any test coverage, so I
should eventually do some work in that regard.

I don't terribly like the cross-calls you have between gram.y and scan.l
in this formulation. If we have to make these functions (hexval() etc)
non-static anyway, maybe we should shove them all into scansup.c?

-Binary size is very close to v6. That is to say the grammar tables
grew by about the same amount the scanner table shrank, so the binary
is still about 200kB smaller than HEAD.

OK.

-Performance is very close to v6 with the information_schema and
pgbench-like queries with standard strings, which is to say also very
close to HEAD. When the latter was changed to use Unicode escapes,
however, it was about 15% slower than HEAD. That's a big regression
and I haven't tried to pinpoint why.

I don't quite follow what you changed to produce the slower test case?
But that seems to be something we'd better run to ground before
deciding whether to go this way.

-The ecpg changes here are only the bare minimum from HEAD to get it
to compile, since I'm borrowing its additional token names (although
they mean slightly different things). After a bit of experimentation,
it's clear there's a bit more work needed to get it functional, and
it's not easy to debug, so I'm putting that off until we decide
whether this is the way forward.

On the whole I like this approach, modulo the performance question.
Let's try to work that out before worrying about ecpg.

regards, tom lane

#18

John Naylor

john.naylor@2ndquadrant.com

over 6 years ago

In reply to: Tom Lane (#17)

1 attachment(s)

Re: benchmarking Flex practices

On Mon, Jul 29, 2019 at 10:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

John Naylor <john.naylor@2ndquadrant.com> writes:

The lexer returns UCONST from xus and UIDENT from xui. The grammar has
rules that are effectively:

SCONST { do nothing}
| UCONST { esc char is backslash }
| UCONST UESCAPE SCONST { esc char is from $3 }

...where UESCAPE is now an unreserved keyword. To prevent shift-reduce
conflicts, I added UIDENT to the %nonassoc precedence list to match
IDENT, and for UESCAPE I added a %left precedence declaration. Maybe
there's a more principled way. I also added an unsigned char type to
the %union, but it worked fine on my compiler without it.

I think it might be better to drop the separate "Uescape" production and
just inline that into the calling rules, exactly per your sketch above.
You could avoid duplicating the escape-checking logic by moving that into
the str_udeescape support function. This would avoid the need for the
"uchr" union variant, but more importantly it seems likely to be more
future-proof: IME, any time you can avoid or postpone shift/reduce
decisions, it's better to do so.

I didn't try, but I think this might allow dropping the %left for
UESCAPE. That bothers me because I don't understand why it's
needed or what precedence level it ought to have.

I tried this, and removing the %left still gives me a shift/reduce
conflict, so I put some effort in narrowing down what's happening. If
I remove the rules with UESCAPE individually, I find that precedence
is not needed for Sconst -- only for Ident. I tried reverting all the
rules to use the original "IDENT" token and one by one changed them to
"Ident", and found 6 places where doing so caused a shift-reduce
conflict:

createdb_opt_name
xmltable_column_option_el
ColId
type_function_name
NonReservedWord
ColLabel

Due to the number of affected places, that didn't seem like a useful
avenue to pursue, so I tried the following:

-Making UESCAPE a reserved keyword or separate token type works, but
other keyword types don't work. Not acceptable, but maybe useful info.
-Giving UESCAPE an %nonassoc precedence above UIDENT works, even if
UIDENT is the lowest in the list. This seems the least intrusive, so I
went with that for v8. One possible downside is that UIDENT now no
longer has the same precedence as IDENT. Not sure if it matters, but
could we fix that contextually with "%prec IDENT"?

litbuf_udeescape() and check_uescapechar() were moved to gram.y. The
former had to be massaged to give error messages similar to HEAD. They're
not quite identical, but the position info is preserved. Some of the
functions I moved around don't seem to have any test coverage, so I
should eventually do some work in that regard.

I don't terribly like the cross-calls you have between gram.y and scan.l
in this formulation. If we have to make these functions (hexval() etc)
non-static anyway, maybe we should shove them all into scansup.c?

I ended up making them static inline in scansup.h since that seemed to
reduce the performance impact (results below). I cribbed some of the
surrogate pair queries from the jsonpath regression tests so we have
some coverage here. Diff'ing from HEAD to patch, the locations are
different for a couple cases (a side effect of the different error
handling style from scan.l). The patch seems to consistently point at
an escape sequence, so I think it's okay to use that. HEAD, on the
other hand, sometimes points at the start of the whole string:

 select U&'\de04\d83d'; -- surrogates in wrong order
-psql:test_unicode.sql:10: ERROR:  invalid Unicode surrogate pair at or near "U&'\de04\d83d'"
+psql:test_unicode.sql:10: ERROR:  invalid Unicode surrogate pair
 LINE 1: select U&'\de04\d83d';
-               ^
+                  ^
 select U&'\de04X'; -- orphan low surrogate
-psql:test_unicode.sql:12: ERROR:  invalid Unicode surrogate pair at or near "U&'\de04X'"
+psql:test_unicode.sql:12: ERROR:  invalid Unicode surrogate pair
 LINE 1: select U&'\de04X';
-               ^
+                  ^
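
For readers following along, the pairing rule behind these two errors can
be sketched in C. The helper names match the ones declared in the patch
(is_utf16_surrogate_first() and friends), but the bodies below are an
assumption written from the UTF-16 definition, not copied from the patch,
and surrogate_sequence_ok() is a hypothetical wrapper that only mirrors
the pair_first bookkeeping in str_udeescape():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* High (leading) surrogates occupy U+D800..U+DBFF. */
static inline bool
is_utf16_surrogate_first(uint32_t c)
{
	return c >= 0xD800 && c <= 0xDBFF;
}

/* Low (trailing) surrogates occupy U+DC00..U+DFFF. */
static inline bool
is_utf16_surrogate_second(uint32_t c)
{
	return c >= 0xDC00 && c <= 0xDFFF;
}

/* Combine a valid high/low pair into one code point above U+FFFF. */
static inline uint32_t
surrogate_pair_to_codepoint(uint32_t first, uint32_t second)
{
	assert(is_utf16_surrogate_first(first));
	assert(is_utf16_surrogate_second(second));
	return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
}

/*
 * Hypothetical checker: true if every high surrogate is immediately
 * followed by a low surrogate and no low surrogate appears on its own.
 */
static bool
surrogate_sequence_ok(const uint32_t *units, size_t n)
{
	uint32_t	pair_first = 0;

	for (size_t i = 0; i < n; i++)
	{
		uint32_t	c = units[i];

		if (pair_first)
		{
			if (!is_utf16_surrogate_second(c))
				return false;	/* high surrogate not followed by low */
			pair_first = 0;
		}
		else if (is_utf16_surrogate_second(c))
			return false;		/* orphan or out-of-order low surrogate */
		else if (is_utf16_surrogate_first(c))
			pair_first = c;
	}
	return pair_first == 0;		/* a trailing high surrogate is an error */
}
```

Under these definitions both test queries above are rejected at the first
escape, \de04, because it is a low surrogate with no preceding high one.
That would be consistent with the patch's error cursor pointing at the
escape sequence rather than at the start of the string.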

-Performance is very close to v6 with the information_schema and
pgbench-like queries with standard strings, which is to say also very
close to HEAD. When the latter was changed to use Unicode escapes,
however, it was about 15% slower than HEAD. That's a big regression
and I haven't tried to pinpoint why.

I don't quite follow what you changed to produce the slower test case?
But that seems to be something we'd better run to ground before
deciding whether to go this way.

So "pgbench str" below refers to driving the parser with this set of
queries repeated a couple hundred times in a string:

BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + 'foobarbaz' WHERE
aid = 'foobarbaz';
SELECT abalance FROM pgbench_accounts WHERE aid = 'foobarbaz';
UPDATE pgbench_tellers SET tbalance = tbalance + 'foobarbaz' WHERE tid
= 'foobarbaz';
UPDATE pgbench_branches SET bbalance = bbalance + 'foobarbaz' WHERE
bid = 'foobarbaz';
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
('foobarbaz', 'foobarbaz', 'foobarbaz', 'foobarbaz',
CURRENT_TIMESTAMP);
END;

and "pgbench uesc" is the same, but the string is

U&'d!0061t!+000061'
uescape
'!'
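
To make the "uesc" case concrete, here is a minimal standalone sketch of
what de-escaping such a string involves. It is not the patch's
str_udeescape(): the names parse_hex(), utf8_encode() and
udeescape_sketch() are hypothetical, surrogate pairs are simply rejected
rather than combined, the escape character is not validated, and the
server-encoding check is left out entirely.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Parse exactly n hex digits; fails on any non-hex byte, including '\0'. */
static bool
parse_hex(const char *s, int n, uint32_t *value)
{
	uint32_t	v = 0;

	for (int i = 0; i < n; i++)
	{
		unsigned char c = (unsigned char) s[i];

		if (!isxdigit(c))
			return false;
		v = (v << 4) | (uint32_t) (isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
	}
	*value = v;
	return true;
}

/* Encode one code point as UTF-8; returns the number of bytes written. */
static size_t
utf8_encode(uint32_t cp, char *out)
{
	if (cp < 0x80)
	{
		out[0] = (char) cp;
		return 1;
	}
	if (cp < 0x800)
	{
		out[0] = (char) (0xC0 | (cp >> 6));
		out[1] = (char) (0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000)
	{
		out[0] = (char) (0xE0 | (cp >> 12));
		out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
		out[2] = (char) (0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (char) (0xF0 | (cp >> 18));
	out[1] = (char) (0x80 | ((cp >> 12) & 0x3F));
	out[2] = (char) (0x80 | ((cp >> 6) & 0x3F));
	out[3] = (char) (0x80 | (cp & 0x3F));
	return 4;
}

/*
 * De-escape "in" using "escape" as the escape character.  "out" must have
 * room for strlen(in) + 1 bytes, since an escape never expands to more
 * bytes than it occupies.  Returns false on a malformed escape.
 */
static bool
udeescape_sketch(const char *in, char escape, char *out)
{
	assert(in != NULL && out != NULL);

	while (*in)
	{
		uint32_t	cp;
		int			consumed;

		if (*in != escape)
		{
			*out++ = *in++;		/* ordinary byte, copy through */
			continue;
		}
		if (in[1] == escape)
		{
			*out++ = escape;	/* doubled escape char stands for itself */
			in += 2;
			continue;
		}
		if (in[1] == '+' && parse_hex(in + 2, 6, &cp))
			consumed = 8;		/* escape, '+', six hex digits */
		else if (parse_hex(in + 1, 4, &cp))
			consumed = 5;		/* escape, four hex digits */
		else
			return false;

		/* Sketch limitation: no surrogate pairing, just reject them. */
		if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
			return false;

		out += utf8_encode(cp, out);
		in += consumed;
	}
	*out = '\0';
	return true;
}
```

With the benchmark string, udeescape_sketch("d!0061t!+000061", '!', buf)
leaves "data" in buf. An input like "wrong: \061" with a backslash escape
returns false, since only three hex digits follow the escape character.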

Now that I think of it, the regression in v7 was largely due to the
fact that the parser has to call the lexer 3 times per string in this
case, and that's going to be slower no matter what we do. I added a
separate test with ordinary backslash escapes ("pgbench unicode"),
rebased v6-8 onto the same commit on master, and reran the performance
tests. The runs are generally +/- 1%:

                 master  v6     v7     v8
info-schema      1.49s   1.48s  1.50s  1.53s
pgbench str      1.12s   1.13s  1.15s  1.17s
pgbench unicode  1.29s   1.29s  1.40s  1.36s
pgbench uesc     1.42s   1.44s  1.64s  1.58s
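
For anyone who wants the relative numbers, the slowdown against master
can be computed from the table as (t - base) / base. A trivial helper,
with slowdown_pct() being a hypothetical name used only for this
illustration:

```c
#include <assert.h>

/* Percentage by which "t" is slower than "base" (negative means faster). */
static double
slowdown_pct(double base, double t)
{
	assert(base > 0.0);
	return (t - base) / base * 100.0;
}
```

Plugging in the "pgbench uesc" row gives roughly 15.5% for v7 (1.64s vs
1.42s) and roughly 11.3% for v8 (1.58s vs 1.42s). For "pgbench unicode"
the figures are roughly 8.5% and 5.4%. Keep in mind the runs vary by
about 1%, so the small differences in the first two rows are close to
the noise.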

Inlining hexval() and friends seems to have helped somewhat for
unicode escapes, but I'd have to profile to improve that further.
However, v8 has regressed from v7 enough with both simple strings and
the information schema that it's a noticeable regression from HEAD.
I'm guessing getting rid of the "Uescape" production is to blame, but
I haven't tried reverting just that one piece. Since inlining the
rules didn't seem to help with the precedence hacks, it seems like the
separate production was a better way. Thoughts?

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v8-draft-handle-uescapes-in-parser.patch (application/octet-stream)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c97bb367f8..864591e086 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -60,6 +60,7 @@
 #include "parser/gramparse.h"
 #include "parser/parser.h"
 #include "parser/parse_expr.h"
+#include "parser/scansup.h"
 #include "storage/lmgr.h"
 #include "utils/date.h"
 #include "utils/datetime.h"
@@ -188,6 +189,9 @@ static void processCASbits(int cas_bits, int location, const char *constrType,
 			   bool *deferrable, bool *initdeferred, bool *not_valid,
 			   bool *no_inherit, core_yyscan_t yyscanner);
 static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
+static char * str_udeescape(char *str, int str_pos, char *esc_str, int esc_pos, core_yyscan_t yyscanner);
+static bool check_uescapechar(unsigned char escape);
+static bool check_unicode_value(pg_wchar c, char *loc, core_yyscan_t yyscanner);
 
 %}
 
@@ -528,6 +532,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %type <ival>	Iconst SignedIconst
 %type <str>		Sconst comment_text notify_payload
+%type <str>		Ident
 %type <str>		RoleId opt_boolean_or_string
 %type <list>	var_list
 %type <str>		ColId ColLabel var_name type_function_name param_name
@@ -599,7 +604,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
  * DOT_DOT is unused in the core SQL grammar, and so will always provoke
  * parse errors.  It is needed by PL/pgSQL.
  */
-%token <str>	IDENT FCONST SCONST BCONST XCONST Op
+%token <str>	IDENT UIDENT FCONST SCONST UCONST BCONST XCONST Op
 %token <ival>	ICONST PARAM
 %token			TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
 %token			LESS_EQUALS GREATER_EQUALS NOT_EQUALS
@@ -689,7 +694,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	TREAT TRIGGER TRIM TRUE_P
 	TRUNCATE TRUSTED TYPE_P TYPES_P
 
-	UNBOUNDED UNCOMMITTED UNENCRYPTED UNION UNIQUE UNKNOWN UNLISTEN UNLOGGED
+	UESCAPE UNBOUNDED UNCOMMITTED UNENCRYPTED UNION UNIQUE UNKNOWN UNLISTEN UNLOGGED
 	UNTIL UPDATE USER USING
 
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
@@ -718,6 +723,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 
 /* Precedence: lowest to highest */
+%nonassoc	UIDENT			/* UESCAPE must be higher than UIDENT */
+%nonassoc	UESCAPE
 %nonassoc	SET				/* see relation_expr_opt_alias */
 %left		UNION EXCEPT
 %left		INTERSECT
@@ -1048,7 +1055,7 @@ AlterOptRoleElem:
 				{
 					$$ = makeDefElem("rolemembers", (Node *)$2, @1);
 				}
-			| IDENT
+			| Ident
 				{
 					/*
 					 * We handle identifiers that aren't parser keywords with
@@ -1602,14 +1609,14 @@ opt_boolean_or_string:
  * - an integer or floating point number
  * - a time interval per SQL99
  * ColId gives reduce/reduce errors against ConstInterval and LOCAL,
- * so use IDENT (meaning we reject anything that is a key word).
+ * so use Ident (meaning we reject anything that is a key word).
  */
 zone_value:
 			Sconst
 				{
 					$$ = makeStringConst($1, @1);
 				}
-			| IDENT
+			| Ident
 				{
 					$$ = makeStringConst($1, @1);
 				}
@@ -3871,7 +3878,7 @@ PartitionSpec: PARTITION BY part_strategy '(' part_params ')'
 				}
 		;
 
-part_strategy:	IDENT					{ $$ = $1; }
+part_strategy:	Ident					{ $$ = $1; }
 				| unreserved_keyword	{ $$ = pstrdup($1); }
 		;
 
@@ -5262,7 +5269,7 @@ RowSecurityOptionalToRole:
 		;
 
 RowSecurityDefaultPermissive:
-			AS IDENT
+			AS Ident
 				{
 					if (strcmp($2, "permissive") == 0)
 						$$ = true;
@@ -5831,11 +5838,11 @@ old_aggr_list: old_aggr_elem						{ $$ = list_make1($1); }
 		;
 
 /*
- * Must use IDENT here to avoid reduce/reduce conflicts; fortunately none of
+ * Must use Ident here to avoid reduce/reduce conflicts; fortunately none of
  * the item names needed in old aggregate definitions are likely to become
  * SQL keywords.
  */
-old_aggr_elem:  IDENT '=' def_arg
+old_aggr_elem:  Ident '=' def_arg
 				{
 					$$ = makeDefElem($1, (Node *)$3, @1);
 				}
@@ -10113,7 +10120,7 @@ createdb_opt_item:
 /*
  * Ideally we'd use ColId here, but that causes shift/reduce conflicts against
  * the ALTER DATABASE SET/RESET syntaxes.  Instead call out specific keywords
- * we need, and allow IDENT so that database option names don't have to be
+ * we need, and allow Ident so that database option names don't have to be
  * parser keywords unless they are already keywords for other reasons.
  *
  * XXX this coding technique is fragile since if someone makes a formerly
@@ -10122,7 +10129,7 @@ createdb_opt_item:
  * exercising every such option, at least at the syntax level.
  */
 createdb_opt_name:
-			IDENT							{ $$ = $1; }
+			Ident							{ $$ = $1; }
 			| CONNECTION LIMIT				{ $$ = pstrdup("connection_limit"); }
 			| ENCODING						{ $$ = pstrdup($1); }
 			| LOCATION						{ $$ = pstrdup($1); }
@@ -12424,7 +12431,7 @@ xmltable_column_option_list:
 		;
 
 xmltable_column_option_el:
-			IDENT b_expr
+			Ident b_expr
 				{ $$ = makeDefElem($1, $2, @1); }
 			| DEFAULT b_expr
 				{ $$ = makeDefElem("default", $2, @1); }
@@ -14412,7 +14419,7 @@ extract_list:
  * - thomas 2001-04-12
  */
 extract_arg:
-			IDENT									{ $$ = $1; }
+			Ident									{ $$ = $1; }
 			| YEAR_P								{ $$ = "year"; }
 			| MONTH_P								{ $$ = "month"; }
 			| DAY_P									{ $$ = "day"; }
@@ -14655,7 +14662,7 @@ target_el:	a_expr AS ColLabel
 			 * as an infix expression, which we accomplish by assigning
 			 * IDENT a precedence higher than POSTFIXOP.
 			 */
-			| a_expr IDENT
+			| a_expr Ident
 				{
 					$$ = makeNode(ResTarget);
 					$$->name = $2;
@@ -14874,13 +14881,53 @@ AexprConst: Iconst
 		;
 
 Iconst:		ICONST									{ $$ = $1; };
-Sconst:		SCONST									{ $$ = $1; };
+Sconst:		SCONST
+				{
+					$$ = $1;
+				}
+			| UCONST
+				{
+					$$ = str_udeescape($1, @1, "\\", 0, yyscanner);
+				}
+			| UCONST UESCAPE SCONST
+				{
+					$$ = str_udeescape($1, @1, $3, @3, yyscanner);
+				}
+		;
 
 SignedIconst: Iconst								{ $$ = $1; }
 			| '+' Iconst							{ $$ = + $2; }
 			| '-' Iconst							{ $$ = - $2; }
 		;
 
+Ident:		IDENT
+				{
+					$$ = $1;
+				}
+			| UIDENT
+				{
+					char 	   *ident;
+					int			identlen;
+
+					ident = str_udeescape($1, @1, "\\", 0, yyscanner);
+					identlen = strlen(ident);
+					if (identlen >= NAMEDATALEN)
+						truncate_identifier(ident, identlen, true);
+					$$ = ident;
+				}
+			| UIDENT UESCAPE SCONST
+				{
+					char 	   *ident;
+					int			identlen;
+
+					ident = str_udeescape($1, @1, $3, @3, yyscanner);
+					identlen = strlen(ident);
+					if (identlen >= NAMEDATALEN)
+						truncate_identifier(ident, identlen, true);
+					$$ = ident;
+				}
+		;
+
 /* Role specifications */
 RoleId:		RoleSpec
 				{
@@ -14971,21 +15018,21 @@ role_list:	RoleSpec
 
 /* Column identifier --- names that can be column, table, etc names.
  */
-ColId:		IDENT									{ $$ = $1; }
+ColId:		Ident									{ $$ = $1; }
 			| unreserved_keyword					{ $$ = pstrdup($1); }
 			| col_name_keyword						{ $$ = pstrdup($1); }
 		;
 
 /* Type/function identifier --- names that can be type or function names.
  */
-type_function_name:	IDENT							{ $$ = $1; }
+type_function_name:	Ident							{ $$ = $1; }
 			| unreserved_keyword					{ $$ = pstrdup($1); }
 			| type_func_name_keyword				{ $$ = pstrdup($1); }
 		;
 
 /* Any not-fully-reserved word --- these names can be, eg, role names.
  */
-NonReservedWord:	IDENT							{ $$ = $1; }
+NonReservedWord:	Ident							{ $$ = $1; }
 			| unreserved_keyword					{ $$ = pstrdup($1); }
 			| col_name_keyword						{ $$ = pstrdup($1); }
 			| type_func_name_keyword				{ $$ = pstrdup($1); }
@@ -14994,7 +15041,7 @@ NonReservedWord:	IDENT							{ $$ = $1; }
 /* Column label --- allowed labels in "AS" clauses.
  * This presently includes *all* Postgres keywords.
  */
-ColLabel:	IDENT									{ $$ = $1; }
+ColLabel:	Ident									{ $$ = $1; }
 			| unreserved_keyword					{ $$ = pstrdup($1); }
 			| col_name_keyword						{ $$ = pstrdup($1); }
 			| type_func_name_keyword				{ $$ = pstrdup($1); }
@@ -15282,6 +15329,7 @@ unreserved_keyword:
 			| TRUSTED
 			| TYPE_P
 			| TYPES_P
+			| UESCAPE
 			| UNBOUNDED
 			| UNCOMMITTED
 			| UNENCRYPTED
@@ -16351,3 +16399,202 @@ parser_init(base_yy_extra_type *yyext)
 {
 	yyext->parsetree = NIL;		/* in case grammar forgets to set it */
 }
+
+/* handle unicode escapes */
+static char *
+str_udeescape(char *str, int str_pos,
+			  char *esc_str, int esc_pos,
+			  core_yyscan_t yyscanner)
+{
+	char	   *new,
+			   *in,
+			   *out;
+	int			str_length,
+				esc_length;
+	pg_wchar	pair_first = 0;
+	unsigned char escape;
+
+	esc_length = strlen(esc_str);
+	escape = esc_str[0];
+
+	str_length = strlen(str);
+
+	if (esc_length != 1 ||
+		!check_uescapechar(escape))
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("invalid Unicode escape character \"%s\"", esc_str),
+				 parser_errposition(esc_pos + 1))); /* 1 for the quote */
+
+	/*
+	 * This relies on the subtle assumption that a UTF-8 expansion cannot be
+	 * longer than its escaped representation.
+	 */
+	new = palloc(str_length + 1);
+
+	in = str;
+	out = new;
+	while (*in)
+	{
+		if (in[0] == escape)
+		{
+			if (in[1] == escape)
+			{
+				if (pair_first)
+					goto invalid_pair;
+				*out++ = escape;
+				in += 2;
+			}
+			else if (isxdigit((unsigned char) in[1]) &&
+					 isxdigit((unsigned char) in[2]) &&
+					 isxdigit((unsigned char) in[3]) &&
+					 isxdigit((unsigned char) in[4]))
+			{
+				pg_wchar	unicode;
+
+				unicode = (hexval(in[1]) << 12) +
+					(hexval(in[2]) << 8) +
+					(hexval(in[3]) << 4) +
+					hexval(in[4]);
+				if (!check_unicode_value(unicode, in, yyscanner))
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8"),
+							 parser_errposition(str_pos + (in - str) + 3))); /* 3 for U&" */
+				}
+
+				if (pair_first)
+				{
+					if (is_utf16_surrogate_second(unicode))
+					{
+						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+						pair_first = 0;
+					}
+					else
+						goto invalid_pair;
+				}
+				else if (is_utf16_surrogate_second(unicode))
+					goto invalid_pair;
+
+				if (is_utf16_surrogate_first(unicode))
+					pair_first = unicode;
+				else
+				{
+					unicode_to_utf8(unicode, (unsigned char *) out);
+					out += pg_mblen(out);
+				}
+				in += 5;
+			}
+			else if (in[1] == '+' &&
+					 isxdigit((unsigned char) in[2]) &&
+					 isxdigit((unsigned char) in[3]) &&
+					 isxdigit((unsigned char) in[4]) &&
+					 isxdigit((unsigned char) in[5]) &&
+					 isxdigit((unsigned char) in[6]) &&
+					 isxdigit((unsigned char) in[7]))
+			{
+				pg_wchar	unicode;
+
+				unicode = (hexval(in[2]) << 20) +
+					(hexval(in[3]) << 16) +
+					(hexval(in[4]) << 12) +
+					(hexval(in[5]) << 8) +
+					(hexval(in[6]) << 4) +
+					hexval(in[7]);
+				if (!check_unicode_value(unicode, in, yyscanner))
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8"),
+							 parser_errposition(str_pos + (in - str) + 3))); /* 3 for U&" */
+				}
+				if (pair_first)
+				{
+					if (is_utf16_surrogate_second(unicode))
+					{
+						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+						pair_first = 0;
+					}
+					else
+						goto invalid_pair;
+				}
+				else if (is_utf16_surrogate_second(unicode))
+					goto invalid_pair;
+
+				if (is_utf16_surrogate_first(unicode))
+					pair_first = unicode;
+				else
+				{
+					unicode_to_utf8(unicode, (unsigned char *) out);
+					out += pg_mblen(out);
+				}
+				in += 8;
+			}
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("invalid Unicode escape value"),
+						 parser_errposition(str_pos + (in - str) + 3))); /* 3 for U&" */
+		}
+		else
+		{
+			if (pair_first)
+				goto invalid_pair;
+
+			*out++ = *in++;
+		}
+	}
+
+	/* unfinished surrogate pair? */
+	if (pair_first)
+		goto invalid_pair;
+
+	*out = '\0';
+
+	/*
+	 * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII
+	 * codes; but it's probably not worth the trouble, since this isn't likely
+	 * to be a performance-critical path.
+	 */
+	pg_verifymbstr(new, out - new, false);
+	return new;
+
+invalid_pair:
+	ereport(ERROR,
+			(errcode(ERRCODE_SYNTAX_ERROR),
+			 errmsg("invalid Unicode surrogate pair"),
+			 parser_errposition(str_pos + (in - str) + 3))); /* 3 for U&" */
+
+}
+
+/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
+static bool
+check_uescapechar(unsigned char escape)
+{
+	if (isxdigit(escape)
+		|| escape == '+'
+		|| escape == '\''
+		|| escape == '"'
+		|| scanner_isspace(escape))
+	{
+		return false;
+	}
+	else
+		return true;
+}
+
+/* XXX this is not covered in tests as far as I know, so check error position in caller */
+static bool
+check_unicode_value(pg_wchar c, char *loc, core_yyscan_t yyscanner)
+{
+	if (GetDatabaseEncoding() == PG_UTF8)
+		return true;
+
+	if (c > 0x7F)
+	{
+		return false;
+	}
+	else
+		return true;
+}
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index e1cae859e8..97bd54712e 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -110,14 +110,9 @@ const uint16 ScanKeywordTokens[] = {
 static void addlit(char *ytext, int yleng, core_yyscan_t yyscanner);
 static void addlitchar(unsigned char ychar, core_yyscan_t yyscanner);
 static char *litbufdup(core_yyscan_t yyscanner);
-static char *litbuf_udeescape(unsigned char escape, core_yyscan_t yyscanner);
 static unsigned char unescape_single_char(unsigned char c, core_yyscan_t yyscanner);
 static int	process_integer_literal(const char *token, YYSTYPE *lval);
-static bool is_utf16_surrogate_first(pg_wchar c);
-static bool is_utf16_surrogate_second(pg_wchar c);
-static pg_wchar surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second);
 static void addunicode(pg_wchar c, yyscan_t yyscanner);
-static bool check_uescapechar(unsigned char escape);
 
 #define yyerror(msg)  scanner_yyerror(msg, yyscanner)
 
@@ -168,12 +163,11 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
- *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
  *  <xus> quoted string with Unicode escapes
- *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
  *  <xeu> Unicode surrogate pair in extended quoted string
  *
  * Remember to add an <<EOF>> case whenever you add a new exclusive state!
@@ -185,12 +179,11 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
-%x xuiend
 %x xus
-%x xusend
 %x xeu
 
 /*
@@ -231,19 +224,18 @@ special_whitespace		({space}+|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
+quote			'
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue	{whitespace_with_newline}{quote}
+
 /*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
  */
-quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+quotecontinuefail	{whitespace}*"-"?
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -304,21 +296,12 @@ xdstop			{dquote}
 xddouble		{dquote}{dquote}
 xdinside		[^"]+
 
-/* Unicode escapes */
-uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
-/* error rule to avoid backup */
-uescapefail		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
-
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
 
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
 
-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1		{uescapefail}?
-xustop2		{uescape}
-
 /* error rule to avoid backup */
 xufailed		[uU]&
 
@@ -476,21 +459,10 @@ other			.
 					startlit();
 					addlitchar('b', yyscanner);
 				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return BCONST;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					addlit(yytext, yyleng, yyscanner);
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					/* ignore */
-				}
 <xb><<EOF>>		{ yyerror("unterminated bit string literal"); }
 
 {xhstart}		{
@@ -505,13 +477,6 @@ other			.
 					startlit();
 					addlitchar('x', yyscanner);
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return XCONST;
-				}
 <xh><<EOF>>		{ yyerror("unterminated hexadecimal string literal"); }
 
 {xnstart}		{
@@ -568,53 +533,67 @@ other			.
 					BEGIN(xus);
 					startlit();
 				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
+
+<xb,xh,xq,xe,xus>{quote} {
 					/*
-					 * check that the data remains valid if it might have been
-					 * made invalid by unescaping any chars.
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the end quote was in fact
+					 * the end of the string.  To reduce the lexer table size,
+					 * we use a single "xqs" state to do the lookahead for all
+					 * types of strings.
 					 */
-					if (yyextra->saw_non_ascii)
-						pg_verifymbstr(yyextra->literalbuf,
-									   yyextra->literallen,
-									   false);
-					yylval->str = litbufdup(yyscanner);
-					return SCONST;
-				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
-					/* xusend state looks for possible UESCAPE */
-					BEGIN(xusend);
+					yyextra->state_before_str_stop = YYSTATE;
+					BEGIN(xqs);
 				}
-<xusend>{whitespace} {
-					/* stay in xusend state over whitespace */
+<xqs>{quotecontinue} {
+					/*
+					 * Found a quote continuation, so return to the in-quote
+					 * state and continue scanning the literal.
+					 */
+					BEGIN(yyextra->state_before_str_stop);
 				}
-<xusend><<EOF>> |
-<xusend>{other} |
-<xusend>{xustop1} {
-					/* no UESCAPE after the quote, throw back everything */
+<xqs>{quotecontinuefail} |
+<xqs><<EOF>> |
+<xqs>{other}	{
+					/*
+					 * Failed to see a quote continuation.  Throw back
+					 * everything after the end quote, and handle the string
+					 * according to the state we were in previously.
+					 */
 					yyless(0);
 					BEGIN(INITIAL);
-					yylval->str = litbuf_udeescape('\\', yyscanner);
-					return SCONST;
-				}
-<xusend>{xustop2} {
-					/* found UESCAPE after the end quote */
-					BEGIN(INITIAL);
-					if (!check_uescapechar(yytext[yyleng - 2]))
+
+					switch (yyextra->state_before_str_stop)
 					{
-						SET_YYLLOC();
-						ADVANCE_YYLLOC(yyleng - 2);
-						yyerror("invalid Unicode escape character");
+						case xb:
+							yylval->str = litbufdup(yyscanner);
+							return BCONST;
+						case xh:
+							yylval->str = litbufdup(yyscanner);
+							return XCONST;
+						case xq:
+							/* fallthrough */
+						case xe:
+							/*
+							 * Check that the data remains valid if it
+							 * might have been made invalid by unescaping
+							 * any chars.
+							 */
+							if (yyextra->saw_non_ascii)
+								pg_verifymbstr(yyextra->literalbuf,
+											   yyextra->literallen,
+											   false);
+							yylval->str = litbufdup(yyscanner);
+							return SCONST;
+						case xus:
+							yylval->str = litbufdup(yyscanner);
+							return UCONST;
+						default:
+							yyerror("unhandled previous state in xqs");
 					}
-					yylval->str = litbuf_udeescape(yytext[yyleng - 2],
-												   yyscanner);
-					return SCONST;
 				}
+
 <xq,xe,xus>{xqdouble} {
 					addlitchar('\'', yyscanner);
 				}
@@ -693,9 +672,6 @@ other			.
 					if (c == '\0' || IS_HIGHBIT_SET(c))
 						yyextra->saw_non_ascii = true;
 				}
-<xq,xe,xus>{quotecontinue} {
-					/* ignore */
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					addlitchar(yytext[0], yyscanner);
@@ -770,53 +746,14 @@ other			.
 					return IDENT;
 				}
 <xui>{dquote} {
-					yyless(1);
-					/* xuiend state looks for possible UESCAPE */
-					BEGIN(xuiend);
-				}
-<xuiend>{whitespace} {
-					/* stay in xuiend state over whitespace */
-				}
-<xuiend><<EOF>> |
-<xuiend>{other} |
-<xuiend>{xustop1} {
-					/* no UESCAPE after the quote, throw back everything */
-					char	   *ident;
-					int			identlen;
-
-					yyless(0);
-
-					BEGIN(INITIAL);
 					if (yyextra->literallen == 0)
 						yyerror("zero-length delimited identifier");
-					ident = litbuf_udeescape('\\', yyscanner);
-					identlen = strlen(ident);
-					if (identlen >= NAMEDATALEN)
-						truncate_identifier(ident, identlen, true);
-					yylval->str = ident;
-					return IDENT;
-				}
-<xuiend>{xustop2}	{
-					/* found UESCAPE after the end quote */
-					char	   *ident;
-					int			identlen;
 
 					BEGIN(INITIAL);
-					if (yyextra->literallen == 0)
-						yyerror("zero-length delimited identifier");
-					if (!check_uescapechar(yytext[yyleng - 2]))
-					{
-						SET_YYLLOC();
-						ADVANCE_YYLLOC(yyleng - 2);
-						yyerror("invalid Unicode escape character");
-					}
-					ident = litbuf_udeescape(yytext[yyleng - 2], yyscanner);
-					identlen = strlen(ident);
-					if (identlen >= NAMEDATALEN)
-						truncate_identifier(ident, identlen, true);
-					yylval->str = ident;
-					return IDENT;
+					yylval->str = litbufdup(yyscanner);
+					return UIDENT;
 				}
+
 <xd,xui>{xddouble}	{
 					addlitchar('"', yyscanner);
 				}
@@ -1288,50 +1225,6 @@ process_integer_literal(const char *token, YYSTYPE *lval)
 	return ICONST;
 }
 
-static unsigned int
-hexval(unsigned char c)
-{
-	if (c >= '0' && c <= '9')
-		return c - '0';
-	if (c >= 'a' && c <= 'f')
-		return c - 'a' + 0xA;
-	if (c >= 'A' && c <= 'F')
-		return c - 'A' + 0xA;
-	elog(ERROR, "invalid hexadecimal digit");
-	return 0;					/* not reached */
-}
-
-static void
-check_unicode_value(pg_wchar c, char *loc, core_yyscan_t yyscanner)
-{
-	if (GetDatabaseEncoding() == PG_UTF8)
-		return;
-
-	if (c > 0x7F)
-	{
-		ADVANCE_YYLLOC(loc - yyextra->literalbuf + 3);	/* 3 for U&" */
-		yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8");
-	}
-}
-
-static bool
-is_utf16_surrogate_first(pg_wchar c)
-{
-	return (c >= 0xD800 && c <= 0xDBFF);
-}
-
-static bool
-is_utf16_surrogate_second(pg_wchar c)
-{
-	return (c >= 0xDC00 && c <= 0xDFFF);
-}
-
-static pg_wchar
-surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second)
-{
-	return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
-}
-
 static void
 addunicode(pg_wchar c, core_yyscan_t yyscanner)
 {
@@ -1349,172 +1242,6 @@ addunicode(pg_wchar c, core_yyscan_t yyscanner)
 	addlit(buf, pg_mblen(buf), yyscanner);
 }
 
-/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
-static bool
-check_uescapechar(unsigned char escape)
-{
-	if (isxdigit(escape)
-		|| escape == '+'
-		|| escape == '\''
-		|| escape == '"'
-		|| scanner_isspace(escape))
-	{
-		return false;
-	}
-	else
-		return true;
-}
-
-/* like litbufdup, but handle unicode escapes */
-static char *
-litbuf_udeescape(unsigned char escape, core_yyscan_t yyscanner)
-{
-	char	   *new;
-	char	   *litbuf,
-			   *in,
-			   *out;
-	pg_wchar	pair_first = 0;
-
-	/* Make literalbuf null-terminated to simplify the scanning loop */
-	litbuf = yyextra->literalbuf;
-	litbuf[yyextra->literallen] = '\0';
-
-	/*
-	 * This relies on the subtle assumption that a UTF-8 expansion cannot be
-	 * longer than its escaped representation.
-	 */
-	new = palloc(yyextra->literallen + 1);
-
-	in = litbuf;
-	out = new;
-	while (*in)
-	{
-		if (in[0] == escape)
-		{
-			if (in[1] == escape)
-			{
-				if (pair_first)
-				{
-					ADVANCE_YYLLOC(in - litbuf + 3);	/* 3 for U&" */
-					yyerror("invalid Unicode surrogate pair");
-				}
-				*out++ = escape;
-				in += 2;
-			}
-			else if (isxdigit((unsigned char) in[1]) &&
-					 isxdigit((unsigned char) in[2]) &&
-					 isxdigit((unsigned char) in[3]) &&
-					 isxdigit((unsigned char) in[4]))
-			{
-				pg_wchar	unicode;
-
-				unicode = (hexval(in[1]) << 12) +
-					(hexval(in[2]) << 8) +
-					(hexval(in[3]) << 4) +
-					hexval(in[4]);
-				check_unicode_value(unicode, in, yyscanner);
-				if (pair_first)
-				{
-					if (is_utf16_surrogate_second(unicode))
-					{
-						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
-						pair_first = 0;
-					}
-					else
-					{
-						ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-						yyerror("invalid Unicode surrogate pair");
-					}
-				}
-				else if (is_utf16_surrogate_second(unicode))
-					yyerror("invalid Unicode surrogate pair");
-
-				if (is_utf16_surrogate_first(unicode))
-					pair_first = unicode;
-				else
-				{
-					unicode_to_utf8(unicode, (unsigned char *) out);
-					out += pg_mblen(out);
-				}
-				in += 5;
-			}
-			else if (in[1] == '+' &&
-					 isxdigit((unsigned char) in[2]) &&
-					 isxdigit((unsigned char) in[3]) &&
-					 isxdigit((unsigned char) in[4]) &&
-					 isxdigit((unsigned char) in[5]) &&
-					 isxdigit((unsigned char) in[6]) &&
-					 isxdigit((unsigned char) in[7]))
-			{
-				pg_wchar	unicode;
-
-				unicode = (hexval(in[2]) << 20) +
-					(hexval(in[3]) << 16) +
-					(hexval(in[4]) << 12) +
-					(hexval(in[5]) << 8) +
-					(hexval(in[6]) << 4) +
-					hexval(in[7]);
-				check_unicode_value(unicode, in, yyscanner);
-				if (pair_first)
-				{
-					if (is_utf16_surrogate_second(unicode))
-					{
-						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
-						pair_first = 0;
-					}
-					else
-					{
-						ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-						yyerror("invalid Unicode surrogate pair");
-					}
-				}
-				else if (is_utf16_surrogate_second(unicode))
-					yyerror("invalid Unicode surrogate pair");
-
-				if (is_utf16_surrogate_first(unicode))
-					pair_first = unicode;
-				else
-				{
-					unicode_to_utf8(unicode, (unsigned char *) out);
-					out += pg_mblen(out);
-				}
-				in += 8;
-			}
-			else
-			{
-				ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-				yyerror("invalid Unicode escape value");
-			}
-		}
-		else
-		{
-			if (pair_first)
-			{
-				ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-				yyerror("invalid Unicode surrogate pair");
-			}
-			*out++ = *in++;
-		}
-	}
-
-	/* unfinished surrogate pair? */
-	if (pair_first)
-	{
-		ADVANCE_YYLLOC(in - litbuf + 3);				/* 3 for U&" */
-		yyerror("invalid Unicode surrogate pair");
-	}
-
-	*out = '\0';
-
-	/*
-	 * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII
-	 * codes; but it's probably not worth the trouble, since this isn't likely
-	 * to be a performance-critical path.
-	 */
-	pg_verifymbstr(new, out - new, false);
-	return new;
-}
-
 static unsigned char
 unescape_single_char(unsigned char c, core_yyscan_t yyscanner)
 {
diff --git a/src/fe_utils/psqlscan.l b/src/fe_utils/psqlscan.l
index ce20936339..eba7490078 100644
--- a/src/fe_utils/psqlscan.l
+++ b/src/fe_utils/psqlscan.l
@@ -114,12 +114,11 @@ extern void psql_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
- *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
  *  <xus> quoted string with Unicode escapes
- *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
  *
  * Note: we intentionally don't mimic the backend's <xeu> state; we have
  * no need to distinguish it from <xe> state, and no good way to get out
@@ -132,12 +131,11 @@ extern void psql_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
-%x xuiend
 %x xus
-%x xusend
 
 /*
  * In order to make the world safe for Windows and Mac clients as well as
@@ -177,19 +175,18 @@ special_whitespace		({space}+|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
+quote			'
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue	{whitespace_with_newline}{quote}
+
 /*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
  */
-quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+quotecontinuefail	{whitespace}*"-"?
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -250,21 +247,12 @@ xdstop			{dquote}
 xddouble		{dquote}{dquote}
 xdinside		[^"]+
 
-/* Unicode escapes */
-uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
-/* error rule to avoid backup */
-uescapefail		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
-
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
 
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
 
-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1		{uescapefail}?
-xustop2		{uescape}
-
 /* error rule to avoid backup */
 xufailed		[uU]&
 
@@ -438,20 +426,10 @@ other			.
 					BEGIN(xb);
 					ECHO;
 				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					ECHO;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					ECHO;
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					ECHO;
-				}
 
 {xhstart}		{
 					/* Hexadecimal bit type.
@@ -463,12 +441,6 @@ other			.
 					BEGIN(xh);
 					ECHO;
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					ECHO;
-				}
 
 {xnstart}		{
 					yyless(1);	/* eat only 'n' this time */
@@ -490,32 +462,38 @@ other			.
 					BEGIN(xus);
 					ECHO;
 				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					ECHO;
-				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
-					BEGIN(xusend);
+
+<xb,xh,xq,xe,xus>{quote} {
+					/*
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the end quote was in fact
+					 * the end of the string.  To reduce the lexer table size,
+					 * we use a single "xqs" state to do the lookahead for all
+					 * types of strings.
+					 */
+					cur_state->state_before_str_stop = YYSTATE;
+					BEGIN(xqs);
 					ECHO;
 				}
-<xusend>{whitespace} {
+<xqs>{quotecontinue} {
+					/*
+					 * Found a quote continuation, so return to the in-quote
+					 * state and continue scanning the literal.
+					 */
+					BEGIN(cur_state->state_before_str_stop);
 					ECHO;
 				}
-<xusend>{other} |
-<xusend>{xustop1} {
+<xqs>{quotecontinuefail} |
+<xqs>{other}	{
+					/*
+					 * Failed to see a quote continuation.  Throw back
+					 * everything after the end quote.
+					 */
 					yyless(0);
 					BEGIN(INITIAL);
-					ECHO;
-				}
-<xusend>{xustop2} {
-					BEGIN(INITIAL);
-					ECHO;
 				}
+
 <xq,xe,xus>{xqdouble} {
 					ECHO;
 				}
@@ -540,9 +518,6 @@ other			.
 <xe>{xehexesc}  {
 					ECHO;
 				}
-<xq,xe,xus>{quotecontinue} {
-					ECHO;
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					ECHO;
@@ -600,23 +575,10 @@ other			.
 					ECHO;
 				}
 <xui>{dquote} {
-					yyless(1);
-					BEGIN(xuiend);
-					ECHO;
-				}
-<xuiend>{whitespace} {
-					ECHO;
-				}
-<xuiend>{other} |
-<xuiend>{xustop1} {
-					yyless(0);
-					BEGIN(INITIAL);
-					ECHO;
-				}
-<xuiend>{xustop2}	{
 					BEGIN(INITIAL);
 					ECHO;
 				}
+
 <xd,xui>{xddouble}	{
 					ECHO;
 				}
@@ -1084,8 +1046,7 @@ psql_scan(PsqlScanState state,
 			switch (state->start_state)
 			{
 				case INITIAL:
-				case xuiend:	/* we treat these like INITIAL */
-				case xusend:
+				case xqs:		/* we treat this like INITIAL */
 					if (state->paren_depth > 0)
 					{
 						result = PSCAN_INCOMPLETE;
@@ -1240,7 +1201,8 @@ psql_scan_reselect_sql_lexer(PsqlScanState state)
 bool
 psql_scan_in_quote(PsqlScanState state)
 {
-	return state->start_state != INITIAL;
+	return state->start_state != INITIAL &&
+			state->start_state != xqs;
 }
 
 /*
diff --git a/src/include/fe_utils/psqlscan_int.h b/src/include/fe_utils/psqlscan_int.h
index 2acb380078..f53ccbf82e 100644
--- a/src/include/fe_utils/psqlscan_int.h
+++ b/src/include/fe_utils/psqlscan_int.h
@@ -110,6 +110,7 @@ typedef struct PsqlScanStateData
 	 * and updated with its finishing state on exit.
 	 */
 	int			start_state;	/* yylex's starting/finishing state */
+	int			state_before_str_stop;	/* start cond. before end quote */
 	int			paren_depth;	/* depth of nesting in parentheses */
 	int			xcdepth;		/* depth of nesting in slash-star comments */
 	char	   *dolqstart;		/* current $foo$ quote start string */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 00ace8425e..5893d317d8 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -416,6 +416,7 @@ PG_KEYWORD("truncate", TRUNCATE, UNRESERVED_KEYWORD)
 PG_KEYWORD("trusted", TRUSTED, UNRESERVED_KEYWORD)
 PG_KEYWORD("type", TYPE_P, UNRESERVED_KEYWORD)
 PG_KEYWORD("types", TYPES_P, UNRESERVED_KEYWORD)
+PG_KEYWORD("uescape", UESCAPE, UNRESERVED_KEYWORD)
 PG_KEYWORD("unbounded", UNBOUNDED, UNRESERVED_KEYWORD)
 PG_KEYWORD("uncommitted", UNCOMMITTED, UNRESERVED_KEYWORD)
 PG_KEYWORD("unencrypted", UNENCRYPTED, UNRESERVED_KEYWORD)
diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h
index 731a2bd264..571d5e273f 100644
--- a/src/include/parser/scanner.h
+++ b/src/include/parser/scanner.h
@@ -48,7 +48,7 @@ typedef union core_YYSTYPE
  * However, those are not defined in this file, because bison insists on
  * defining them for itself.  The token codes used by the core scanner are
  * the ASCII characters plus these:
- *	%token <str>	IDENT FCONST SCONST BCONST XCONST Op
+ *	%token <str>	IDENT UIDENT FCONST SCONST UCONST BCONST XCONST Op
  *	%token <ival>	ICONST PARAM
  *	%token			TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
  *	%token			LESS_EQUALS GREATER_EQUALS NOT_EQUALS
@@ -99,6 +99,7 @@ typedef struct core_yy_extra_type
 	int			literallen;		/* actual current string length */
 	int			literalalloc;	/* current allocated buffer size */
 
+	int			state_before_str_stop;	/* start cond. before end quote */
 	int			xcdepth;		/* depth of nesting in slash-star comments */
 	char	   *dolqstart;		/* current $foo$ quote start string */
 
diff --git a/src/include/parser/scansup.h b/src/include/parser/scansup.h
index fb2980bd17..8370982b47 100644
--- a/src/include/parser/scansup.h
+++ b/src/include/parser/scansup.h
@@ -15,6 +15,9 @@
 #ifndef SCANSUP_H
 #define SCANSUP_H
 
+#include "mb/pg_wchar.h"
+
+
 extern char *scanstr(const char *s);
 
 extern char *downcase_truncate_identifier(const char *ident, int len,
@@ -27,4 +30,35 @@ extern void truncate_identifier(char *ident, int len, bool warn);
 
 extern bool scanner_isspace(char ch);
 
+static inline unsigned int
+hexval(unsigned char c)
+{
+	if (c >= '0' && c <= '9')
+		return c - '0';
+	if (c >= 'a' && c <= 'f')
+		return c - 'a' + 0xA;
+	if (c >= 'A' && c <= 'F')
+		return c - 'A' + 0xA;
+	elog(ERROR, "invalid hexadecimal digit");
+	return 0;					/* not reached */
+}
+
+static inline bool
+is_utf16_surrogate_first(pg_wchar c)
+{
+	return (c >= 0xD800 && c <= 0xDBFF);
+}
+
+static inline bool
+is_utf16_surrogate_second(pg_wchar c)
+{
+	return (c >= 0xDC00 && c <= 0xDFFF);
+}
+
+static inline pg_wchar
+surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second)
+{
+	return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
+}
+
 #endif							/* SCANSUP_H */
diff --git a/src/interfaces/ecpg/preproc/ecpg.tokens b/src/interfaces/ecpg/preproc/ecpg.tokens
index 1d613af02f..749a9146ba 100644
--- a/src/interfaces/ecpg/preproc/ecpg.tokens
+++ b/src/interfaces/ecpg/preproc/ecpg.tokens
@@ -24,4 +24,4 @@
                 S_TYPEDEF
 
 %token CSTRING CVARIABLE CPP_LINE IP
-%token DOLCONST ECONST NCONST UCONST UIDENT
+%token DOLCONST ECONST NCONST
diff --git a/src/interfaces/ecpg/preproc/ecpg.trailer b/src/interfaces/ecpg/preproc/ecpg.trailer
index b303a9cbd0..dbf1abb5fb 100644
--- a/src/interfaces/ecpg/preproc/ecpg.trailer
+++ b/src/interfaces/ecpg/preproc/ecpg.trailer
@@ -1812,7 +1812,6 @@ ecpg_sconst:
 			$$[strlen($1)+3]='\0';
 			free($1);
 		}
-		| UCONST	{ $$ = $1; }
 		| DOLCONST	{ $$ = $1; }
 		;
 
@@ -1820,7 +1819,6 @@ ecpg_xconst:	XCONST		{ $$ = make_name(); } ;
 
 ecpg_ident:	IDENT		{ $$ = make_name(); }
 		| CSTRING	{ $$ = make3_str(mm_strdup("\""), $1, mm_strdup("\"")); }
-		| UIDENT	{ $$ = $1; }
 		;
 
 quoted_ident_stringvar: name
diff --git a/src/interfaces/ecpg/preproc/parse.pl b/src/interfaces/ecpg/preproc/parse.pl
index 3619706cdc..dc40b2974c 100644
--- a/src/interfaces/ecpg/preproc/parse.pl
+++ b/src/interfaces/ecpg/preproc/parse.pl
@@ -218,8 +218,8 @@ sub main
 				if ($a eq 'IDENT' && $prior eq '%nonassoc')
 				{
 
-					# add two more tokens to the list
-					$str = $str . "\n%nonassoc CSTRING\n%nonassoc UIDENT";
+					# add one more token to the list
+					$str = $str . "\n%nonassoc CSTRING";
 				}
 				$prior = $a;
 			}
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index dea95f4230..f0533d8407 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -232,7 +232,7 @@ static	void			check_raise_parameters(PLpgSQL_stmt_raise *stmt);
  * Some of these are not directly referenced in this file, but they must be
  * here anyway.
  */
-%token <str>	IDENT FCONST SCONST BCONST XCONST Op
+%token <str>	IDENT UIDENT FCONST SCONST UCONST BCONST XCONST Op
 %token <ival>	ICONST PARAM
 %token			TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
 %token			LESS_EQUALS GREATER_EQUALS NOT_EQUALS
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 486c00b3b3..4d48d33688 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -48,17 +48,40 @@ SELECT 'tricky' AS U&"\" UESCAPE '!';
 (1 row)
 
 SELECT U&'wrong: \061';
-ERROR:  invalid Unicode escape value at or near "\061'"
+ERROR:  invalid Unicode escape value
 LINE 1: SELECT U&'wrong: \061';
                          ^
 SELECT U&'wrong: \+0061';
-ERROR:  invalid Unicode escape value at or near "\+0061'"
+ERROR:  invalid Unicode escape value
 LINE 1: SELECT U&'wrong: \+0061';
                          ^
 SELECT U&'wrong: +0061' UESCAPE '+';
-ERROR:  invalid Unicode escape character at or near "+'"
+ERROR:  invalid Unicode escape character "+"
 LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
                                          ^
+-- handling of Unicode surrogate pairs
+SELECT U&'\d83d\de04\d83d\dc36' as correct_in_utf8;
+ correct_in_utf8 
+-----------------
+ 😄🐶
+(1 row)
+
+SELECT U&'\d83d\d83d'; -- 2 high surrogates in a row
+ERROR:  invalid Unicode surrogate pair
+LINE 1: SELECT U&'\d83d\d83d';
+                       ^
+SELECT U&'\de04\d83d'; -- surrogates in wrong order
+ERROR:  invalid Unicode surrogate pair
+LINE 1: SELECT U&'\de04\d83d';
+                  ^
+SELECT U&'\d83dX'; -- orphan high surrogate
+ERROR:  invalid Unicode surrogate pair
+LINE 1: SELECT U&'\d83dX';
+                       ^
+SELECT U&'\de04X'; -- orphan low surrogate
+ERROR:  invalid Unicode surrogate pair
+LINE 1: SELECT U&'\de04X';
+                  ^
 SET standard_conforming_strings TO off;
 SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
 ERROR:  unsafe use of string constant with Unicode escapes
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index 5744c9f800..9d81096aa7 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -29,6 +29,13 @@ SELECT U&'wrong: \061';
 SELECT U&'wrong: \+0061';
 SELECT U&'wrong: +0061' UESCAPE '+';
 
+-- handling of Unicode surrogate pairs
+SELECT U&'\d83d\de04\d83d\dc36' as correct_in_utf8;
+SELECT U&'\d83d\d83d'; -- 2 high surrogates in a row
+SELECT U&'\de04\d83d'; -- surrogates in wrong order
+SELECT U&'\d83dX'; -- orphan high surrogate
+SELECT U&'\de04X'; -- orphan low surrogate
+
 SET standard_conforming_strings TO off;
 
 SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
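
For readers following the new surrogate-pair tests, here is a minimal standalone sketch of the arithmetic involved. The range checks and the combining formula mirror is_utf16_surrogate_first(), is_utf16_surrogate_second() and surrogate_pair_to_codepoint() as they appear in scan.l. The decode_pair() wrapper and its "return 0 on error" convention are assumptions made for illustration and are not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same ranges as is_utf16_surrogate_first/second in scan.l */
bool
is_high_surrogate(uint32_t c)
{
	return c >= 0xD800 && c <= 0xDBFF;
}

bool
is_low_surrogate(uint32_t c)
{
	return c >= 0xDC00 && c <= 0xDFFF;
}

/*
 * Hypothetical wrapper: combine a high/low surrogate pair into one code
 * point, or return 0 if the two values do not form a valid pair.
 */
uint32_t
decode_pair(uint32_t first, uint32_t second)
{
	if (!is_high_surrogate(first) || !is_low_surrogate(second))
		return 0;

	/* Same formula as surrogate_pair_to_codepoint() */
	return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
}
```

With these definitions, decode_pair(0xD83D, 0xDE04) gives 0x1F604 and decode_pair(0xD83D, 0xDC36) gives 0x1F436, the two characters in the correct_in_utf8 test. Two high surrogates in a row, a pair in the wrong order, or a surrogate followed by an ordinary character all yield 0 here, corresponding to the "invalid Unicode surrogate pair" cases.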

#19

Thomas Munro

thomas.munro@gmail.com

over 6 years ago

In reply to: John Naylor (#18)

Re: benchmarking Flex practices

On Thu, Aug 1, 2019 at 8:51 PM John Naylor <john.naylor@2ndquadrant.com> wrote:

> select U&'\de04\d83d'; -- surrogates in wrong order
> -psql:test_unicode.sql:10: ERROR:  invalid Unicode surrogate pair at or near "U&'\de04\d83d'"
> +psql:test_unicode.sql:10: ERROR:  invalid Unicode surrogate pair
> LINE 1: select U&'\de04\d83d';
> -               ^
> +                  ^
> select U&'\de04X'; -- orphan low surrogate
> -psql:test_unicode.sql:12: ERROR:  invalid Unicode surrogate pair at or near "U&'\de04X'"
> +psql:test_unicode.sql:12: ERROR:  invalid Unicode surrogate pair
> LINE 1: select U&'\de04X';
> -               ^
> +                  ^

While moving this to the September CF, I noticed this failure on Windows:

+ERROR: Unicode escape values cannot be used for code point values
above 007F when the server encoding is not UTF8
LINE 1: SELECT U&'\d83d\d83d';
^

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.50382

--
Thomas Munro
https://enterprisedb.com

#20

Alvaro Herrera

alvherre@2ndquadrant.com

over 6 years ago

In reply to: John Naylor (#18)

Re: benchmarking Flex practices

... it seems this patch needs attention, but I'm not sure from whom.
The tests don't pass whenever the server encoding is not UTF8, so I
suppose we should either have an alternate expected output file to
account for that, or the tests should be removed. But anyway the code
needs to be reviewed.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#21

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: Alvaro Herrera (#20)

Re: benchmarking Flex practices

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

> ... it seems this patch needs attention, but I'm not sure from whom.
> The tests don't pass whenever the server encoding is not UTF8, so I
> suppose we should either have an alternate expected output file to
> account for that, or the tests should be removed. But anyway the code
> needs to be reviewed.

Yeah, I'm overdue to review it, but other things have taken precedence.

The unportable test is not a problem at this point, since the patch
isn't finished anyway. I'm not sure yet whether it'd be worth
preserving that test case in the final version.

regards, tom lane

#22

Tom Lane

tgl@sss.pgh.pa.us

about 6 years ago

In reply to: John Naylor (#18)

1 attachment(s)

Re: benchmarking Flex practices

[ My apologies for being so slow to get back to this ]

John Naylor <john.naylor@2ndquadrant.com> writes:

> Now that I think of it, the regression in v7 was largely due to the
> fact that the parser has to call the lexer 3 times per string in this
> case, and that's going to be slower no matter what we do.

Ah, of course. I'm not too fussed about the performance of queries with
an explicit UESCAPE clause, as that seems like a very minority use-case.
What we do want to pay attention to is not regressing for plain
identifiers/strings, and to a lesser extent the U& cases without UESCAPE.

> Inlining hexval() and friends seems to have helped somewhat for
> unicode escapes, but I'd have to profile to improve that further.
> However, v8 has regressed from v7 enough with both simple strings and
> the information schema that it's a noticeable regression from HEAD.
> I'm guessing getting rid of the "Uescape" production is to blame, but
> I haven't tried reverting just that one piece. Since inlining the
> rules didn't seem to help with the precedence hacks, it seems like the
> separate production was a better way. Thoughts?

I have duplicated your performance tests here, and get more or less
the same results (see below). I agree that the performance of the
v8 patch isn't really where we want to be --- and it also seems
rather invasive to gram.y, and hence error-prone. (If we do it
like that, I bet my bottom dollar that somebody would soon commit
a patch that adds a production using IDENT not Ident, and it'd take
a long time to notice.)

It struck me though that there's another solution we haven't discussed,
and that's to make the token lookahead filter in parser.c do the work
of converting UIDENT [UESCAPE SCONST] to IDENT, and similarly for the
string case. I pursued that to the extent of developing the attached
incomplete patch ("v9"), which looks reasonable from a performance
standpoint. I get these results with tests using the drive_parser
function:

information_schema

HEAD 3447.674 ms, 3433.498 ms, 3422.407 ms
v6 3381.851 ms, 3442.478 ms, 3402.629 ms
v7 3525.865 ms, 3441.038 ms, 3473.488 ms
v8 3567.640 ms, 3488.417 ms, 3556.544 ms
v9 3456.360 ms, 3403.635 ms, 3418.787 ms

pgbench str

HEAD 4414.046 ms, 4376.222 ms, 4356.468 ms
v6 4304.582 ms, 4245.534 ms, 4263.562 ms
v7 4395.815 ms, 4398.381 ms, 4460.304 ms
v8 4475.706 ms, 4466.665 ms, 4471.048 ms
v9 4392.473 ms, 4316.549 ms, 4318.472 ms

pgbench unicode

HEAD 4959.000 ms, 4921.751 ms, 4945.069 ms
v6 4856.998 ms, 4802.996 ms, 4855.486 ms
v7 5057.199 ms, 4948.342 ms, 4956.614 ms
v8 5008.090 ms, 4963.641 ms, 4983.576 ms
v9 4809.227 ms, 4767.355 ms, 4741.641 ms

pgbench uesc

HEAD 5114.401 ms, 5235.764 ms, 5200.567 ms
v6 5030.156 ms, 5083.398 ms, 4986.974 ms
v7 5915.508 ms, 5953.135 ms, 5929.775 ms
v8 5678.810 ms, 5665.239 ms, 5645.696 ms
v9 5648.965 ms, 5601.592 ms, 5600.480 ms

(A note about what we're looking at: on my machine, after using cpupower
to lock down the CPU frequency, and taskset to bind everything to one
CPU socket, I can get numbers that are very repeatable, to 0.1% or so
... until I restart the postmaster, and then I get different but equally
repeatable numbers. The difference can be several percent, which is a lot
of noise compared to what we're looking for. I believe the explanation is
that kernel ASLR has loaded the backend executable at some different
addresses and so there are different cache-line-boundary effects. While
I could lock that down too by disabling ASLR, the result would be to
overemphasize chance effects of a particular set of cache line boundaries.
So I prefer to run all the tests over again after restarting the
postmaster, a few times, and then look at the overall set of results to
see what things look like. Each number quoted above is median-of-three
tests within a single postmaster run.)
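
As an aid to reading the tables above, the following sketch shows one way to summarize them: each quoted figure is a median of three runs, and the three figures per version come from separate postmaster restarts. The function names median3(), mean3() and pct_change() are made up for this illustration. Averaging across restarts is only one possible summary and is an assumption here, not something the message prescribes.

```c
/* Median of three timings, i.e. how each quoted figure was obtained */
double
median3(double a, double b, double c)
{
	if ((a <= b && b <= c) || (c <= b && b <= a))
		return b;
	if ((b <= a && a <= c) || (c <= a && a <= b))
		return a;
	return c;
}

/* Mean of the three per-restart figures for one version */
double
mean3(double a, double b, double c)
{
	return (a + b + c) / 3.0;
}

/* Relative change of x versus base, in percent (negative = faster) */
double
pct_change(double base, double x)
{
	return (x - base) / base * 100.0;
}
```

Applied to the "pgbench unicode" rows, mean3() gives about 4941.9 ms for HEAD and about 4772.7 ms for v9, a change of roughly -3.4%. Since restarts alone can move the numbers by several percent, a difference of that size should be read as "no slower than HEAD" rather than as a precise speedup.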

Anyway, my conclusion is that the attached patch is at least as fast
as today's HEAD; it's not as fast as v6, but on the other hand it's
an even smaller postmaster executable, so there's something to be said
for that:

$ size postg*
   text    data     bss     dec     hex filename
7478138   57928  203360 7739426  761822 postgres.head
7271218   57928  203360 7532506  72efda postgres.v6
7275810   57928  203360 7537098  7301ca postgres.v7
7276978   57928  203360 7538266  73065a postgres.v8
7266274   57928  203360 7527562  72dc8a postgres.v9

I based this on your v7 not v8; not sure if there's anything you
want to salvage from v8.

Generally, I'm pretty happy with this approach: it touches gram.y
hardly at all, and it removes just about all of the complexity from
scan.l. I'm happier about dropping the support code into parser.c
than the other choices we've discussed.
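
The shape of the lookahead-filter idea can be shown with a toy model. This is a sketch under simplifying assumptions, not the code in the attached patch: the token codes, the Filter struct and the array-backed core_next() are all hypothetical, and the de-escaping of the string value (str_udeescape() in the patch) is left out. Only the token-collapsing logic is shown.

```c
#include <stddef.h>

/* Hypothetical token codes; the real ones are generated from gram.y */
typedef enum
{
	T_EOF, T_IDENT, T_UIDENT, T_SCONST, T_UCONST, T_UESCAPE, T_OTHER
} Tok;

/* Toy filter state: an array-backed "core lexer" plus one saved token */
typedef struct
{
	const Tok  *toks;			/* input, must end with T_EOF */
	size_t		pos;
	int			have_lookahead;
	Tok			lookahead;
} Filter;

/* Stand-in for core_yylex(): hand out the next raw token */
static Tok
core_next(Filter *f)
{
	Tok			t = f->toks[f->pos];

	if (t != T_EOF)
		f->pos++;
	return t;
}

/* Return the next token as the grammar would see it, or -1 on error */
int
filter_next(Filter *f)
{
	Tok			cur,
				next;

	if (f->have_lookahead)
	{
		cur = f->lookahead;
		f->have_lookahead = 0;
	}
	else
		cur = core_next(f);

	/* Only UIDENT and UCONST need lookahead in this toy model */
	if (cur != T_UIDENT && cur != T_UCONST)
		return cur;

	next = core_next(f);
	if (next == T_UESCAPE)
	{
		/* The third token must be a plain string literal */
		if (core_next(f) != T_SCONST)
			return -1;
		/* All three tokens are consumed; nothing is saved */
	}
	else
	{
		/* Not UESCAPE: keep it for the next call */
		f->lookahead = next;
		f->have_lookahead = 1;
	}

	return (cur == T_UIDENT) ? T_IDENT : T_SCONST;
}
```

Given the raw stream UIDENT UESCAPE SCONST OTHER UCONST OTHER, successive calls to filter_next() return IDENT, OTHER, SCONST, OTHER, so the grammar never sees UIDENT, UCONST or the UESCAPE clause. That is why gram.y needs almost no changes with this approach.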

There's still undone work here, though:

* I did not touch psql. Probably your patch is fine for that.

* I did not do more with ecpg than get it to compile, using the
same hacks as in your v7. It still fails its regression tests,
but now the reason is that what we've done in parser/parser.c
needs to be transposed into the identical functionality in
ecpg/preproc/parser.c. Or at least some kind of functionality
there. A problem with this approach is that it presumes we can
reduce a UIDENT sequence to a plain IDENT, but to do so we need
assumptions about the target encoding, and I'm not sure that
ecpg should make any such assumptions. Maybe ecpg should just
reject all cases that produce non-ASCII identifiers? (Probably
it could be made to do something smarter with more work, but
it's not clear to me that it's worth the trouble.)

* I haven't convinced myself either way as to whether it'd be
better to factor out the code duplicated between the UIDENT
and UCONST cases in base_yylex.
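
On the ecpg point, the "reject non-ASCII results" option could look roughly like the check below. This is purely a hypothetical sketch: codepoints_all_ascii() does not exist in ecpg, and it assumes the escapes have already been decoded into an array of code points. The point is that code points up to 0x7F need no knowledge of the target encoding, so accepting only those sidesteps the encoding question.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if every decoded code point is 7-bit ASCII,
 * in which case the identifier can be emitted without knowing the
 * target encoding.
 */
bool
codepoints_all_ascii(const uint32_t *cps, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if (cps[i] > 0x7F)
			return false;
	}
	return true;
}
```

For example, the code points of U&"d\0061t\+000061" are 0x64, 0x61, 0x74, 0x61 ("data"), which would pass, while any identifier containing an escape such as \00E9 would be rejected.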

If this seems like a reasonable approach to you, please fill in
the missing psql and ecpg bits.

regards, tom lane

Attachments:

v9-draft-handle-uescapes-in-parser.patch (text/x-diff; charset=us-ascii)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c508684..1f10340 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -601,7 +601,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
  * DOT_DOT is unused in the core SQL grammar, and so will always provoke
  * parse errors.  It is needed by PL/pgSQL.
  */
-%token <str>	IDENT FCONST SCONST BCONST XCONST Op
+%token <str>	IDENT UIDENT FCONST SCONST UCONST BCONST XCONST Op
 %token <ival>	ICONST PARAM
 %token			TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
 %token			LESS_EQUALS GREATER_EQUALS NOT_EQUALS
@@ -691,7 +691,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	TREAT TRIGGER TRIM TRUE_P
 	TRUNCATE TRUSTED TYPE_P TYPES_P
 
-	UNBOUNDED UNCOMMITTED UNENCRYPTED UNION UNIQUE UNKNOWN UNLISTEN UNLOGGED
+	UESCAPE UNBOUNDED UNCOMMITTED UNENCRYPTED UNION UNIQUE UNKNOWN UNLISTEN UNLOGGED
 	UNTIL UPDATE USER USING
 
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
@@ -15374,6 +15374,7 @@ unreserved_keyword:
 			| TRUSTED
 			| TYPE_P
 			| TYPES_P
+			| UESCAPE
 			| UNBOUNDED
 			| UNCOMMITTED
 			| UNENCRYPTED
diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index 4c0c258..e64f701 100644
--- a/src/backend/parser/parser.c
+++ b/src/backend/parser/parser.c
@@ -23,6 +23,12 @@
 
 #include "parser/gramparse.h"
 #include "parser/parser.h"
+#include "parser/scansup.h"
+#include "mb/pg_wchar.h"
+
+static bool check_uescapechar(unsigned char escape);
+static char *str_udeescape(char escape, char *str, int position,
+						   core_yyscan_t yyscanner);
 
 
 /*
@@ -75,6 +81,10 @@ raw_parser(const char *str)
  * scanner backtrack, which would cost more performance than this filter
  * layer does.
  *
+ * We also use this filter to convert UIDENT and UCONST sequences into
+ * plain IDENT and SCONST tokens.  While that could be handled by additional
+ * productions in the main grammar, it's more efficient to do it like this.
+ *
  * The filter also provides a convenient place to translate between
  * the core_YYSTYPE and YYSTYPE representations (which are really the
  * same thing anyway, but notationally they're different).
@@ -104,7 +114,7 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 	 * If this token isn't one that requires lookahead, just return it.  If it
 	 * does, determine the token length.  (We could get that via strlen(), but
 	 * since we have such a small set of possibilities, hardwiring seems
-	 * feasible and more efficient.)
+	 * feasible and more efficient --- at least for the fixed-length cases.)
 	 */
 	switch (cur_token)
 	{
@@ -117,6 +127,10 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 		case WITH:
 			cur_token_length = 4;
 			break;
+		case UIDENT:
+		case UCONST:
+			cur_token_length = strlen(yyextra->core_yy_extra.scanbuf + *llocp);
+			break;
 		default:
 			return cur_token;
 	}
@@ -190,7 +204,311 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 					break;
 			}
 			break;
+
+		case UIDENT:
+			/* Look ahead for UESCAPE */
+			if (next_token == UESCAPE)
+			{
+				/* Yup, so get third token, which had better be SCONST */
+				const char *escstr;
+
+				/* Again save and restore *llocp */
+				cur_yylloc = *llocp;
+
+				/* Get third token */
+				next_token = core_yylex(&(yyextra->lookahead_yylval),
+										llocp, yyscanner);
+
+				/* If we throw error here, it will point to third token */
+				if (next_token != SCONST)
+					scanner_yyerror("UESCAPE must be followed by a simple string literal",
+									yyscanner);
+
+				escstr = yyextra->lookahead_yylval.str;
+				if (strlen(escstr) != 1 || !check_uescapechar(escstr[0]))
+					scanner_yyerror("invalid Unicode escape character",
+									yyscanner);
+
+				/* Now restore *llocp; errors will point to first token */
+				*llocp = cur_yylloc;
+
+				/* Apply Unicode conversion */
+				lvalp->core_yystype.str =
+					str_udeescape(escstr[0],
+								  lvalp->core_yystype.str,
+								  *llocp,
+								  yyscanner);
+
+				/*
+				 * We don't need to un-revert truncation of UESCAPE.  What we
+				 * do want to do is clear have_lookahead, thereby consuming
+				 * all three tokens.
+				 */
+				yyextra->have_lookahead = false;
+			}
+			else
+			{
+				/* No UESCAPE, so convert using default escape character */
+				lvalp->core_yystype.str =
+					str_udeescape('\\',
+								  lvalp->core_yystype.str,
+								  *llocp,
+								  yyscanner);
+			}
+			/* It's an identifier, so truncate as appropriate */
+			truncate_identifier(lvalp->core_yystype.str,
+								strlen(lvalp->core_yystype.str),
+								true);
+			cur_token = IDENT;
+			break;
+
+		case UCONST:
+			/* Look ahead for UESCAPE */
+			if (next_token == UESCAPE)
+			{
+				/* Yup, so get third token, which had better be SCONST */
+				const char *escstr;
+
+				/* Again save and restore *llocp */
+				cur_yylloc = *llocp;
+
+				/* Get third token */
+				next_token = core_yylex(&(yyextra->lookahead_yylval),
+										llocp, yyscanner);
+
+				/* If we throw error here, it will point to third token */
+				if (next_token != SCONST)
+					scanner_yyerror("UESCAPE must be followed by a simple string literal",
+									yyscanner);
+
+				escstr = yyextra->lookahead_yylval.str;
+				if (strlen(escstr) != 1 || !check_uescapechar(escstr[0]))
+					scanner_yyerror("invalid Unicode escape character",
+									yyscanner);
+
+				/* Now restore *llocp; errors will point to first token */
+				*llocp = cur_yylloc;
+
+				/* Apply Unicode conversion */
+				lvalp->core_yystype.str =
+					str_udeescape(escstr[0],
+								  lvalp->core_yystype.str,
+								  *llocp,
+								  yyscanner);
+
+				/*
+				 * We don't need to un-revert truncation of UESCAPE.  What we
+				 * do want to do is clear have_lookahead, thereby consuming
+				 * all three tokens.
+				 */
+				yyextra->have_lookahead = false;
+			}
+			else
+			{
+				/* No UESCAPE, so convert using default escape character */
+				lvalp->core_yystype.str =
+					str_udeescape('\\',
+								  lvalp->core_yystype.str,
+								  *llocp,
+								  yyscanner);
+			}
+			cur_token = SCONST;
+			break;
 	}
 
 	return cur_token;
 }
+
+/* convert hex digit (caller should have verified that) to value */
+static unsigned int
+hexval(unsigned char c)
+{
+	if (c >= '0' && c <= '9')
+		return c - '0';
+	if (c >= 'a' && c <= 'f')
+		return c - 'a' + 0xA;
+	if (c >= 'A' && c <= 'F')
+		return c - 'A' + 0xA;
+	elog(ERROR, "invalid hexadecimal digit");
+	return 0;					/* not reached */
+}
+
+/* is Unicode code point acceptable in database's encoding? */
+static void
+check_unicode_value(pg_wchar c, int pos, core_yyscan_t yyscanner)
+{
+	/* See also addunicode() in scan.l */
+	if (c == 0 || c > 0x10FFFF)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("invalid Unicode escape value"),
+				 scanner_errposition(pos, yyscanner)));
+
+	if (c > 0x7F && GetDatabaseEncoding() != PG_UTF8)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8"),
+				 scanner_errposition(pos, yyscanner)));
+}
+
+/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
+static bool
+check_uescapechar(unsigned char escape)
+{
+	if (isxdigit(escape)
+		|| escape == '+'
+		|| escape == '\''
+		|| escape == '"'
+		|| scanner_isspace(escape))
+		return false;
+	else
+		return true;
+}
+
+/* Process Unicode escapes in "str", producing a palloc'd plain string */
+static char *
+str_udeescape(char escape, char *str, int position,
+			  core_yyscan_t yyscanner)
+{
+	char	   *new,
+			   *in,
+			   *out;
+	int			str_length;
+	pg_wchar	pair_first = 0;
+
+	str_length = strlen(str);
+
+	/*
+	 * This relies on the subtle assumption that a UTF-8 expansion cannot be
+	 * longer than its escaped representation.
+	 */
+	new = palloc(str_length + 1);
+
+	in = str;
+	out = new;
+	while (*in)
+	{
+		if (in[0] == escape)
+		{
+			if (in[1] == escape)
+			{
+				if (pair_first)
+					goto invalid_pair;
+				*out++ = escape;
+				in += 2;
+			}
+			else if (isxdigit((unsigned char) in[1]) &&
+					 isxdigit((unsigned char) in[2]) &&
+					 isxdigit((unsigned char) in[3]) &&
+					 isxdigit((unsigned char) in[4]))
+			{
+				pg_wchar	unicode;
+
+				unicode = (hexval(in[1]) << 12) +
+					(hexval(in[2]) << 8) +
+					(hexval(in[3]) << 4) +
+					hexval(in[4]);
+				check_unicode_value(unicode,
+									position + in - str + 3,	/* 3 for U&" */
+									yyscanner);
+				if (pair_first)
+				{
+					if (is_utf16_surrogate_second(unicode))
+					{
+						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+						pair_first = 0;
+					}
+					else
+						goto invalid_pair;
+				}
+				else if (is_utf16_surrogate_second(unicode))
+					goto invalid_pair;
+
+				if (is_utf16_surrogate_first(unicode))
+					pair_first = unicode;
+				else
+				{
+					unicode_to_utf8(unicode, (unsigned char *) out);
+					out += pg_mblen(out);
+				}
+				in += 5;
+			}
+			else if (in[1] == '+' &&
+					 isxdigit((unsigned char) in[2]) &&
+					 isxdigit((unsigned char) in[3]) &&
+					 isxdigit((unsigned char) in[4]) &&
+					 isxdigit((unsigned char) in[5]) &&
+					 isxdigit((unsigned char) in[6]) &&
+					 isxdigit((unsigned char) in[7]))
+			{
+				pg_wchar	unicode;
+
+				unicode = (hexval(in[2]) << 20) +
+					(hexval(in[3]) << 16) +
+					(hexval(in[4]) << 12) +
+					(hexval(in[5]) << 8) +
+					(hexval(in[6]) << 4) +
+					hexval(in[7]);
+				check_unicode_value(unicode,
+									position + in - str + 3,	/* 3 for U&" */
+									yyscanner);
+				if (pair_first)
+				{
+					if (is_utf16_surrogate_second(unicode))
+					{
+						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+						pair_first = 0;
+					}
+					else
+						goto invalid_pair;
+				}
+				else if (is_utf16_surrogate_second(unicode))
+					goto invalid_pair;
+
+				if (is_utf16_surrogate_first(unicode))
+					pair_first = unicode;
+				else
+				{
+					unicode_to_utf8(unicode, (unsigned char *) out);
+					out += pg_mblen(out);
+				}
+				in += 8;
+			}
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("invalid Unicode escape value"),
+						 scanner_errposition(position + in - str + 3,	/* 3 for U&" */
+											 yyscanner)));
+		}
+		else
+		{
+			if (pair_first)
+				goto invalid_pair;
+
+			*out++ = *in++;
+		}
+	}
+
+	/* unfinished surrogate pair? */
+	if (pair_first)
+		goto invalid_pair;
+
+	*out = '\0';
+
+	/*
+	 * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII
+	 * codes; but it's probably not worth the trouble, since this isn't likely
+	 * to be a performance-critical path.
+	 */
+	pg_verifymbstr(new, out - new, false);
+	return new;
+
+invalid_pair:
+	ereport(ERROR,
+			(errcode(ERRCODE_SYNTAX_ERROR),
+			 errmsg("invalid Unicode surrogate pair"),
+			 scanner_errposition(position + in - str + 3,	/* 3 for U&" */
+								 yyscanner)));
+	return NULL;				/* keep compiler quiet */
+}
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index e1cae85..a96af2c 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -110,14 +110,9 @@ const uint16 ScanKeywordTokens[] = {
 static void addlit(char *ytext, int yleng, core_yyscan_t yyscanner);
 static void addlitchar(unsigned char ychar, core_yyscan_t yyscanner);
 static char *litbufdup(core_yyscan_t yyscanner);
-static char *litbuf_udeescape(unsigned char escape, core_yyscan_t yyscanner);
 static unsigned char unescape_single_char(unsigned char c, core_yyscan_t yyscanner);
 static int	process_integer_literal(const char *token, YYSTYPE *lval);
-static bool is_utf16_surrogate_first(pg_wchar c);
-static bool is_utf16_surrogate_second(pg_wchar c);
-static pg_wchar surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second);
 static void addunicode(pg_wchar c, yyscan_t yyscanner);
-static bool check_uescapechar(unsigned char escape);
 
 #define yyerror(msg)  scanner_yyerror(msg, yyscanner)
 
@@ -168,12 +163,11 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
- *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
  *  <xus> quoted string with Unicode escapes
- *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
  *  <xeu> Unicode surrogate pair in extended quoted string
  *
  * Remember to add an <<EOF>> case whenever you add a new exclusive state!
@@ -185,12 +179,11 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
-%x xuiend
 %x xus
-%x xusend
 %x xeu
 
 /*
@@ -231,19 +224,18 @@ special_whitespace		({space}+|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
+quote			'
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue	{whitespace_with_newline}{quote}
+
 /*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
  */
-quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+quotecontinuefail	{whitespace}*"-"?
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -304,21 +296,12 @@ xdstop			{dquote}
 xddouble		{dquote}{dquote}
 xdinside		[^"]+
 
-/* Unicode escapes */
-uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
-/* error rule to avoid backup */
-uescapefail		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
-
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
 
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
 
-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1		{uescapefail}?
-xustop2		{uescape}
-
 /* error rule to avoid backup */
 xufailed		[uU]&
 
@@ -476,21 +459,10 @@ other			.
 					startlit();
 					addlitchar('b', yyscanner);
 				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return BCONST;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					addlit(yytext, yyleng, yyscanner);
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					/* ignore */
-				}
 <xb><<EOF>>		{ yyerror("unterminated bit string literal"); }
 
 {xhstart}		{
@@ -505,13 +477,6 @@ other			.
 					startlit();
 					addlitchar('x', yyscanner);
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return XCONST;
-				}
 <xh><<EOF>>		{ yyerror("unterminated hexadecimal string literal"); }
 
 {xnstart}		{
@@ -568,53 +533,67 @@ other			.
 					BEGIN(xus);
 					startlit();
 				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
+
+<xb,xh,xq,xe,xus>{quote} {
 					/*
-					 * check that the data remains valid if it might have been
-					 * made invalid by unescaping any chars.
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the end quote was in fact
+					 * the end of the string.  To reduce the lexer table size,
+					 * we use a single "xqs" state to do the lookahead for all
+					 * types of strings.
 					 */
-					if (yyextra->saw_non_ascii)
-						pg_verifymbstr(yyextra->literalbuf,
-									   yyextra->literallen,
-									   false);
-					yylval->str = litbufdup(yyscanner);
-					return SCONST;
-				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
-					/* xusend state looks for possible UESCAPE */
-					BEGIN(xusend);
+					yyextra->state_before_str_stop = YYSTATE;
+					BEGIN(xqs);
 				}
-<xusend>{whitespace} {
-					/* stay in xusend state over whitespace */
+<xqs>{quotecontinue} {
+					/*
+					 * Found a quote continuation, so return to the in-quote
+					 * state and continue scanning the literal.
+					 */
+					BEGIN(yyextra->state_before_str_stop);
 				}
-<xusend><<EOF>> |
-<xusend>{other} |
-<xusend>{xustop1} {
-					/* no UESCAPE after the quote, throw back everything */
+<xqs>{quotecontinuefail} |
+<xqs><<EOF>> |
+<xqs>{other}	{
+					/*
+					 * Failed to see a quote continuation.  Throw back
+					 * everything after the end quote, and handle the string
+					 * according to the state we were in previously.
+					 */
 					yyless(0);
 					BEGIN(INITIAL);
-					yylval->str = litbuf_udeescape('\\', yyscanner);
-					return SCONST;
-				}
-<xusend>{xustop2} {
-					/* found UESCAPE after the end quote */
-					BEGIN(INITIAL);
-					if (!check_uescapechar(yytext[yyleng - 2]))
+
+					switch (yyextra->state_before_str_stop)
 					{
-						SET_YYLLOC();
-						ADVANCE_YYLLOC(yyleng - 2);
-						yyerror("invalid Unicode escape character");
+						case xb:
+							yylval->str = litbufdup(yyscanner);
+							return BCONST;
+						case xh:
+							yylval->str = litbufdup(yyscanner);
+							return XCONST;
+						case xq:
+							/* fallthrough */
+						case xe:
+							/*
+							 * Check that the data remains valid if it
+							 * might have been made invalid by unescaping
+							 * any chars.
+							 */
+							if (yyextra->saw_non_ascii)
+								pg_verifymbstr(yyextra->literalbuf,
+											   yyextra->literallen,
+											   false);
+							yylval->str = litbufdup(yyscanner);
+							return SCONST;
+						case xus:
+							yylval->str = litbufdup(yyscanner);
+							return UCONST;
+						default:
+							yyerror("unhandled previous state in xqs");
 					}
-					yylval->str = litbuf_udeescape(yytext[yyleng - 2],
-												   yyscanner);
-					return SCONST;
 				}
+
 <xq,xe,xus>{xqdouble} {
 					addlitchar('\'', yyscanner);
 				}
@@ -693,9 +672,6 @@ other			.
 					if (c == '\0' || IS_HIGHBIT_SET(c))
 						yyextra->saw_non_ascii = true;
 				}
-<xq,xe,xus>{quotecontinue} {
-					/* ignore */
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					addlitchar(yytext[0], yyscanner);
@@ -770,53 +746,14 @@ other			.
 					return IDENT;
 				}
 <xui>{dquote} {
-					yyless(1);
-					/* xuiend state looks for possible UESCAPE */
-					BEGIN(xuiend);
-				}
-<xuiend>{whitespace} {
-					/* stay in xuiend state over whitespace */
-				}
-<xuiend><<EOF>> |
-<xuiend>{other} |
-<xuiend>{xustop1} {
-					/* no UESCAPE after the quote, throw back everything */
-					char	   *ident;
-					int			identlen;
-
-					yyless(0);
-
-					BEGIN(INITIAL);
 					if (yyextra->literallen == 0)
 						yyerror("zero-length delimited identifier");
-					ident = litbuf_udeescape('\\', yyscanner);
-					identlen = strlen(ident);
-					if (identlen >= NAMEDATALEN)
-						truncate_identifier(ident, identlen, true);
-					yylval->str = ident;
-					return IDENT;
-				}
-<xuiend>{xustop2}	{
-					/* found UESCAPE after the end quote */
-					char	   *ident;
-					int			identlen;
 
 					BEGIN(INITIAL);
-					if (yyextra->literallen == 0)
-						yyerror("zero-length delimited identifier");
-					if (!check_uescapechar(yytext[yyleng - 2]))
-					{
-						SET_YYLLOC();
-						ADVANCE_YYLLOC(yyleng - 2);
-						yyerror("invalid Unicode escape character");
-					}
-					ident = litbuf_udeescape(yytext[yyleng - 2], yyscanner);
-					identlen = strlen(ident);
-					if (identlen >= NAMEDATALEN)
-						truncate_identifier(ident, identlen, true);
-					yylval->str = ident;
-					return IDENT;
+					yylval->str = litbufdup(yyscanner);
+					return UIDENT;
 				}
+
 <xd,xui>{xddouble}	{
 					addlitchar('"', yyscanner);
 				}
@@ -1288,55 +1225,12 @@ process_integer_literal(const char *token, YYSTYPE *lval)
 	return ICONST;
 }
 
-static unsigned int
-hexval(unsigned char c)
-{
-	if (c >= '0' && c <= '9')
-		return c - '0';
-	if (c >= 'a' && c <= 'f')
-		return c - 'a' + 0xA;
-	if (c >= 'A' && c <= 'F')
-		return c - 'A' + 0xA;
-	elog(ERROR, "invalid hexadecimal digit");
-	return 0;					/* not reached */
-}
-
-static void
-check_unicode_value(pg_wchar c, char *loc, core_yyscan_t yyscanner)
-{
-	if (GetDatabaseEncoding() == PG_UTF8)
-		return;
-
-	if (c > 0x7F)
-	{
-		ADVANCE_YYLLOC(loc - yyextra->literalbuf + 3);	/* 3 for U&" */
-		yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8");
-	}
-}
-
-static bool
-is_utf16_surrogate_first(pg_wchar c)
-{
-	return (c >= 0xD800 && c <= 0xDBFF);
-}
-
-static bool
-is_utf16_surrogate_second(pg_wchar c)
-{
-	return (c >= 0xDC00 && c <= 0xDFFF);
-}
-
-static pg_wchar
-surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second)
-{
-	return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
-}
-
 static void
 addunicode(pg_wchar c, core_yyscan_t yyscanner)
 {
 	char		buf[8];
 
+	/* See also check_unicode_value() in parser.c */
 	if (c == 0 || c > 0x10FFFF)
 		yyerror("invalid Unicode escape value");
 	if (c > 0x7F)
@@ -1349,172 +1243,6 @@ addunicode(pg_wchar c, core_yyscan_t yyscanner)
 	addlit(buf, pg_mblen(buf), yyscanner);
 }
 
-/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
-static bool
-check_uescapechar(unsigned char escape)
-{
-	if (isxdigit(escape)
-		|| escape == '+'
-		|| escape == '\''
-		|| escape == '"'
-		|| scanner_isspace(escape))
-	{
-		return false;
-	}
-	else
-		return true;
-}
-
-/* like litbufdup, but handle unicode escapes */
-static char *
-litbuf_udeescape(unsigned char escape, core_yyscan_t yyscanner)
-{
-	char	   *new;
-	char	   *litbuf,
-			   *in,
-			   *out;
-	pg_wchar	pair_first = 0;
-
-	/* Make literalbuf null-terminated to simplify the scanning loop */
-	litbuf = yyextra->literalbuf;
-	litbuf[yyextra->literallen] = '\0';
-
-	/*
-	 * This relies on the subtle assumption that a UTF-8 expansion cannot be
-	 * longer than its escaped representation.
-	 */
-	new = palloc(yyextra->literallen + 1);
-
-	in = litbuf;
-	out = new;
-	while (*in)
-	{
-		if (in[0] == escape)
-		{
-			if (in[1] == escape)
-			{
-				if (pair_first)
-				{
-					ADVANCE_YYLLOC(in - litbuf + 3);	/* 3 for U&" */
-					yyerror("invalid Unicode surrogate pair");
-				}
-				*out++ = escape;
-				in += 2;
-			}
-			else if (isxdigit((unsigned char) in[1]) &&
-					 isxdigit((unsigned char) in[2]) &&
-					 isxdigit((unsigned char) in[3]) &&
-					 isxdigit((unsigned char) in[4]))
-			{
-				pg_wchar	unicode;
-
-				unicode = (hexval(in[1]) << 12) +
-					(hexval(in[2]) << 8) +
-					(hexval(in[3]) << 4) +
-					hexval(in[4]);
-				check_unicode_value(unicode, in, yyscanner);
-				if (pair_first)
-				{
-					if (is_utf16_surrogate_second(unicode))
-					{
-						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
-						pair_first = 0;
-					}
-					else
-					{
-						ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-						yyerror("invalid Unicode surrogate pair");
-					}
-				}
-				else if (is_utf16_surrogate_second(unicode))
-					yyerror("invalid Unicode surrogate pair");
-
-				if (is_utf16_surrogate_first(unicode))
-					pair_first = unicode;
-				else
-				{
-					unicode_to_utf8(unicode, (unsigned char *) out);
-					out += pg_mblen(out);
-				}
-				in += 5;
-			}
-			else if (in[1] == '+' &&
-					 isxdigit((unsigned char) in[2]) &&
-					 isxdigit((unsigned char) in[3]) &&
-					 isxdigit((unsigned char) in[4]) &&
-					 isxdigit((unsigned char) in[5]) &&
-					 isxdigit((unsigned char) in[6]) &&
-					 isxdigit((unsigned char) in[7]))
-			{
-				pg_wchar	unicode;
-
-				unicode = (hexval(in[2]) << 20) +
-					(hexval(in[3]) << 16) +
-					(hexval(in[4]) << 12) +
-					(hexval(in[5]) << 8) +
-					(hexval(in[6]) << 4) +
-					hexval(in[7]);
-				check_unicode_value(unicode, in, yyscanner);
-				if (pair_first)
-				{
-					if (is_utf16_surrogate_second(unicode))
-					{
-						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
-						pair_first = 0;
-					}
-					else
-					{
-						ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-						yyerror("invalid Unicode surrogate pair");
-					}
-				}
-				else if (is_utf16_surrogate_second(unicode))
-					yyerror("invalid Unicode surrogate pair");
-
-				if (is_utf16_surrogate_first(unicode))
-					pair_first = unicode;
-				else
-				{
-					unicode_to_utf8(unicode, (unsigned char *) out);
-					out += pg_mblen(out);
-				}
-				in += 8;
-			}
-			else
-			{
-				ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-				yyerror("invalid Unicode escape value");
-			}
-		}
-		else
-		{
-			if (pair_first)
-			{
-				ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-				yyerror("invalid Unicode surrogate pair");
-			}
-			*out++ = *in++;
-		}
-	}
-
-	/* unfinished surrogate pair? */
-	if (pair_first)
-	{
-		ADVANCE_YYLLOC(in - litbuf + 3);				/* 3 for U&" */
-		yyerror("invalid Unicode surrogate pair");
-	}
-
-	*out = '\0';
-
-	/*
-	 * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII
-	 * codes; but it's probably not worth the trouble, since this isn't likely
-	 * to be a performance-critical path.
-	 */
-	pg_verifymbstr(new, out - new, false);
-	return new;
-}
-
 static unsigned char
 unescape_single_char(unsigned char c, core_yyscan_t yyscanner)
 {
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 3e3e6c4..0c4cb9c 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -509,6 +509,27 @@ typedef uint32 (*utf_local_conversion_func) (uint32 code);
 
 
 /*
+ * Some handy functions for Unicode-specific tests.
+ */
+static inline bool
+is_utf16_surrogate_first(pg_wchar c)
+{
+	return (c >= 0xD800 && c <= 0xDBFF);
+}
+
+static inline bool
+is_utf16_surrogate_second(pg_wchar c)
+{
+	return (c >= 0xDC00 && c <= 0xDFFF);
+}
+
+static inline pg_wchar
+surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second)
+{
+	return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
+}
+
+/*
  * These functions are considered part of libpq's exported API and
  * are also declared in libpq-fe.h.
  */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 00ace84..5893d31 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -416,6 +416,7 @@ PG_KEYWORD("truncate", TRUNCATE, UNRESERVED_KEYWORD)
 PG_KEYWORD("trusted", TRUSTED, UNRESERVED_KEYWORD)
 PG_KEYWORD("type", TYPE_P, UNRESERVED_KEYWORD)
 PG_KEYWORD("types", TYPES_P, UNRESERVED_KEYWORD)
+PG_KEYWORD("uescape", UESCAPE, UNRESERVED_KEYWORD)
 PG_KEYWORD("unbounded", UNBOUNDED, UNRESERVED_KEYWORD)
 PG_KEYWORD("uncommitted", UNCOMMITTED, UNRESERVED_KEYWORD)
 PG_KEYWORD("unencrypted", UNENCRYPTED, UNRESERVED_KEYWORD)
diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h
index 731a2bd..571d5e2 100644
--- a/src/include/parser/scanner.h
+++ b/src/include/parser/scanner.h
@@ -48,7 +48,7 @@ typedef union core_YYSTYPE
  * However, those are not defined in this file, because bison insists on
  * defining them for itself.  The token codes used by the core scanner are
  * the ASCII characters plus these:
- *	%token <str>	IDENT FCONST SCONST BCONST XCONST Op
+ *	%token <str>	IDENT UIDENT FCONST SCONST UCONST BCONST XCONST Op
  *	%token <ival>	ICONST PARAM
  *	%token			TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
  *	%token			LESS_EQUALS GREATER_EQUALS NOT_EQUALS
@@ -99,6 +99,7 @@ typedef struct core_yy_extra_type
 	int			literallen;		/* actual current string length */
 	int			literalalloc;	/* current allocated buffer size */
 
+	int			state_before_str_stop;	/* start cond. before end quote */
 	int			xcdepth;		/* depth of nesting in slash-star comments */
 	char	   *dolqstart;		/* current $foo$ quote start string */
 
diff --git a/src/interfaces/ecpg/preproc/ecpg.tokens b/src/interfaces/ecpg/preproc/ecpg.tokens
index 1d613af..749a914 100644
--- a/src/interfaces/ecpg/preproc/ecpg.tokens
+++ b/src/interfaces/ecpg/preproc/ecpg.tokens
@@ -24,4 +24,4 @@
                 S_TYPEDEF
 
 %token CSTRING CVARIABLE CPP_LINE IP
-%token DOLCONST ECONST NCONST UCONST UIDENT
+%token DOLCONST ECONST NCONST
diff --git a/src/interfaces/ecpg/preproc/ecpg.trailer b/src/interfaces/ecpg/preproc/ecpg.trailer
index f58b41e..efad0c0 100644
--- a/src/interfaces/ecpg/preproc/ecpg.trailer
+++ b/src/interfaces/ecpg/preproc/ecpg.trailer
@@ -1750,7 +1750,6 @@ ecpg_sconst:
 			$$[strlen($1)+3]='\0';
 			free($1);
 		}
-		| UCONST	{ $$ = $1; }
 		| DOLCONST	{ $$ = $1; }
 		;
 
@@ -1758,7 +1757,6 @@ ecpg_xconst:	XCONST		{ $$ = make_name(); } ;
 
 ecpg_ident:	IDENT		{ $$ = make_name(); }
 		| CSTRING	{ $$ = make3_str(mm_strdup("\""), $1, mm_strdup("\"")); }
-		| UIDENT	{ $$ = $1; }
 		;
 
 quoted_ident_stringvar: name
diff --git a/src/interfaces/ecpg/preproc/parse.pl b/src/interfaces/ecpg/preproc/parse.pl
index 3619706..dc40b29 100644
--- a/src/interfaces/ecpg/preproc/parse.pl
+++ b/src/interfaces/ecpg/preproc/parse.pl
@@ -218,8 +218,8 @@ sub main
 				if ($a eq 'IDENT' && $prior eq '%nonassoc')
 				{
 
-					# add two more tokens to the list
-					$str = $str . "\n%nonassoc CSTRING\n%nonassoc UIDENT";
+					# add one more tokens to the list
+					$str = $str . "\n%nonassoc CSTRING";
 				}
 				$prior = $a;
 			}
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index 454071a..3cdf928 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -232,7 +232,7 @@ static	void			check_raise_parameters(PLpgSQL_stmt_raise *stmt);
  * Some of these are not directly referenced in this file, but they must be
  * here anyway.
  */
-%token <str>	IDENT FCONST SCONST BCONST XCONST Op
+%token <str>	IDENT UIDENT FCONST SCONST UCONST BCONST XCONST Op
 %token <ival>	ICONST PARAM
 %token			TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
 %token			LESS_EQUALS GREATER_EQUALS NOT_EQUALS
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 6d96843..0716e4f 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -48,17 +48,17 @@ SELECT 'tricky' AS U&"\" UESCAPE '!';
 (1 row)
 
 SELECT U&'wrong: \061';
-ERROR:  invalid Unicode escape value at or near "\061'"
+ERROR:  invalid Unicode escape value
 LINE 1: SELECT U&'wrong: \061';
                          ^
 SELECT U&'wrong: \+0061';
-ERROR:  invalid Unicode escape value at or near "\+0061'"
+ERROR:  invalid Unicode escape value
 LINE 1: SELECT U&'wrong: \+0061';
                          ^
 SELECT U&'wrong: +0061' UESCAPE '+';
-ERROR:  invalid Unicode escape character at or near "+'"
+ERROR:  invalid Unicode escape character at or near "'+'"
 LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
-                                         ^
+                                        ^
 SET standard_conforming_strings TO off;
 SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
 ERROR:  unsafe use of string constant with Unicode escapes

#23

John Naylor

john.naylor@2ndquadrant.com

about 6 years ago

In reply to: Tom Lane (#22)

1 attachment(s)

Re: benchmarking Flex practices

On Tue, Nov 26, 2019 at 5:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> [ My apologies for being so slow to get back to this ]

No worries -- it's a nice-to-have, not something our users are excited about.

> It struck me though that there's another solution we haven't discussed,
> and that's to make the token lookahead filter in parser.c do the work
> of converting UIDENT [UESCAPE SCONST] to IDENT, and similarly for the
> string case.

I recently tried again to get gram.y to handle it without precedence
hacks (or at least hacks with less mystery) and came to the conclusion
that maybe it just doesn't belong in the grammar after all. I hadn't
thought of any alternatives, so thanks for working on that!
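
To make the idea of folding "UIDENT [UESCAPE SCONST]" into IDENT concrete, here is a toy sketch of such a filter working on a plain array of token codes. The token names (T_UIDENT and friends) and filter_tokens() are made up for illustration and are not the backend's; the real filter would also have to de-escape the string value, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token codes; the real ones are generated by Bison. */
typedef enum
{
	T_IDENT, T_UIDENT, T_SCONST, T_UCONST, T_UESCAPE, T_OTHER
} Tok;

/*
 * Copy tokens from in[] to out[], folding
 *   UIDENT [UESCAPE SCONST]  ->  IDENT
 *   UCONST [UESCAPE SCONST]  ->  SCONST
 * Returns the number of tokens written, or -1 if UESCAPE is not
 * followed by SCONST.  out[] must have room for n tokens.
 */
static int
filter_tokens(const Tok *in, size_t n, Tok *out)
{
	size_t		i = 0;
	int			nout = 0;

	while (i < n)
	{
		Tok			cur = in[i++];

		if (cur == T_UIDENT || cur == T_UCONST)
		{
			/* Look ahead for UESCAPE; if present, consume it and the SCONST */
			if (i < n && in[i] == T_UESCAPE)
			{
				if (i + 1 >= n || in[i + 1] != T_SCONST)
					return -1;
				i += 2;
			}
			cur = (cur == T_UIDENT) ? T_IDENT : T_SCONST;
		}
		out[nout++] = cur;
	}
	return nout;
}

/* Usage: five input tokens collapse to three. */
void
filter_tokens_example(void)
{
	const Tok	in[] = {T_OTHER, T_UIDENT, T_UESCAPE, T_SCONST, T_UCONST};
	Tok			out[5];

	assert(filter_tokens(in, 5, out) == 3);
	assert(out[1] == T_IDENT && out[2] == T_SCONST);
}
```

The grammar then only ever sees IDENT and SCONST, which is the reason gram.y needs hardly any changes with this approach.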

It seems something is not quite right in v9 with the error position reporting:

 SELECT U&'wrong: +0061' UESCAPE '+';
 ERROR:  invalid Unicode escape character at or near "'+'"
 LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
-                                        ^
+                               ^

The caret is not pointing to the third token, or the second for that
matter. What worked for me was un-truncating the current token before
calling yylex again. To see if I'm on the right track, I've included
this in the attached, which applies on top of your v9.
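
For context, a minimal standalone sketch of the truncate/un-truncate mechanism follows. It assumes, as the addendum patch suggests, that the filter cuts the scan buffer after the current token by overwriting one byte with NUL and keeps that byte in a hold variable. The LookaheadState struct and both function names are hypothetical; only the field names are borrowed from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the lookahead fields kept in yyextra. */
typedef struct
{
	char	   *lookahead_end;			/* where the buffer was cut off */
	char		lookahead_hold_char;	/* the byte that NUL replaced */
} LookaheadState;

/* Cut the buffer right after the current token, remembering the lost byte. */
static void
truncate_after_token(LookaheadState *st, char *buf, size_t token_end)
{
	st->lookahead_end = buf + token_end;
	st->lookahead_hold_char = *st->lookahead_end;
	*st->lookahead_end = '\0';
}

/* Put the saved byte back so that the following tokens are visible again. */
static void
untruncate(LookaheadState *st)
{
	*st->lookahead_end = st->lookahead_hold_char;
}

/* Usage: the buffer looks shortened until it is un-truncated. */
void
truncation_example(void)
{
	char		buf[] = "U&'x' UESCAPE '+'";
	LookaheadState st;

	truncate_after_token(&st, buf, 5);
	assert(strcmp(buf, "U&'x'") == 0);
	untruncate(&st);
	assert(strcmp(buf, "U&'x' UESCAPE '+'") == 0);
}
```

While the buffer is truncated, anything that reads it as a C string stops at the first token, which is presumably why the byte has to be restored before an error in a later token can be reported sensibly.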

> Generally, I'm pretty happy with this approach: it touches gram.y
> hardly at all, and it removes just about all of the complexity from
> scan.l. I'm happier about dropping the support code into parser.c
> than the other choices we've discussed.

Seems like the best of both worlds. If we ever wanted to ditch the
whole token filter and use Bison's %glr mode, we'd have extra work to
do, but there doesn't seem to be a rush to do so anyway.

> There's still undone work here, though:
>
> * I did not touch psql. Probably your patch is fine for that.
>
> * I did not do more with ecpg than get it to compile, using the
> same hacks as in your v7. It still fails its regression tests,
> but now the reason is that what we've done in parser/parser.c
> needs to be transposed into the identical functionality in
> ecpg/preproc/parser.c. Or at least some kind of functionality
> there. A problem with this approach is that it presumes we can
> reduce a UIDENT sequence to a plain IDENT, but to do so we need
> assumptions about the target encoding, and I'm not sure that
> ecpg should make any such assumptions. Maybe ecpg should just
> reject all cases that produce non-ASCII identifiers? (Probably
> it could be made to do something smarter with more work, but
> it's not clear to me that it's worth the trouble.)

Hmm, I thought we only allowed Unicode escapes in the first place if
the server encoding was UTF-8. Or did you mean something else?

> If this seems like a reasonable approach to you, please fill in
> the missing psql and ecpg bits.

Will do.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v9-addendum-handle-uescapes-in-parser.patch (application/octet-stream)

diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index e64f701dc9..cc306d4d4d 100644
--- a/src/backend/parser/parser.c
+++ b/src/backend/parser/parser.c
@@ -272,6 +272,9 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 				/* Again save and restore *llocp */
 				cur_yylloc = *llocp;
 
+				/* Un-truncate current token so errors point to third token */
+				*(yyextra->lookahead_end) = yyextra->lookahead_hold_char;
+
 				/* Get third token */
 				next_token = core_yylex(&(yyextra->lookahead_yylval),
 										llocp, yyscanner);

#24

Tom Lane

tgl@sss.pgh.pa.us

about 6 years ago

In reply to: John Naylor (#23)

Re: benchmarking Flex practices

John Naylor <john.naylor@2ndquadrant.com> writes:

> It seems something is not quite right in v9 with the error position reporting:
>
>  SELECT U&'wrong: +0061' UESCAPE '+';
>  ERROR:  invalid Unicode escape character at or near "'+'"
>  LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
> -                                        ^
> +                               ^
>
> The caret is not pointing to the third token, or the second for that
> matter.

Interesting. For me it points at the third token with or without
your fix ... some flex version discrepancy maybe? Anyway, I have
no objection to your fix; it's probably cleaner than what I had.

>> * I did not do more with ecpg than get it to compile, using the
>> same hacks as in your v7. It still fails its regression tests,
>> but now the reason is that what we've done in parser/parser.c
>> needs to be transposed into the identical functionality in
>> ecpg/preproc/parser.c. Or at least some kind of functionality
>> there. A problem with this approach is that it presumes we can
>> reduce a UIDENT sequence to a plain IDENT, but to do so we need
>> assumptions about the target encoding, and I'm not sure that
>> ecpg should make any such assumptions. Maybe ecpg should just
>> reject all cases that produce non-ASCII identifiers? (Probably
>> it could be made to do something smarter with more work, but
>> it's not clear to me that it's worth the trouble.)

> Hmm, I thought we only allowed Unicode escapes in the first place if
> the server encoding was UTF-8. Or did you mean something else?

Well, yeah, but the problem here is that ecpg would have to assume
that the client encoding that its output program will be executed
with is UTF-8. That seems pretty action-at-a-distance-y.

I haven't looked closely at what ecpg does with the processed
identifiers. If it just spits them out as-is, a possible solution
is to not do anything about de-escaping, but pass the sequence
U&"..." (plus UESCAPE ... if any), just like that, on to the grammar
as the value of the IDENT token.

BTW, in the back of my mind here is Chapman's point that it'd be
a large step forward in usability if we allowed Unicode escapes
when the backend encoding is *not* UTF-8. I think I see how to
get there once this patch is done, so I definitely would not like
to introduce some comparable restriction in ecpg.

regards, tom lane

#25

John Naylor

john.naylor@2ndquadrant.com

about 6 years ago

In reply to: Tom Lane (#24)

1 attachment(s)

Re: benchmarking Flex practices

On Tue, Nov 26, 2019 at 10:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I haven't looked closely at what ecpg does with the processed
> identifiers. If it just spits them out as-is, a possible solution
> is to not do anything about de-escaping, but pass the sequence
> U&"..." (plus UESCAPE ... if any), just like that, on to the grammar
> as the value of the IDENT token.

It does pass them along as-is, so I did it that way.

In the attached v10, I've synced both ECPG and psql.

> * I haven't convinced myself either way as to whether it'd be
> better to factor out the code duplicated between the UIDENT
> and UCONST cases in base_yylex.

I chose to factor it out, since we have 2 versions of parser.c, and
this way was much easier to work with.

Some notes:

I arranged for the ECPG grammar to only see SCONST and IDENT. With
UCONST and UIDENT out of the way, it was a small additional step to
put all string reconstruction into the lexer, which has the advantage
of allowing removal of the other special-case ECPG string tokens as
well. The fewer special cases involved in pasting the grammar
together, the better. In doing so, I've probably introduced memory
leaks, but I wanted to get your opinion on the overall approach before
investigating.

In ECPG's parser.c, I simply copied check_uescapechar() and
ecpg_isspace(), but we could find a common place if desired. During
development, I found that this file replicates the location-tracking
logic in the backend, but doesn't seem to make use of it. I also would
have had to replicate the backend's datatype for YYLTYPE. Fixing that
might be worthwhile some day, but to get this working, I just ripped
out the extra location tracking.
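
As a reference for what the copied check does, here is a standalone restatement of the UESCAPE character rule from the patch. The name uescape_char_ok() is made up, and is_scanner_space() is an assumed stand-in for scanner_isspace()/ecpg_isspace(): the set of whitespace characters it accepts is a guess, not taken from the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Assumed whitespace set; stands in for scanner_isspace()/ecpg_isspace(). */
static bool
is_scanner_space(unsigned char ch)
{
	return ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r' || ch == '\f';
}

/*
 * A character is rejected as a Unicode escape character if it is a hex
 * digit, a plus sign, a single or double quote, or whitespace.
 */
static bool
uescape_char_ok(unsigned char escape)
{
	return !(isxdigit(escape)
			 || escape == '+'
			 || escape == '\''
			 || escape == '"'
			 || is_scanner_space(escape));
}

/* Usage: mirrors the UESCAPE '!' and UESCAPE '+' regression cases. */
void
uescape_char_example(void)
{
	assert(uescape_char_ok('!'));
	assert(!uescape_char_ok('+'));
}
```

Hex digits and '+' are excluded because they would be ambiguous with the escape payload itself (\XXXX and \+XXXXXX).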

I no longer use state variables to track scanner state, and in fact I
removed the existing "state_before" variable in ECPG. Instead, I used
the Flex builtins yy_push_state(), yy_pop_state(), and yy_top_state().
These have been a feature for a long time, it seems, so I think we're
okay as far as portability. I think it's cleaner this way, and
possibly faster. I also used this to reunite the xcc and xcsql states.
This whole part could be split out into a separate refactoring patch
to be applied first, if desired.
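
For readers who have not used Flex's start-condition stack (enabled with "%option stack"): per the Flex manual, yy_push_state(s) saves the current start condition and switches to s, yy_pop_state() switches back to the saved one, and yy_top_state() returns the saved one without changing anything. The following is a plain-C model of that behavior, not Flex's implementation. The Scanner struct and the un-prefixed function names are hypothetical, and the state names only mirror those in scan.l.

```c
#include <assert.h>

#define MAX_DEPTH 16

/* State names mirror scan.l; the numeric values are arbitrary. */
enum { INITIAL, XQ, XUS, XQS };

/* Hypothetical model of a scanner's start condition plus its stack. */
typedef struct
{
	int			current;			/* active start condition */
	int			stack[MAX_DEPTH];	/* saved start conditions */
	int			depth;				/* number of saved entries */
} Scanner;

/* Like yy_push_state(): save the current condition, then switch. */
static void
push_state(Scanner *s, int new_state)
{
	assert(s->depth < MAX_DEPTH);
	s->stack[s->depth++] = s->current;
	s->current = new_state;
}

/* Like yy_pop_state(): switch back to the most recently saved condition. */
static void
pop_state(Scanner *s)
{
	assert(s->depth > 0);
	s->current = s->stack[--s->depth];
}

/* Like yy_top_state(): peek at the saved condition without popping it. */
static int
top_state(const Scanner *s)
{
	assert(s->depth > 0);
	return s->stack[s->depth - 1];
}
```

With this in place, a shared "quote stop" state can ask top_state() which kind of string it was entered from, so no separate state variable is needed to remember that.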

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v10-handle-uescapes-in-parser.patch (application/octet-stream)

 src/backend/parser/gram.y                |   5 +-
 src/backend/parser/parser.c              | 279 ++++++++++++++++++-
 src/backend/parser/scan.l                | 451 +++++++------------------------
 src/fe_utils/psqlscan.l                  | 156 +++++------
 src/include/mb/pg_wchar.h                |  21 ++
 src/include/parser/kwlist.h              |   1 +
 src/include/parser/scanner.h             |   2 +-
 src/interfaces/ecpg/preproc/ecpg.tokens  |   1 -
 src/interfaces/ecpg/preproc/ecpg.trailer |  37 +--
 src/interfaces/ecpg/preproc/ecpg.type    |   6 +-
 src/interfaces/ecpg/preproc/parse.pl     |   4 +-
 src/interfaces/ecpg/preproc/parser.c     | 114 ++++++--
 src/interfaces/ecpg/preproc/pgc.l        | 269 +++++++++---------
 src/pl/plpgsql/src/pl_gram.y             |   2 +-
 src/test/regress/expected/strings.out    |  12 +-
 src/test/regress/sql/strings.sql         |   1 +
 16 files changed, 707 insertions(+), 654 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c5086846de..1f10340484 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -601,7 +601,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
  * DOT_DOT is unused in the core SQL grammar, and so will always provoke
  * parse errors.  It is needed by PL/pgSQL.
  */
-%token <str>	IDENT FCONST SCONST BCONST XCONST Op
+%token <str>	IDENT UIDENT FCONST SCONST UCONST BCONST XCONST Op
 %token <ival>	ICONST PARAM
 %token			TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
 %token			LESS_EQUALS GREATER_EQUALS NOT_EQUALS
@@ -691,7 +691,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	TREAT TRIGGER TRIM TRUE_P
 	TRUNCATE TRUSTED TYPE_P TYPES_P
 
-	UNBOUNDED UNCOMMITTED UNENCRYPTED UNION UNIQUE UNKNOWN UNLISTEN UNLOGGED
+	UESCAPE UNBOUNDED UNCOMMITTED UNENCRYPTED UNION UNIQUE UNKNOWN UNLISTEN UNLOGGED
 	UNTIL UPDATE USER USING
 
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
@@ -15374,6 +15374,7 @@ unreserved_keyword:
 			| TRUSTED
 			| TYPE_P
 			| TYPES_P
+			| UESCAPE
 			| UNBOUNDED
 			| UNCOMMITTED
 			| UNENCRYPTED
diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index 4c0c258cd7..6d4a9721ac 100644
--- a/src/backend/parser/parser.c
+++ b/src/backend/parser/parser.c
@@ -23,6 +23,12 @@
 
 #include "parser/gramparse.h"
 #include "parser/parser.h"
+#include "parser/scansup.h"
+#include "mb/pg_wchar.h"
+
+static bool check_uescapechar(unsigned char escape);
+static char *str_udeescape(char escape, char *str, int position,
+						   core_yyscan_t yyscanner);
 
 
 /*
@@ -75,6 +81,10 @@ raw_parser(const char *str)
  * scanner backtrack, which would cost more performance than this filter
  * layer does.
  *
+ * We also use this filter to convert UIDENT and UCONST sequences into
+ * plain IDENT and SCONST tokens.  While that could be handled by additional
+ * productions in the main grammar, it's more efficient to do it like this.
+ *
  * The filter also provides a convenient place to translate between
  * the core_YYSTYPE and YYSTYPE representations (which are really the
  * same thing anyway, but notationally they're different).
@@ -104,7 +114,7 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 	 * If this token isn't one that requires lookahead, just return it.  If it
 	 * does, determine the token length.  (We could get that via strlen(), but
 	 * since we have such a small set of possibilities, hardwiring seems
-	 * feasible and more efficient.)
+	 * feasible and more efficient --- at least for the fixed-length cases.)
 	 */
 	switch (cur_token)
 	{
@@ -117,6 +127,10 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 		case WITH:
 			cur_token_length = 4;
 			break;
+		case UIDENT:
+		case UCONST:
+			cur_token_length = strlen(yyextra->core_yy_extra.scanbuf + *llocp);
+			break;
 		default:
 			return cur_token;
 	}
@@ -190,7 +204,270 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 					break;
 			}
 			break;
+
+		case UIDENT:
+		case UCONST:
+			/* Look ahead for UESCAPE */
+			if (next_token == UESCAPE)
+			{
+				/* Yup, so get third token, which had better be SCONST */
+				const char *escstr;
+
+				/* Again save and restore *llocp */
+				cur_yylloc = *llocp;
+
+				/* Un-truncate current token so errors point to third token */
+				*(yyextra->lookahead_end) = yyextra->lookahead_hold_char;
+
+				/* Get third token */
+				next_token = core_yylex(&(yyextra->lookahead_yylval),
+										llocp, yyscanner);
+
+				/* If we throw error here, it will point to third token */
+				if (next_token != SCONST)
+					scanner_yyerror("UESCAPE must be followed by a simple string literal",
+									yyscanner);
+
+				escstr = yyextra->lookahead_yylval.str;
+				if (strlen(escstr) != 1 || !check_uescapechar(escstr[0]))
+					scanner_yyerror("invalid Unicode escape character",
+									yyscanner);
+
+				/* Now restore *llocp; errors will point to first token */
+				*llocp = cur_yylloc;
+
+				/* Apply Unicode conversion */
+				lvalp->core_yystype.str =
+					str_udeescape(escstr[0],
+								  lvalp->core_yystype.str,
+								  *llocp,
+								  yyscanner);
+
+				/*
+				 * We don't need to revert the un-truncation of UESCAPE.  What we
+				 * do want to do is clear have_lookahead, thereby consuming
+				 * all three tokens.
+				 */
+				yyextra->have_lookahead = false;
+			}
+			else
+			{
+				/* No UESCAPE, so convert using default escape character */
+				lvalp->core_yystype.str =
+					str_udeescape('\\',
+								  lvalp->core_yystype.str,
+								  *llocp,
+								  yyscanner);
+			}
+
+			if (cur_token == UIDENT)
+			{
+				/* It's an identifier, so truncate as appropriate */
+				truncate_identifier(lvalp->core_yystype.str,
+									strlen(lvalp->core_yystype.str),
+									true);
+				cur_token = IDENT;
+			}
+			else if (cur_token == UCONST)
+			{
+				cur_token = SCONST;
+			}
+			break;
 	}
 
 	return cur_token;
 }
+
+/* convert hex digit (caller should have verified that) to value */
+static unsigned int
+hexval(unsigned char c)
+{
+	if (c >= '0' && c <= '9')
+		return c - '0';
+	if (c >= 'a' && c <= 'f')
+		return c - 'a' + 0xA;
+	if (c >= 'A' && c <= 'F')
+		return c - 'A' + 0xA;
+	elog(ERROR, "invalid hexadecimal digit");
+	return 0;					/* not reached */
+}
+
+/* is Unicode code point acceptable in database's encoding? */
+static void
+check_unicode_value(pg_wchar c, int pos, core_yyscan_t yyscanner)
+{
+	/* See also addunicode() in scan.l */
+	if (c == 0 || c > 0x10FFFF)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("invalid Unicode escape value"),
+				 scanner_errposition(pos, yyscanner)));
+
+	if (c > 0x7F && GetDatabaseEncoding() != PG_UTF8)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8"),
+				 scanner_errposition(pos, yyscanner)));
+}
+
+/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
+static bool
+check_uescapechar(unsigned char escape)
+{
+	if (isxdigit(escape)
+		|| escape == '+'
+		|| escape == '\''
+		|| escape == '"'
+		|| scanner_isspace(escape))
+		return false;
+	else
+		return true;
+}
+
+/* Process Unicode escapes in "str", producing a palloc'd plain string */
+static char *
+str_udeescape(char escape, char *str, int position,
+			  core_yyscan_t yyscanner)
+{
+	char	   *new,
+			   *in,
+			   *out;
+	int			str_length;
+	pg_wchar	pair_first = 0;
+
+	str_length = strlen(str);
+
+	/*
+	 * This relies on the subtle assumption that a UTF-8 expansion cannot be
+	 * longer than its escaped representation.
+	 */
+	new = palloc(str_length + 1);
+
+	in = str;
+	out = new;
+	while (*in)
+	{
+		if (in[0] == escape)
+		{
+			if (in[1] == escape)
+			{
+				if (pair_first)
+					goto invalid_pair;
+				*out++ = escape;
+				in += 2;
+			}
+			else if (isxdigit((unsigned char) in[1]) &&
+					 isxdigit((unsigned char) in[2]) &&
+					 isxdigit((unsigned char) in[3]) &&
+					 isxdigit((unsigned char) in[4]))
+			{
+				pg_wchar	unicode;
+
+				unicode = (hexval(in[1]) << 12) +
+					(hexval(in[2]) << 8) +
+					(hexval(in[3]) << 4) +
+					hexval(in[4]);
+				check_unicode_value(unicode,
+									position + in - str + 3,	/* 3 for U&" */
+									yyscanner);
+				if (pair_first)
+				{
+					if (is_utf16_surrogate_second(unicode))
+					{
+						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+						pair_first = 0;
+					}
+					else
+						goto invalid_pair;
+				}
+				else if (is_utf16_surrogate_second(unicode))
+					goto invalid_pair;
+
+				if (is_utf16_surrogate_first(unicode))
+					pair_first = unicode;
+				else
+				{
+					unicode_to_utf8(unicode, (unsigned char *) out);
+					out += pg_mblen(out);
+				}
+				in += 5;
+			}
+			else if (in[1] == '+' &&
+					 isxdigit((unsigned char) in[2]) &&
+					 isxdigit((unsigned char) in[3]) &&
+					 isxdigit((unsigned char) in[4]) &&
+					 isxdigit((unsigned char) in[5]) &&
+					 isxdigit((unsigned char) in[6]) &&
+					 isxdigit((unsigned char) in[7]))
+			{
+				pg_wchar	unicode;
+
+				unicode = (hexval(in[2]) << 20) +
+					(hexval(in[3]) << 16) +
+					(hexval(in[4]) << 12) +
+					(hexval(in[5]) << 8) +
+					(hexval(in[6]) << 4) +
+					hexval(in[7]);
+				check_unicode_value(unicode,
+									position + in - str + 3,	/* 3 for U&" */
+									yyscanner);
+				if (pair_first)
+				{
+					if (is_utf16_surrogate_second(unicode))
+					{
+						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+						pair_first = 0;
+					}
+					else
+						goto invalid_pair;
+				}
+				else if (is_utf16_surrogate_second(unicode))
+					goto invalid_pair;
+
+				if (is_utf16_surrogate_first(unicode))
+					pair_first = unicode;
+				else
+				{
+					unicode_to_utf8(unicode, (unsigned char *) out);
+					out += pg_mblen(out);
+				}
+				in += 8;
+			}
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("invalid Unicode escape value"),
+						 scanner_errposition(position + in - str + 3,	/* 3 for U&" */
+											 yyscanner)));
+		}
+		else
+		{
+			if (pair_first)
+				goto invalid_pair;
+
+			*out++ = *in++;
+		}
+	}
+
+	/* unfinished surrogate pair? */
+	if (pair_first)
+		goto invalid_pair;
+
+	*out = '\0';
+
+	/*
+	 * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII
+	 * codes; but it's probably not worth the trouble, since this isn't likely
+	 * to be a performance-critical path.
+	 */
+	pg_verifymbstr(new, out - new, false);
+	return new;
+
+invalid_pair:
+	ereport(ERROR,
+			(errcode(ERRCODE_SYNTAX_ERROR),
+			 errmsg("invalid Unicode surrogate pair"),
+			 scanner_errposition(position + in - str + 3,	/* 3 for U&" */
+								 yyscanner)));
+	return NULL;				/* keep compiler quiet */
+}
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index e1cae859e8..856f4bac3a 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -110,14 +110,9 @@ const uint16 ScanKeywordTokens[] = {
 static void addlit(char *ytext, int yleng, core_yyscan_t yyscanner);
 static void addlitchar(unsigned char ychar, core_yyscan_t yyscanner);
 static char *litbufdup(core_yyscan_t yyscanner);
-static char *litbuf_udeescape(unsigned char escape, core_yyscan_t yyscanner);
 static unsigned char unescape_single_char(unsigned char c, core_yyscan_t yyscanner);
 static int	process_integer_literal(const char *token, YYSTYPE *lval);
-static bool is_utf16_surrogate_first(pg_wchar c);
-static bool is_utf16_surrogate_second(pg_wchar c);
-static pg_wchar surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second);
 static void addunicode(pg_wchar c, yyscan_t yyscanner);
-static bool check_uescapechar(unsigned char escape);
 
 #define yyerror(msg)  scanner_yyerror(msg, yyscanner)
 
@@ -149,6 +144,7 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %option noyyalloc
 %option noyyrealloc
 %option noyyfree
+%option stack
 %option warn
 %option prefix="core_yy"
 
@@ -168,12 +164,11 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
- *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
  *  <xus> quoted string with Unicode escapes
- *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
  *  <xeu> Unicode surrogate pair in extended quoted string
  *
  * Remember to add an <<EOF>> case whenever you add a new exclusive state!
@@ -185,12 +180,11 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
-%x xuiend
 %x xus
-%x xusend
 %x xeu
 
 /*
@@ -231,19 +225,18 @@ special_whitespace		({space}+|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
+quote			'
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue	{whitespace_with_newline}{quote}
+
 /*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
  */
-quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+quotecontinuefail	{whitespace}*"-"?
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -304,21 +297,12 @@ xdstop			{dquote}
 xddouble		{dquote}{dquote}
 xdinside		[^"]+
 
-/* Unicode escapes */
-uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
-/* error rule to avoid backup */
-uescapefail		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
-
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
 
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
 
-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1		{uescapefail}?
-xustop2		{uescape}
-
 /* error rule to avoid backup */
 xufailed		[uU]&
 
@@ -428,7 +412,7 @@ other			.
 					/* Set location in case of syntax error in comment */
 					SET_YYLLOC();
 					yyextra->xcdepth = 0;
-					BEGIN(xc);
+					yy_push_state(xc, yyscanner);
 					/* Put back any characters past slash-star; see above */
 					yyless(2);
 				}
@@ -442,7 +426,7 @@ other			.
 
 {xcstop}		{
 					if (yyextra->xcdepth <= 0)
-						BEGIN(INITIAL);
+						yy_pop_state(yyscanner);
 					else
 						(yyextra->xcdepth)--;
 				}
@@ -472,25 +456,14 @@ other			.
 					 * to mark it for the input routine as a binary string.
 					 */
 					SET_YYLLOC();
-					BEGIN(xb);
+					yy_push_state(xb, yyscanner);
 					startlit();
 					addlitchar('b', yyscanner);
 				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return BCONST;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					addlit(yytext, yyleng, yyscanner);
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					/* ignore */
-				}
 <xb><<EOF>>		{ yyerror("unterminated bit string literal"); }
 
 {xhstart}		{
@@ -501,17 +474,10 @@ other			.
 					 * to mark it for the input routine as a hex string.
 					 */
 					SET_YYLLOC();
-					BEGIN(xh);
+					yy_push_state(xh, yyscanner);
 					startlit();
 					addlitchar('x', yyscanner);
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return XCONST;
-				}
 <xh><<EOF>>		{ yyerror("unterminated hexadecimal string literal"); }
 
 {xnstart}		{
@@ -545,16 +511,16 @@ other			.
 					yyextra->saw_non_ascii = false;
 					SET_YYLLOC();
 					if (yyextra->standard_conforming_strings)
-						BEGIN(xq);
+						yy_push_state(xq, yyscanner);
 					else
-						BEGIN(xe);
+						yy_push_state(xe, yyscanner);
 					startlit();
 				}
 {xestart}		{
 					yyextra->warn_on_first_escape = false;
 					yyextra->saw_non_ascii = false;
 					SET_YYLLOC();
-					BEGIN(xe);
+					yy_push_state(xe, yyscanner);
 					startlit();
 				}
 {xusstart}		{
@@ -565,56 +531,80 @@ other			.
 								 errmsg("unsafe use of string constant with Unicode escapes"),
 								 errdetail("String constants with Unicode escapes cannot be used when standard_conforming_strings is off."),
 								 lexer_errposition()));
-					BEGIN(xus);
+					yy_push_state(xus, yyscanner);
 					startlit();
 				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
+
+<xb,xh,xq,xe,xus>{quote} {
 					/*
-					 * check that the data remains valid if it might have been
-					 * made invalid by unescaping any chars.
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the end quote was in fact
+					 * the end of the string.  To reduce the lexer table size,
+					 * we use a single "xqs" state to do the lookahead for all
+					 * types of strings.
 					 */
-					if (yyextra->saw_non_ascii)
-						pg_verifymbstr(yyextra->literalbuf,
-									   yyextra->literallen,
-									   false);
-					yylval->str = litbufdup(yyscanner);
-					return SCONST;
-				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
-					/* xusend state looks for possible UESCAPE */
-					BEGIN(xusend);
+					yy_push_state(xqs, yyscanner);
 				}
-<xusend>{whitespace} {
-					/* stay in xusend state over whitespace */
+<xqs>{quotecontinue} {
+					/*
+					 * Found a quote continuation, so return to the in-quote
+					 * state and continue scanning the literal.
+					 */
+					yy_pop_state(yyscanner);
 				}
-<xusend><<EOF>> |
-<xusend>{other} |
-<xusend>{xustop1} {
-					/* no UESCAPE after the quote, throw back everything */
+<xqs>{quotecontinuefail} |
+<xqs><<EOF>> |
+<xqs>{other}	{
+					int token;
+
+					/*
+					 * Failed to see a quote continuation.  Throw back
+					 * everything after the end quote, and handle the string
+					 * according to the state we were in previously.
+					 */
 					yyless(0);
-					BEGIN(INITIAL);
-					yylval->str = litbuf_udeescape('\\', yyscanner);
-					return SCONST;
-				}
-<xusend>{xustop2} {
-					/* found UESCAPE after the end quote */
-					BEGIN(INITIAL);
-					if (!check_uescapechar(yytext[yyleng - 2]))
+
+					switch (yy_top_state(yyscanner))
 					{
-						SET_YYLLOC();
-						ADVANCE_YYLLOC(yyleng - 2);
-						yyerror("invalid Unicode escape character");
+						case xb:
+							yylval->str = litbufdup(yyscanner);
+							token = BCONST;
+							break;
+						case xh:
+							yylval->str = litbufdup(yyscanner);
+							token = XCONST;
+							break;
+						case xq:
+							/* fallthrough */
+						case xe:
+							/*
+							 * Check that the data remains valid if it
+							 * might have been made invalid by unescaping
+							 * any chars.
+							 */
+							if (yyextra->saw_non_ascii)
+								pg_verifymbstr(yyextra->literalbuf,
+											   yyextra->literallen,
+											   false);
+							yylval->str = litbufdup(yyscanner);
+							token = SCONST;
+							break;
+						case xus:
+							yylval->str = litbufdup(yyscanner);
+							token = UCONST;
+							break;
+						default:
+							yyerror("unhandled previous state in xqs");
 					}
-					yylval->str = litbuf_udeescape(yytext[yyleng - 2],
-												   yyscanner);
-					return SCONST;
+
+					/* go back to state before string start */
+					yy_pop_state(yyscanner);
+					yy_pop_state(yyscanner);
+
+					return token;
 				}
+
 <xq,xe,xus>{xqdouble} {
 					addlitchar('\'', yyscanner);
 				}
@@ -693,9 +683,6 @@ other			.
 					if (c == '\0' || IS_HIGHBIT_SET(c))
 						yyextra->saw_non_ascii = true;
 				}
-<xq,xe,xus>{quotecontinue} {
-					/* ignore */
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					addlitchar(yytext[0], yyscanner);
@@ -705,7 +692,7 @@ other			.
 {dolqdelim}		{
 					SET_YYLLOC();
 					yyextra->dolqstart = pstrdup(yytext);
-					BEGIN(xdolq);
+					yy_push_state(xdolq, yyscanner);
 					startlit();
 				}
 {dolqfailed}	{
@@ -720,7 +707,7 @@ other			.
 					{
 						pfree(yyextra->dolqstart);
 						yyextra->dolqstart = NULL;
-						BEGIN(INITIAL);
+						yy_pop_state(yyscanner);
 						yylval->str = litbufdup(yyscanner);
 						return SCONST;
 					}
@@ -749,18 +736,18 @@ other			.
 
 {xdstart}		{
 					SET_YYLLOC();
-					BEGIN(xd);
+					yy_push_state(xd, yyscanner);
 					startlit();
 				}
 {xuistart}		{
 					SET_YYLLOC();
-					BEGIN(xui);
+					yy_push_state(xui, yyscanner);
 					startlit();
 				}
 <xd>{xdstop}	{
 					char	   *ident;
 
-					BEGIN(INITIAL);
+					yy_pop_state(yyscanner);
 					if (yyextra->literallen == 0)
 						yyerror("zero-length delimited identifier");
 					ident = litbufdup(yyscanner);
@@ -769,54 +756,15 @@ other			.
 					yylval->str = ident;
 					return IDENT;
 				}
-<xui>{dquote} {
-					yyless(1);
-					/* xuiend state looks for possible UESCAPE */
-					BEGIN(xuiend);
-				}
-<xuiend>{whitespace} {
-					/* stay in xuiend state over whitespace */
-				}
-<xuiend><<EOF>> |
-<xuiend>{other} |
-<xuiend>{xustop1} {
-					/* no UESCAPE after the quote, throw back everything */
-					char	   *ident;
-					int			identlen;
-
-					yyless(0);
-
-					BEGIN(INITIAL);
+<xui>{dquote}	{
 					if (yyextra->literallen == 0)
 						yyerror("zero-length delimited identifier");
-					ident = litbuf_udeescape('\\', yyscanner);
-					identlen = strlen(ident);
-					if (identlen >= NAMEDATALEN)
-						truncate_identifier(ident, identlen, true);
-					yylval->str = ident;
-					return IDENT;
-				}
-<xuiend>{xustop2}	{
-					/* found UESCAPE after the end quote */
-					char	   *ident;
-					int			identlen;
 
-					BEGIN(INITIAL);
-					if (yyextra->literallen == 0)
-						yyerror("zero-length delimited identifier");
-					if (!check_uescapechar(yytext[yyleng - 2]))
-					{
-						SET_YYLLOC();
-						ADVANCE_YYLLOC(yyleng - 2);
-						yyerror("invalid Unicode escape character");
-					}
-					ident = litbuf_udeescape(yytext[yyleng - 2], yyscanner);
-					identlen = strlen(ident);
-					if (identlen >= NAMEDATALEN)
-						truncate_identifier(ident, identlen, true);
-					yylval->str = ident;
-					return IDENT;
+					yy_pop_state(yyscanner);
+					yylval->str = litbufdup(yyscanner);
+					return UIDENT;
 				}
+
 <xd,xui>{xddouble}	{
 					addlitchar('"', yyscanner);
 				}
@@ -1288,55 +1236,12 @@ process_integer_literal(const char *token, YYSTYPE *lval)
 	return ICONST;
 }
 
-static unsigned int
-hexval(unsigned char c)
-{
-	if (c >= '0' && c <= '9')
-		return c - '0';
-	if (c >= 'a' && c <= 'f')
-		return c - 'a' + 0xA;
-	if (c >= 'A' && c <= 'F')
-		return c - 'A' + 0xA;
-	elog(ERROR, "invalid hexadecimal digit");
-	return 0;					/* not reached */
-}
-
-static void
-check_unicode_value(pg_wchar c, char *loc, core_yyscan_t yyscanner)
-{
-	if (GetDatabaseEncoding() == PG_UTF8)
-		return;
-
-	if (c > 0x7F)
-	{
-		ADVANCE_YYLLOC(loc - yyextra->literalbuf + 3);	/* 3 for U&" */
-		yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8");
-	}
-}
-
-static bool
-is_utf16_surrogate_first(pg_wchar c)
-{
-	return (c >= 0xD800 && c <= 0xDBFF);
-}
-
-static bool
-is_utf16_surrogate_second(pg_wchar c)
-{
-	return (c >= 0xDC00 && c <= 0xDFFF);
-}
-
-static pg_wchar
-surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second)
-{
-	return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
-}
-
 static void
 addunicode(pg_wchar c, core_yyscan_t yyscanner)
 {
 	char		buf[8];
 
+	/* See also check_unicode_value() in parser.c */
 	if (c == 0 || c > 0x10FFFF)
 		yyerror("invalid Unicode escape value");
 	if (c > 0x7F)
@@ -1349,172 +1254,6 @@ addunicode(pg_wchar c, core_yyscan_t yyscanner)
 	addlit(buf, pg_mblen(buf), yyscanner);
 }
 
-/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
-static bool
-check_uescapechar(unsigned char escape)
-{
-	if (isxdigit(escape)
-		|| escape == '+'
-		|| escape == '\''
-		|| escape == '"'
-		|| scanner_isspace(escape))
-	{
-		return false;
-	}
-	else
-		return true;
-}
-
-/* like litbufdup, but handle unicode escapes */
-static char *
-litbuf_udeescape(unsigned char escape, core_yyscan_t yyscanner)
-{
-	char	   *new;
-	char	   *litbuf,
-			   *in,
-			   *out;
-	pg_wchar	pair_first = 0;
-
-	/* Make literalbuf null-terminated to simplify the scanning loop */
-	litbuf = yyextra->literalbuf;
-	litbuf[yyextra->literallen] = '\0';
-
-	/*
-	 * This relies on the subtle assumption that a UTF-8 expansion cannot be
-	 * longer than its escaped representation.
-	 */
-	new = palloc(yyextra->literallen + 1);
-
-	in = litbuf;
-	out = new;
-	while (*in)
-	{
-		if (in[0] == escape)
-		{
-			if (in[1] == escape)
-			{
-				if (pair_first)
-				{
-					ADVANCE_YYLLOC(in - litbuf + 3);	/* 3 for U&" */
-					yyerror("invalid Unicode surrogate pair");
-				}
-				*out++ = escape;
-				in += 2;
-			}
-			else if (isxdigit((unsigned char) in[1]) &&
-					 isxdigit((unsigned char) in[2]) &&
-					 isxdigit((unsigned char) in[3]) &&
-					 isxdigit((unsigned char) in[4]))
-			{
-				pg_wchar	unicode;
-
-				unicode = (hexval(in[1]) << 12) +
-					(hexval(in[2]) << 8) +
-					(hexval(in[3]) << 4) +
-					hexval(in[4]);
-				check_unicode_value(unicode, in, yyscanner);
-				if (pair_first)
-				{
-					if (is_utf16_surrogate_second(unicode))
-					{
-						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
-						pair_first = 0;
-					}
-					else
-					{
-						ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-						yyerror("invalid Unicode surrogate pair");
-					}
-				}
-				else if (is_utf16_surrogate_second(unicode))
-					yyerror("invalid Unicode surrogate pair");
-
-				if (is_utf16_surrogate_first(unicode))
-					pair_first = unicode;
-				else
-				{
-					unicode_to_utf8(unicode, (unsigned char *) out);
-					out += pg_mblen(out);
-				}
-				in += 5;
-			}
-			else if (in[1] == '+' &&
-					 isxdigit((unsigned char) in[2]) &&
-					 isxdigit((unsigned char) in[3]) &&
-					 isxdigit((unsigned char) in[4]) &&
-					 isxdigit((unsigned char) in[5]) &&
-					 isxdigit((unsigned char) in[6]) &&
-					 isxdigit((unsigned char) in[7]))
-			{
-				pg_wchar	unicode;
-
-				unicode = (hexval(in[2]) << 20) +
-					(hexval(in[3]) << 16) +
-					(hexval(in[4]) << 12) +
-					(hexval(in[5]) << 8) +
-					(hexval(in[6]) << 4) +
-					hexval(in[7]);
-				check_unicode_value(unicode, in, yyscanner);
-				if (pair_first)
-				{
-					if (is_utf16_surrogate_second(unicode))
-					{
-						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
-						pair_first = 0;
-					}
-					else
-					{
-						ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-						yyerror("invalid Unicode surrogate pair");
-					}
-				}
-				else if (is_utf16_surrogate_second(unicode))
-					yyerror("invalid Unicode surrogate pair");
-
-				if (is_utf16_surrogate_first(unicode))
-					pair_first = unicode;
-				else
-				{
-					unicode_to_utf8(unicode, (unsigned char *) out);
-					out += pg_mblen(out);
-				}
-				in += 8;
-			}
-			else
-			{
-				ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-				yyerror("invalid Unicode escape value");
-			}
-		}
-		else
-		{
-			if (pair_first)
-			{
-				ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-				yyerror("invalid Unicode surrogate pair");
-			}
-			*out++ = *in++;
-		}
-	}
-
-	/* unfinished surrogate pair? */
-	if (pair_first)
-	{
-		ADVANCE_YYLLOC(in - litbuf + 3);				/* 3 for U&" */
-		yyerror("invalid Unicode surrogate pair");
-	}
-
-	*out = '\0';
-
-	/*
-	 * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII
-	 * codes; but it's probably not worth the trouble, since this isn't likely
-	 * to be a performance-critical path.
-	 */
-	pg_verifymbstr(new, out - new, false);
-	return new;
-}
-
 static unsigned char
 unescape_single_char(unsigned char c, core_yyscan_t yyscanner)
 {
diff --git a/src/fe_utils/psqlscan.l b/src/fe_utils/psqlscan.l
index ce20936339..71ada0b72e 100644
--- a/src/fe_utils/psqlscan.l
+++ b/src/fe_utils/psqlscan.l
@@ -86,6 +86,8 @@ extern void psql_yyset_column(int column_no, yyscan_t yyscanner);
 %option noinput
 %option nounput
 %option noyywrap
+%option noyy_top_state
+%option stack
 %option warn
 %option prefix="psql_yy"
 
@@ -114,12 +116,11 @@ extern void psql_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
- *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
  *  <xus> quoted string with Unicode escapes
- *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
  *
  * Note: we intentionally don't mimic the backend's <xeu> state; we have
  * no need to distinguish it from <xe> state, and no good way to get out
@@ -132,12 +133,11 @@ extern void psql_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
-%x xuiend
 %x xus
-%x xusend
 
 /*
  * In order to make the world safe for Windows and Mac clients as well as
@@ -177,19 +177,18 @@ special_whitespace		({space}+|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
+quote			'
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue	{whitespace_with_newline}{quote}
+
 /*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
  */
-quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+quotecontinuefail	{whitespace}*"-"?
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -250,21 +249,12 @@ xdstop			{dquote}
 xddouble		{dquote}{dquote}
 xdinside		[^"]+
 
-/* Unicode escapes */
-uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
-/* error rule to avoid backup */
-uescapefail		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
-
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
 
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
 
-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1		{uescapefail}?
-xustop2		{uescape}
-
 /* error rule to avoid backup */
 xufailed		[uU]&
 
@@ -399,7 +389,7 @@ other			.
 
 {xcstart}		{
 					cur_state->xcdepth = 0;
-					BEGIN(xc);
+					yy_push_state(xc, yyscanner);
 					/* Put back any characters past slash-star; see above */
 					yyless(2);
 					ECHO;
@@ -415,7 +405,7 @@ other			.
 
 {xcstop}		{
 					if (cur_state->xcdepth <= 0)
-						BEGIN(INITIAL);
+						yy_pop_state(yyscanner);
 					else
 						cur_state->xcdepth--;
 					ECHO;
@@ -435,23 +425,13 @@ other			.
 } /* <xc> */
 
 {xbstart}		{
-					BEGIN(xb);
-					ECHO;
-				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
+					yy_push_state(xb, yyscanner);
 					ECHO;
 				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					ECHO;
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					ECHO;
-				}
 
 {xhstart}		{
 					/* Hexadecimal bit type.
@@ -460,13 +440,7 @@ other			.
 					 * In the meantime, place a leading "x" on the string
 					 * to mark it for the input routine as a hex string.
 					 */
-					BEGIN(xh);
-					ECHO;
-				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
+					yy_push_state(xh, yyscanner);
 					ECHO;
 				}
 
@@ -477,45 +451,53 @@ other			.
 
 {xqstart}		{
 					if (cur_state->std_strings)
-						BEGIN(xq);
+						yy_push_state(xq, yyscanner);
 					else
-						BEGIN(xe);
+						yy_push_state(xe, yyscanner);
 					ECHO;
 				}
 {xestart}		{
-					BEGIN(xe);
+					yy_push_state(xe, yyscanner);
 					ECHO;
 				}
 {xusstart}		{
-					BEGIN(xus);
-					ECHO;
-				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
+					yy_push_state(xus, yyscanner);
 					ECHO;
 				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
-					BEGIN(xusend);
+
+<xb,xh,xq,xe,xus>{quote} {
+					/*
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the end quote was in fact
+					 * the end of the string.  To reduce the lexer table size,
+					 * we use a single "xqs" state to do the lookahead for all
+					 * types of strings.
+					 */
+					yy_push_state(xqs, yyscanner);
 					ECHO;
 				}
-<xusend>{whitespace} {
+<xqs>{quotecontinue} {
+					/*
+					 * Found a quote continuation, so return to the in-quote
+					 * state and continue scanning the literal.
+					 */
+					yy_pop_state(yyscanner);
 					ECHO;
 				}
-<xusend>{other} |
-<xusend>{xustop1} {
+<xqs>{quotecontinuefail} |
+<xqs>{other}	{
+					/*
+					 * Failed to see a quote continuation.  Throw back
+					 * everything after the end quote.
+					 */
 					yyless(0);
-					BEGIN(INITIAL);
-					ECHO;
-				}
-<xusend>{xustop2} {
-					BEGIN(INITIAL);
-					ECHO;
+
+					/* go back to state before string start */
+					yy_pop_state(yyscanner);
+					yy_pop_state(yyscanner);
 				}
+
 <xq,xe,xus>{xqdouble} {
 					ECHO;
 				}
@@ -540,9 +522,6 @@ other			.
 <xe>{xehexesc}  {
 					ECHO;
 				}
-<xq,xe,xus>{quotecontinue} {
-					ECHO;
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					ECHO;
@@ -550,7 +529,7 @@ other			.
 
 {dolqdelim}		{
 					cur_state->dolqstart = pg_strdup(yytext);
-					BEGIN(xdolq);
+					yy_push_state(xdolq, yyscanner);
 					ECHO;
 				}
 {dolqfailed}	{
@@ -563,7 +542,7 @@ other			.
 					{
 						free(cur_state->dolqstart);
 						cur_state->dolqstart = NULL;
-						BEGIN(INITIAL);
+						yy_pop_state(yyscanner);
 					}
 					else
 					{
@@ -588,35 +567,22 @@ other			.
 				}
 
 {xdstart}		{
-					BEGIN(xd);
+					yy_push_state(xd, yyscanner);
 					ECHO;
 				}
 {xuistart}		{
-					BEGIN(xui);
+					yy_push_state(xui, yyscanner);
 					ECHO;
 				}
 <xd>{xdstop}	{
-					BEGIN(INITIAL);
-					ECHO;
-				}
-<xui>{dquote} {
-					yyless(1);
-					BEGIN(xuiend);
-					ECHO;
-				}
-<xuiend>{whitespace} {
+					yy_pop_state(yyscanner);
 					ECHO;
 				}
-<xuiend>{other} |
-<xuiend>{xustop1} {
-					yyless(0);
-					BEGIN(INITIAL);
-					ECHO;
-				}
-<xuiend>{xustop2}	{
-					BEGIN(INITIAL);
+<xui>{dquote}	{
+					yy_pop_state(yyscanner);
 					ECHO;
 				}
+
 <xd,xui>{xddouble}	{
 					ECHO;
 				}
@@ -1084,8 +1050,7 @@ psql_scan(PsqlScanState state,
 			switch (state->start_state)
 			{
 				case INITIAL:
-				case xuiend:	/* we treat these like INITIAL */
-				case xusend:
+				case xqs:		/* we treat this like INITIAL */
 					if (state->paren_depth > 0)
 					{
 						result = PSCAN_INCOMPLETE;
@@ -1240,7 +1205,8 @@ psql_scan_reselect_sql_lexer(PsqlScanState state)
 bool
 psql_scan_in_quote(PsqlScanState state)
 {
-	return state->start_state != INITIAL;
+	return state->start_state != INITIAL &&
+			state->start_state != xqs;
 }
 
 /*
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 3e3e6c470e..0c4cb9c7d0 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -508,6 +508,27 @@ typedef uint32 (*utf_local_conversion_func) (uint32 code);
 								   (destencoding))
 
 
+/*
+ * Some handy functions for Unicode-specific tests.
+ */
+static inline bool
+is_utf16_surrogate_first(pg_wchar c)
+{
+	return (c >= 0xD800 && c <= 0xDBFF);
+}
+
+static inline bool
+is_utf16_surrogate_second(pg_wchar c)
+{
+	return (c >= 0xDC00 && c <= 0xDFFF);
+}
+
+static inline pg_wchar
+surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second)
+{
+	return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
+}
+
 /*
  * These functions are considered part of libpq's exported API and
  * are also declared in libpq-fe.h.
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 00ace8425e..5893d317d8 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -416,6 +416,7 @@ PG_KEYWORD("truncate", TRUNCATE, UNRESERVED_KEYWORD)
 PG_KEYWORD("trusted", TRUSTED, UNRESERVED_KEYWORD)
 PG_KEYWORD("type", TYPE_P, UNRESERVED_KEYWORD)
 PG_KEYWORD("types", TYPES_P, UNRESERVED_KEYWORD)
+PG_KEYWORD("uescape", UESCAPE, UNRESERVED_KEYWORD)
 PG_KEYWORD("unbounded", UNBOUNDED, UNRESERVED_KEYWORD)
 PG_KEYWORD("uncommitted", UNCOMMITTED, UNRESERVED_KEYWORD)
 PG_KEYWORD("unencrypted", UNENCRYPTED, UNRESERVED_KEYWORD)
diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h
index 731a2bd264..b61ac65f54 100644
--- a/src/include/parser/scanner.h
+++ b/src/include/parser/scanner.h
@@ -48,7 +48,7 @@ typedef union core_YYSTYPE
  * However, those are not defined in this file, because bison insists on
  * defining them for itself.  The token codes used by the core scanner are
  * the ASCII characters plus these:
- *	%token <str>	IDENT FCONST SCONST BCONST XCONST Op
+ *	%token <str>	IDENT UIDENT FCONST SCONST UCONST BCONST XCONST Op
  *	%token <ival>	ICONST PARAM
  *	%token			TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
  *	%token			LESS_EQUALS GREATER_EQUALS NOT_EQUALS
diff --git a/src/interfaces/ecpg/preproc/ecpg.tokens b/src/interfaces/ecpg/preproc/ecpg.tokens
index 1d613af02f..8e0527fdb7 100644
--- a/src/interfaces/ecpg/preproc/ecpg.tokens
+++ b/src/interfaces/ecpg/preproc/ecpg.tokens
@@ -24,4 +24,3 @@
                 S_TYPEDEF
 
 %token CSTRING CVARIABLE CPP_LINE IP
-%token DOLCONST ECONST NCONST UCONST UIDENT
diff --git a/src/interfaces/ecpg/preproc/ecpg.trailer b/src/interfaces/ecpg/preproc/ecpg.trailer
index f58b41e675..784d1d199e 100644
--- a/src/interfaces/ecpg/preproc/ecpg.trailer
+++ b/src/interfaces/ecpg/preproc/ecpg.trailer
@@ -1719,46 +1719,13 @@ ecpg_bconst:	BCONST		{ $$ = make_name(); } ;
 
 ecpg_fconst:	FCONST		{ $$ = make_name(); } ;
 
-ecpg_sconst:
-		SCONST
-		{
-			/* could have been input as '' or $$ */
-			$$ = (char *)mm_alloc(strlen($1) + 3);
-			$$[0]='\'';
-			strcpy($$+1, $1);
-			$$[strlen($1)+1]='\'';
-			$$[strlen($1)+2]='\0';
-			free($1);
-		}
-		| ECONST
-		{
-			$$ = (char *)mm_alloc(strlen($1) + 4);
-			$$[0]='E';
-			$$[1]='\'';
-			strcpy($$+2, $1);
-			$$[strlen($1)+2]='\'';
-			$$[strlen($1)+3]='\0';
-			free($1);
-		}
-		| NCONST
-		{
-			$$ = (char *)mm_alloc(strlen($1) + 4);
-			$$[0]='N';
-			$$[1]='\'';
-			strcpy($$+2, $1);
-			$$[strlen($1)+2]='\'';
-			$$[strlen($1)+3]='\0';
-			free($1);
-		}
-		| UCONST	{ $$ = $1; }
-		| DOLCONST	{ $$ = $1; }
+ecpg_sconst:	SCONST		{ $$ = $1; }
 		;
 
 ecpg_xconst:	XCONST		{ $$ = make_name(); } ;
 
-ecpg_ident:	IDENT		{ $$ = make_name(); }
+ecpg_ident:	IDENT		{ $$ = $1; }
 		| CSTRING	{ $$ = make3_str(mm_strdup("\""), $1, mm_strdup("\"")); }
-		| UIDENT	{ $$ = $1; }
 		;
 
 quoted_ident_stringvar: name
diff --git a/src/interfaces/ecpg/preproc/ecpg.type b/src/interfaces/ecpg/preproc/ecpg.type
index 9497b91b9d..ffafa82af9 100644
--- a/src/interfaces/ecpg/preproc/ecpg.type
+++ b/src/interfaces/ecpg/preproc/ecpg.type
@@ -122,12 +122,8 @@
 %type <str> CSTRING
 %type <str> CPP_LINE
 %type <str> CVARIABLE
-%type <str> DOLCONST
-%type <str> ECONST
-%type <str> NCONST
 %type <str> SCONST
-%type <str> UCONST
-%type <str> UIDENT
+%type <str> IDENT
 
 %type  <struct_union> s_struct_union_symbol
 
diff --git a/src/interfaces/ecpg/preproc/parse.pl b/src/interfaces/ecpg/preproc/parse.pl
index 3619706cdc..47300b7083 100644
--- a/src/interfaces/ecpg/preproc/parse.pl
+++ b/src/interfaces/ecpg/preproc/parse.pl
@@ -218,8 +218,8 @@ sub main
 				if ($a eq 'IDENT' && $prior eq '%nonassoc')
 				{
 
-					# add two more tokens to the list
-					$str = $str . "\n%nonassoc CSTRING\n%nonassoc UIDENT";
+					# add one more token to the list
+					$str = $str . "\n%nonassoc CSTRING";
 				}
 				$prior = $a;
 			}
diff --git a/src/interfaces/ecpg/preproc/parser.c b/src/interfaces/ecpg/preproc/parser.c
index abae89d51b..ecf3c1b7e5 100644
--- a/src/interfaces/ecpg/preproc/parser.c
+++ b/src/interfaces/ecpg/preproc/parser.c
@@ -6,6 +6,10 @@
  * This should match src/backend/parser/parser.c, except that we do not
  * need to bother with re-entrant interfaces.
  *
+ * Note: ECPG doesn't report error location like the backend does.
+ * This file will need work if we ever want it to.
+ * See backend/parser/parser.c
+ *
  *
  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -27,8 +31,10 @@ static int	lookahead_token;	/* one-token lookahead */
 static YYSTYPE lookahead_yylval;	/* yylval for lookahead token */
 static YYLTYPE lookahead_yylloc;	/* yylloc for lookahead token */
 static char *lookahead_yytext;	/* start current token */
-static char *lookahead_end;		/* end of current token */
-static char lookahead_hold_char;	/* to be put back at *lookahead_end */
+
+
+static bool check_uescapechar(unsigned char escape);
+static bool ecpg_isspace(char ch);
 
 
 /*
@@ -43,13 +49,16 @@ static char lookahead_hold_char;	/* to be put back at *lookahead_end */
  * words.  Furthermore it's not clear how to do that without re-introducing
  * scanner backtrack, which would cost more performance than this filter
  * layer does.
+ *
+ * We also use this filter to convert UIDENT and UCONST sequences into
+ * plain IDENT and SCONST tokens.  While that could be handled by additional
+ * productions in the main grammar, it's more efficient to do it like this.
  */
 int
 filtered_base_yylex(void)
 {
 	int			cur_token;
 	int			next_token;
-	int			cur_token_length;
 	YYSTYPE		cur_yylval;
 	YYLTYPE		cur_yylloc;
 	char	   *cur_yytext;
@@ -61,41 +70,26 @@ filtered_base_yylex(void)
 		base_yylval = lookahead_yylval;
 		base_yylloc = lookahead_yylloc;
 		base_yytext = lookahead_yytext;
-		*lookahead_end = lookahead_hold_char;
 		have_lookahead = false;
 	}
 	else
 		cur_token = base_yylex();
 
 	/*
-	 * If this token isn't one that requires lookahead, just return it.  If it
-	 * does, determine the token length.  (We could get that via strlen(), but
-	 * since we have such a small set of possibilities, hardwiring seems
-	 * feasible and more efficient.)
+	 * If this token isn't one that requires lookahead, just return it.
 	 */
 	switch (cur_token)
 	{
 		case NOT:
-			cur_token_length = 3;
-			break;
 		case NULLS_P:
-			cur_token_length = 5;
-			break;
 		case WITH:
-			cur_token_length = 4;
+		case UIDENT:
+		case UCONST:
 			break;
 		default:
 			return cur_token;
 	}
 
-	/*
-	 * Identify end+1 of current token.  base_yylex() has temporarily stored a
-	 * '\0' here, and will undo that when we call it again.  We need to redo
-	 * it to fully revert the lookahead call for error reporting purposes.
-	 */
-	lookahead_end = base_yytext + cur_token_length;
-	Assert(*lookahead_end == '\0');
-
 	/* Save and restore lexer output variables around the call */
 	cur_yylval = base_yylval;
 	cur_yylloc = base_yylloc;
@@ -113,10 +107,6 @@ filtered_base_yylex(void)
 	base_yylloc = cur_yylloc;
 	base_yytext = cur_yytext;
 
-	/* Now revert the un-truncation of the current token */
-	lookahead_hold_char = *lookahead_end;
-	*lookahead_end = '\0';
-
 	have_lookahead = true;
 
 	/* Replace cur_token if needed, based on lookahead */
@@ -157,7 +147,81 @@ filtered_base_yylex(void)
 					break;
 			}
 			break;
+		case UIDENT:
+		case UCONST:
+			/* Look ahead for UESCAPE */
+			if (next_token == UESCAPE)
+			{
+				/* Yup, so get third token, which had better be SCONST */
+				const char *escstr;
+
+				/* Again save and restore lexer output variables around the call */
+				cur_yylval = base_yylval;
+				cur_yylloc = base_yylloc;
+				cur_yytext = base_yytext;
+
+				/* Get third token */
+				next_token = base_yylex();
+
+				if (next_token != SCONST)
+					mmerror(PARSE_ERROR, ET_ERROR, "UESCAPE must be followed by a simple string literal");
+
+				/* Save and check escape string, which the scanner returns with quotes */
+				escstr = base_yylval.str;
+				if (strlen(escstr) != 3 || !check_uescapechar(escstr[1]))
+					mmerror(PARSE_ERROR, ET_ERROR, "invalid Unicode escape character");
+
+				base_yylval = cur_yylval;
+				base_yylloc = cur_yylloc;
+				base_yytext = cur_yytext;
+
+				/* Combine 3 tokens into 1 */
+				base_yylval.str = psprintf("%s uescape %s", base_yylval.str, escstr);
+
+				/*
+				 * We don't need to revert the un-truncation of UESCAPE.  What we
+				 * do want to do is clear have_lookahead, thereby consuming
+				 * all three tokens.
+				 */
+				have_lookahead = false;
+			}
+
+			if (cur_token == UIDENT)
+				cur_token = IDENT;
+			else if (cur_token == UCONST)
+				cur_token = SCONST;
+			break;
 	}
 
 	return cur_token;
 }
+
+// WIP: if we want to check this here, find a better location
+/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
+static bool
+check_uescapechar(unsigned char escape)
+{
+	if (isxdigit(escape)
+		|| escape == '+'
+		|| escape == '\''
+		|| escape == '"'
+		|| ecpg_isspace(escape))
+		return false;
+	else
+		return true;
+}
+
+/*
+ * ecpg_isspace() --- return true if flex scanner considers char whitespace
+ */
+static bool
+ecpg_isspace(char ch)
+{
+	if (ch == ' ' ||
+		ch == '\t' ||
+		ch == '\n' ||
+		ch == '\r' ||
+		ch == '\f')
+		return true;
+	return false;
+}
diff --git a/src/interfaces/ecpg/preproc/pgc.l b/src/interfaces/ecpg/preproc/pgc.l
index 488c89b7f4..c83b21118b 100644
--- a/src/interfaces/ecpg/preproc/pgc.l
+++ b/src/interfaces/ecpg/preproc/pgc.l
@@ -6,6 +6,9 @@
  *
  * This is a modified version of src/backend/parser/scan.l
  *
+ * The ecpg scanner is not backup-free, so the fail rules are
+ * only here to simplify syncing this file with scan.l.
+ *
  *
  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -61,7 +64,6 @@ static bool isdefine(void);
 static bool isinformixdefine(void);
 
 char *token_start;
-static int state_before;
 
 struct _yy_buffer
 {
@@ -89,6 +91,7 @@ static struct _if_value
 %option nodefault
 %option noinput
 %option noyywrap
+%option stack
 %option warn
 %option yylineno
 %option prefix="base_yy"
@@ -105,13 +108,13 @@ static struct _if_value
  * and to eliminate parsing troubles for numeric strings.
  * Exclusive states:
  *  <xb> bit string literal
- *  <xcc> extended C-style comments in C
- *  <xcsql> extended C-style comments in SQL
+ *  <xc> extended C-style comments
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xdc> double-quoted strings in C
  *  <xh> hexadecimal numeric string
  *  <xn> national character quoted strings
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xqc> single-quoted strings in C
  *  <xdolq> $foo$ quoted strings
@@ -120,18 +123,21 @@ static struct _if_value
  *  <xcond> condition of an EXEC SQL IFDEF construct
  *  <xskip> skipping the inactive part of an EXEC SQL IFDEF construct
  *
+ * Note: we intentionally don't mimic the backend's <xeu> state; we have
+ * no need to distinguish it from <xe> state.
+ *
  * Remember to add an <<EOF>> case whenever you add a new exclusive state!
  * The default one is probably not the right thing.
  */
 
 %x xb
-%x xcc
-%x xcsql
+%x xc
 %x xd
 %x xdc
 %x xh
 %x xn
 %x xq
+%x xqs
 %x xe
 %x xqc
 %x xdolq
@@ -181,9 +187,17 @@ horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{whitespace}*)
 
 quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue	{whitespace_with_newline}{quote}
+
+/*
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
+ */
+quotecontinuefail	{whitespace}*"-"?
 
 /* Bit string
  */
@@ -237,19 +251,11 @@ xdstop			{dquote}
 xddouble		{dquote}{dquote}
 xdinside		[^"]+
 
-/* Unicode escapes */
-/* (The ecpg scanner is not backup-free, so the fail rules in scan.l are
- * not needed here, but could be added if desired.)
- */
-uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
-
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
-xuistop			{dquote}({whitespace}*{uescape})?
 
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
-xusstop			{quote}({whitespace}*{uescape})?
 
 /* special stuff for C strings */
 xdcqq			\\\\
@@ -408,54 +414,56 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 {whitespace}	{
 					/* ignore */
 				}
+} /* <SQL> */
 
+<C,SQL>{
 {xcstart}		{
 					token_start = yytext;
-					state_before = YYSTATE;
 					xcdepth = 0;
-					BEGIN(xcsql);
+					yy_push_state(xc);
 					/* Put back any characters past slash-star; see above */
 					yyless(2);
 					fputs("/*", yyout);
 				}
-} /* <SQL> */
+} /* <C,SQL> */
 
-<C>{xcstart}	{
-					token_start = yytext;
-					state_before = YYSTATE;
-					xcdepth = 0;
-					BEGIN(xcc);
-					/* Put back any characters past slash-star; see above */
-					yyless(2);
-					fputs("/*", yyout);
-				}
-<xcc>{xcstart}	{ ECHO; }
-<xcsql>{xcstart}	{
-					xcdepth++;
-					/* Put back any characters past slash-star; see above */
-					yyless(2);
-					fputs("/_*", yyout);
-				}
-<xcsql>{xcstop}	{
-					if (xcdepth <= 0)
+<xc>{
+{xcstart}		{
+					if (yy_top_state() == SQL)
 					{
-						ECHO;
-						BEGIN(state_before);
-						token_start = NULL;
+						xcdepth++;
+						/* Put back any characters past slash-star; see above */
+						yyless(2);
+						fputs("/_*", yyout);
 					}
-					else
+					else if (yy_top_state() == C)
 					{
-						xcdepth--;
-						fputs("*_/", yyout);
+						ECHO;
 					}
 				}
-<xcc>{xcstop}	{
-					ECHO;
-					BEGIN(state_before);
-					token_start = NULL;
+{xcstop}		{
+					if (yy_top_state() == SQL)
+					{
+						if (xcdepth <= 0)
+						{
+							ECHO;
+							yy_pop_state();
+							token_start = NULL;
+						}
+						else
+						{
+							xcdepth--;
+							fputs("*_/", yyout);
+						}
+					}
+					else if (yy_top_state() == C)
+					{
+						ECHO;
+						yy_pop_state();
+						token_start = NULL;
+					}
 				}
 
-<xcc,xcsql>{
 {xcinside}		{
 					ECHO;
 				}
@@ -471,56 +479,34 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 <<EOF>>			{
 					mmfatal(PARSE_ERROR, "unterminated /* comment");
 				}
-} /* <xcc,xcsql> */
+} /* <xc> */
 
 <SQL>{
 {xbstart}		{
 					token_start = yytext;
-					BEGIN(xb);
+					yy_push_state(xb);
 					startlit();
 					addlitchar('b');
 				}
 } /* <SQL> */
 
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(SQL);
-					if (literalbuf[strspn(literalbuf, "01") + 1] != '\0')
-						mmerror(PARSE_ERROR, ET_ERROR, "invalid bit string literal");
-					base_yylval.str = mm_strdup(literalbuf);
-					return BCONST;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					addlit(yytext, yyleng);
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					/* ignore */
-				}
 <xb><<EOF>>		{ mmfatal(PARSE_ERROR, "unterminated bit string literal"); }
 
 <SQL>{xhstart}	{
 					token_start = yytext;
-					BEGIN(xh);
+					yy_push_state(xh);
 					startlit();
 					addlitchar('x');
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(SQL);
-					base_yylval.str = mm_strdup(literalbuf);
-					return XCONST;
-				}
-
 <xh><<EOF>>		{ mmfatal(PARSE_ERROR, "unterminated hexadecimal string literal"); }
 
 <C>{xqstart}	{
 					token_start = yytext;
-					state_before = YYSTATE;
-					BEGIN(xqc);
+					yy_push_state(xqc);
 					startlit();
 				}
 
@@ -530,59 +516,98 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 					 * Transfer it as-is to the backend.
 					 */
 					token_start = yytext;
-					state_before = YYSTATE;
-					BEGIN(xn);
+					yy_push_state(xn);
 					startlit();
 				}
 
 {xqstart}		{
 					token_start = yytext;
-					state_before = YYSTATE;
-					BEGIN(xq);
+					yy_push_state(xq);
 					startlit();
 				}
 {xestart}		{
 					token_start = yytext;
-					state_before = YYSTATE;
-					BEGIN(xe);
+					yy_push_state(xe);
 					startlit();
 				}
 {xusstart}		{
 					token_start = yytext;
-					state_before = YYSTATE;
-					BEGIN(xus);
+					yy_push_state(xus);
 					startlit();
-					addlit(yytext, yyleng);
 				}
 } /* <SQL> */
 
-<xq,xqc>{quotestop} |
-<xq,xqc>{quotefail} {
-					yyless(1);
-					BEGIN(state_before);
-					base_yylval.str = mm_strdup(literalbuf);
-					return SCONST;
-				}
-<xe>{quotestop} |
-<xe>{quotefail} {
-					yyless(1);
-					BEGIN(state_before);
-					base_yylval.str = mm_strdup(literalbuf);
-					return ECONST;
+<xb,xh,xq,xqc,xe,xn,xus>{quote} {
+					/*
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the end quote was in fact
+					 * the end of the string.  To reduce the lexer table size,
+					 * we use a single "xqs" state to do the lookahead for all
+					 * types of strings.
+					 */
+					yy_push_state(xqs);
 				}
-<xn>{quotestop} |
-<xn>{quotefail} {
-					yyless(1);
-					BEGIN(state_before);
-					base_yylval.str = mm_strdup(literalbuf);
-					return NCONST;
+<xqs>{quotecontinue} {
+					/*
+					 * Found a quote continuation, so return to the in-quote
+					 * state and continue scanning the literal.
+					 */
+					yy_pop_state();
 				}
-<xus>{xusstop} {
-					addlit(yytext, yyleng);
-					BEGIN(state_before);
-					base_yylval.str = mm_strdup(literalbuf);
-					return UCONST;
+<xqs>{quotecontinuefail} |
+<xqs><<EOF>> |
+<xqs>{other}	{
+					int token;
+
+					/*
+					 * Failed to see a quote continuation.  Throw back
+					 * everything after the end quote, and handle the string
+					 * according to the state we were in previously.
+					 */
+					yyless(0);
+
+					switch (yy_top_state())
+					{
+						case xb:
+							if (literalbuf[strspn(literalbuf, "01") + 1] != '\0')
+								mmerror(PARSE_ERROR, ET_ERROR, "invalid bit string literal");
+							base_yylval.str = mm_strdup(literalbuf);
+							token = BCONST;
+							break;
+						case xh:
+							base_yylval.str = mm_strdup(literalbuf);
+							token = XCONST;
+							break;
+						case xq:
+							/* fallthrough */
+						case xqc:
+							base_yylval.str = psprintf("'%s'", literalbuf);
+							token = SCONST;
+							break;
+						case xe:
+							base_yylval.str = psprintf("E'%s'", literalbuf);
+							token = SCONST;
+							break;
+						case xn:
+							base_yylval.str = psprintf("N'%s'", literalbuf);
+							token = SCONST;
+							break;
+						case xus:
+							base_yylval.str = psprintf("U&'%s'", literalbuf);
+							token = UCONST;
+							break;
+						default:
+							mmfatal(PARSE_ERROR, "unhandled previous state in xqs\n");
+					}
+
+					/* go back to state before string start */
+					yy_pop_state();
+					yy_pop_state();
+
+					return token;
 				}
+
 <xq,xe,xn,xus>{xqdouble}	{ addlitchar('\''); }
 <xqc>{xqcquote}	{
 					addlitchar('\\');
@@ -604,9 +629,6 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 <xe>{xehexesc}  {
 					addlit(yytext, yyleng);
 				}
-<xq,xqc,xe,xn,xus>{quotecontinue}	{
-					/* ignore */
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					addlitchar(yytext[0]);
@@ -619,7 +641,7 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 					if (dolqstart)
 						free(dolqstart);
 					dolqstart = mm_strdup(yytext);
-					BEGIN(xdolq);
+					yy_push_state(xdolq);
 					startlit();
 					addlit(yytext, yyleng);
 				}
@@ -637,9 +659,9 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 						addlit(yytext, yyleng);
 						free(dolqstart);
 						dolqstart = NULL;
-						BEGIN(SQL);
+						yy_pop_state();
 						base_yylval.str = mm_strdup(literalbuf);
-						return DOLCONST;
+						return SCONST;
 					}
 					else
 					{
@@ -666,20 +688,17 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 
 <SQL>{
 {xdstart}		{
-					state_before = YYSTATE;
-					BEGIN(xd);
+					yy_push_state(xd);
 					startlit();
 				}
 {xuistart}		{
-					state_before = YYSTATE;
-					BEGIN(xui);
+					yy_push_state(xui);
 					startlit();
-					addlit(yytext, yyleng);
 				}
 } /* <SQL> */
 
 <xd>{xdstop}	{
-					BEGIN(state_before);
+					yy_pop_state();
 					if (literallen == 0)
 						mmerror(PARSE_ERROR, ET_ERROR, "zero-length delimited identifier");
 					/* The backend will truncate the identifier here. We do not as it does not change the result. */
@@ -687,17 +706,16 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 					return CSTRING;
 				}
 <xdc>{xdstop}	{
-					BEGIN(state_before);
+					yy_pop_state();
 					base_yylval.str = mm_strdup(literalbuf);
 					return CSTRING;
 				}
-<xui>{xuistop}	{
-					BEGIN(state_before);
+<xui>{dquote}	{
+					yy_pop_state();
 					if (literallen == 2) /* "U&" */
 						mmerror(PARSE_ERROR, ET_ERROR, "zero-length delimited identifier");
 					/* The backend will truncate the identifier here. We do not as it does not change the result. */
-					addlit(yytext, yyleng);
-					base_yylval.str = mm_strdup(literalbuf);
+					base_yylval.str = psprintf("U&\"%s\"", literalbuf);
 					return UIDENT;
 				}
 <xd,xui>{xddouble}	{
@@ -708,8 +726,7 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 				}
 <xd,xui><<EOF>>	{ mmfatal(PARSE_ERROR, "unterminated quoted identifier"); }
 <C>{xdstart}	{
-					state_before = YYSTATE;
-					BEGIN(xdc);
+					yy_push_state(xdc);
 					startlit();
 				}
 <xdc>{xdcinside}	{
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index 454071a81f..3cdf9289c4 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -232,7 +232,7 @@ static	void			check_raise_parameters(PLpgSQL_stmt_raise *stmt);
  * Some of these are not directly referenced in this file, but they must be
  * here anyway.
  */
-%token <str>	IDENT FCONST SCONST BCONST XCONST Op
+%token <str>	IDENT UIDENT FCONST SCONST UCONST BCONST XCONST Op
 %token <ival>	ICONST PARAM
 %token			TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
 %token			LESS_EQUALS GREATER_EQUALS NOT_EQUALS
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 6d96843e5b..60cb86193c 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -48,17 +48,21 @@ SELECT 'tricky' AS U&"\" UESCAPE '!';
 (1 row)
 
 SELECT U&'wrong: \061';
-ERROR:  invalid Unicode escape value at or near "\061'"
+ERROR:  invalid Unicode escape value
 LINE 1: SELECT U&'wrong: \061';
                          ^
 SELECT U&'wrong: \+0061';
-ERROR:  invalid Unicode escape value at or near "\+0061'"
+ERROR:  invalid Unicode escape value
 LINE 1: SELECT U&'wrong: \+0061';
                          ^
+SELECT U&'wrong: +0061' UESCAPE +;
+ERROR:  UESCAPE must be followed by a simple string literal at or near "+"
+LINE 1: SELECT U&'wrong: +0061' UESCAPE +;
+                                        ^
 SELECT U&'wrong: +0061' UESCAPE '+';
-ERROR:  invalid Unicode escape character at or near "+'"
+ERROR:  invalid Unicode escape character at or near "'+'"
 LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
-                                         ^
+                                        ^
 SET standard_conforming_strings TO off;
 SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
 ERROR:  unsafe use of string constant with Unicode escapes
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index 0afb94964b..c5cd15142a 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -27,6 +27,7 @@ SELECT 'tricky' AS U&"\" UESCAPE '!';
 
 SELECT U&'wrong: \061';
 SELECT U&'wrong: \+0061';
+SELECT U&'wrong: +0061' UESCAPE +;
 SELECT U&'wrong: +0061' UESCAPE '+';
 
 SET standard_conforming_strings TO off;

#26

John Naylor

john.naylor@2ndquadrant.com

about 6 years ago

In reply to: John Naylor (#25)

Re: benchmarking Flex practices

I wrote:

> I no longer use state variables to track scanner state, and in fact I
> removed the existing "state_before" variable in ECPG. Instead, I used
> the Flex builtins yy_push_state(), yy_pop_state(), and yy_top_state().
> These have been a feature for a long time, it seems, so I think we're
> okay as far as portability. I think it's cleaner this way, and
> possibly faster.

I thought I should get some actual numbers to test, and the results
are encouraging:

          master   v10
info      1.56s    1.51s
str       1.18s    1.14s
unicode   1.33s    1.34s
uescape   1.44s    1.58s
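
For readers who want the table above in relative terms, the following C sketch computes the percentage change of v10 against master for each row. The timings are copied from the table; the struct and function names are made up for this illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the timing table: best time in seconds for each build. */
struct timing
{
    const char *test;
    double      master;
    double      v10;
};

/* The four rows reported in the message. */
static const struct timing timings[] = {
    {"info", 1.56, 1.51},
    {"str", 1.18, 1.14},
    {"unicode", 1.33, 1.34},
    {"uescape", 1.44, 1.58},
};

/* Relative change of "after" versus "before", in percent (positive = slower). */
static double
percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* Print one line per test with its relative change. */
static void
print_changes(void)
{
    size_t      i;

    for (i = 0; i < sizeof(timings) / sizeof(timings[0]); i++)
        printf("%-8s %+.1f%%\n", timings[i].test,
               percent_change(timings[i].master, timings[i].v10));
}
```

Calling print_changes() prints one line per test. The uescape row comes out a little under 10% slower, and the other three rows stay within a few percent of master.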

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#27

Tom Lane

tgl@sss.pgh.pa.us

almost 6 years ago

In reply to: John Naylor (#26)

Re: benchmarking Flex practices

John Naylor <john.naylor@2ndquadrant.com> writes:

> I no longer use state variables to track scanner state, and in fact I
> removed the existing "state_before" variable in ECPG. Instead, I used
> the Flex builtins yy_push_state(), yy_pop_state(), and yy_top_state().
> These have been a feature for a long time, it seems, so I think we're
> okay as far as portability. I think it's cleaner this way, and
> possibly faster.

Hmm ... after a bit of research I agree that these functions are not
a portability hazard. They are present at least as far back as flex
2.5.33 which is as old as we've got in the buildfarm.

However, I'm less excited about them from a performance standpoint.
The BEGIN() macro expands to (ordinarily)

yyg->yy_start = integer-constant

which is surely pretty cheap. However, yy_push_state is substantially
more expensive than that, not least because the first invocation in
a parse cycle will involve a malloc() or palloc(). Likewise yy_pop_state
is multiple times more expensive than plain BEGIN().
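
The cost difference described above can be pictured with a simplified model of a start-condition stack. This is a sketch written for illustration, not the code Flex generates: the struct, the model_* function names, and the growth step are all assumptions, and realloc() stands in for whatever allocator the scanner is configured to use.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the scanner's per-instance data. */
struct scanner_model
{
    int         start;          /* current start condition */
    int        *stack;          /* saved start conditions, allocated lazily */
    int         stack_ptr;      /* number of entries in use */
    int         stack_depth;    /* number of entries allocated */
};

#define STACK_INCR 25           /* growth step; an arbitrary choice for this model */

/* Model of BEGIN(): a single store. */
static void
model_begin(struct scanner_model *s, int new_state)
{
    s->start = new_state;
}

/* Model of a push: bounds check, possible allocation, store, then BEGIN. */
static void
model_push_state(struct scanner_model *s, int new_state)
{
    if (s->stack_ptr >= s->stack_depth)
    {
        int        *grown;

        s->stack_depth += STACK_INCR;
        /* realloc(NULL, n) behaves like malloc(n), which covers the first push */
        grown = realloc(s->stack, s->stack_depth * sizeof(int));
        if (grown == NULL)
            abort();
        s->stack = grown;
    }
    s->stack[s->stack_ptr++] = s->start;
    model_begin(s, new_state);
}

/* Model of a pop: underflow check, load, then BEGIN. */
static void
model_pop_state(struct scanner_model *s)
{
    assert(s->stack_ptr > 0);
    model_begin(s, s->stack[--s->stack_ptr]);
}

/* Model of "top state": the state that was current before the last push. */
static int
model_top_state(const struct scanner_model *s)
{
    assert(s->stack_ptr > 0);
    return s->stack[s->stack_ptr - 1];
}
```

In this model, model_begin() is one assignment, while every push and pop adds a comparison and a memory access on top of that same assignment, and the first push also pays for an allocation.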

Now, I agree that this is negligible for ECPG's usage, so if
pushing/popping state is helpful there, let's go for it. But I am
not convinced it's negligible for the backend, and I also don't
see that we actually need to track any nested scanner states there.
So I'd rather stick to using BEGIN in the backend. Not sure about
psql.

BTW, while looking through the latest patch it struck me that
"UCONST" is an underspecified and potentially confusing name.
It doesn't indicate what kind of constant we're talking about,
for instance a C programmer could be forgiven for thinking
it means something like "123U". What do you think of "USCONST",
following UIDENT's lead of prefixing U onto whatever the
underlying token type is?

regards, tom lane

#28

John Naylor

john.naylor@2ndquadrant.com

almost 6 years ago

In reply to: Tom Lane (#27)

2 attachment(s)

Re: benchmarking Flex practices

On Mon, Jan 13, 2020 at 7:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Hmm ... after a bit of research I agree that these functions are not
> a portability hazard. They are present at least as far back as flex
> 2.5.33 which is as old as we've got in the buildfarm.
>
> However, I'm less excited about them from a performance standpoint.
> The BEGIN() macro expands to (ordinarily)
>
> yyg->yy_start = integer-constant
>
> which is surely pretty cheap. However, yy_push_state is substantially
> more expensive than that, not least because the first invocation in
> a parse cycle will involve a malloc() or palloc(). Likewise yy_pop_state
> is multiple times more expensive than plain BEGIN().
>
> Now, I agree that this is negligible for ECPG's usage, so if
> pushing/popping state is helpful there, let's go for it. But I am
> not convinced it's negligible for the backend, and I also don't
> see that we actually need to track any nested scanner states there.
> So I'd rather stick to using BEGIN in the backend. Not sure about
> psql.

Okay, removed in v11. The advantage of stack functions in ECPG was to
avoid having the two variables state_before_str_start and
state_before_str_stop. But if we don't use stack functions in the
backend, then consistency wins in my mind. Plus, it was easier for me
to revert the stack functions for all 3 scanners.
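
As a rough picture of how two plain variables can take the place of a state stack, here is a hand-written model of the quote-continuation lookahead. It is a sketch under stated assumptions, not the scanner from the patch: only the names state_before_str_start and state_before_str_stop come from the message above; the enums, the struct, and scan_literal() are hypothetical. It also ignores backslash escapes and comments inside the whitespace, and it reports E'' literals with their own token kind purely to show what the saved state is used for.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical start conditions and token kinds for this model only. */
enum start_cond { SC_INITIAL, SC_XQ, SC_XE, SC_XQS };
enum token { TOK_NONE, TOK_SCONST, TOK_ECONST };

struct lit_scanner
{
    enum start_cond state;                  /* current start condition */
    enum start_cond state_before_str_start; /* where to go after the literal */
    enum start_cond state_before_str_stop;  /* string state that saw the end quote */
};

/* Same character set as ecpg_isspace() in the patch. */
static int
is_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
}

/*
 * Scan one literal at the start of "in" and copy its contents to "out".
 * Returns the number of input bytes consumed, or 0 on error.
 */
static size_t
scan_literal(struct lit_scanner *s, const char *in,
             char *out, size_t outsz, enum token *tok)
{
    const char *p = in;
    size_t      n = 0;

    assert(outsz > 0);
    *tok = TOK_NONE;
    s->state_before_str_start = s->state;
    if (p[0] == '\'')
    {
        s->state = SC_XQ;
        p += 1;
    }
    else if ((p[0] == 'E' || p[0] == 'e') && p[1] == '\'')
    {
        s->state = SC_XE;
        p += 2;
    }
    else
        return 0;               /* not a string literal */

    for (;;)
    {
        if (s->state == SC_XQS)
        {
            /* Look ahead for whitespace containing a newline, then a quote. */
            const char *q = p;
            int         saw_newline = 0;

            for (; is_space(*q); q++)
                if (*q == '\n' || *q == '\r')
                    saw_newline = 1;
            if (saw_newline && *q == '\'')
            {
                /* Continuation: resume the string state we came from. */
                p = q + 1;
                s->state = s->state_before_str_stop;
                continue;
            }
            /* No continuation: leave the lookahead unconsumed and finish. */
            *tok = (s->state_before_str_stop == SC_XE) ? TOK_ECONST : TOK_SCONST;
            s->state = s->state_before_str_start;
            out[n] = '\0';
            return (size_t) (p - in);
        }

        /* In SC_XQ or SC_XE. */
        if (*p == '\'' && p[1] != '\'')
        {
            /* End quote: remember which string state we were in. */
            s->state_before_str_stop = s->state;
            s->state = SC_XQS;
            p++;
            continue;
        }
        if (*p == '\0' || n + 1 >= outsz)
        {
            /* Unterminated literal or output buffer full. */
            s->state = s->state_before_str_start;
            return 0;
        }
        out[n++] = *p;
        p += (*p == '\'') ? 2 : 1;  /* a doubled quote yields one quote */
    }
}
```

With this model, two literals separated by whitespace that contains a newline are merged into one token, while two literals separated only by a space are returned one at a time. The single SC_XQS state serves both string kinds because the variable records which one to resume or report.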

> BTW, while looking through the latest patch it struck me that
> "UCONST" is an underspecified and potentially confusing name.
> It doesn't indicate what kind of constant we're talking about,
> for instance a C programmer could be forgiven for thinking
> it means something like "123U". What do you think of "USCONST",
> following UIDENT's lead of prefixing U onto whatever the
> underlying token type is?

Makes perfect sense. Grepping through the source tree, indeed it seems
the replication command scanner is using UCONST for digits.

Some other cosmetic adjustments in ECPG parser.c:
-Previously I had a WIP comment about two functions that are copies
from elsewhere. In v11 I just noted that they are copied.
-I thought it'd be nicer if ECPG spelled UESCAPE in caps when
reconstructing the string.
-Corrected copy-paste-o in comment

Also:
-reverted some spurious whitespace changes
-revised scan.l comment about the performance benefits of no backtracking
-split the ECPG C-comment scanning cleanup into a separate patch, as I
did for v6. I include it here since it's related (merging scanner
states), but not relevant to making the core scanner smaller.
-wrote draft commit messages

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v11-0001-Reduce-size-of-backend-scanner-transition-array.patch (application/octet-stream)

From 403662f89a1e207bb29f46f58691414167aa2575 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Mon, 13 Jan 2020 17:47:31 +0800
Subject: [PATCH v11 1/2] Reduce size of backend scanner transition array.

Previously, the core scanner's yy_transition[] array had 37045
elements. Since that number is larger than INT16_MAX, Flex generated the
array to contain 32-bit integers. By reorganizing some of the bulkier
scanner rules, this array now has 20495 elements. The much smaller total
length, combined with the consequent use of 16-bit integers reduces the
binary size by over 200kB. This was accomplished in two ways:

1. Consolidate handling of quote continuations into a new start condition,
rather than duplicating that rule in all of five different string types.

2. Treat Unicode strings and identifiers followed by a UESCAPE sequence as
three separate tokens, rather than one. This necessitated teaching parser.c
to handle these appropriately before passing to the Bison parser. It was
possible to handle these in the grammar, but that approach was rejected
for performance and maintainability reasons.

Performance seems equal or slightly faster in most cases. The exception
is UESCAPE sequences. Lexing those is about 10% slower since the scanner
now has to be called three times rather than one. This is acceptable since
that feature is very rarely used.
---
 src/backend/parser/gram.y                     |   5 +-
 src/backend/parser/parser.c                   | 279 +++++++++++-
 src/backend/parser/scan.l                     | 413 +++---------------
 src/fe_utils/psqlscan.l                       | 121 ++---
 src/include/fe_utils/psqlscan_int.h           |   1 +
 src/include/mb/pg_wchar.h                     |  21 +
 src/include/parser/kwlist.h                   |   1 +
 src/include/parser/scanner.h                  |   3 +-
 src/interfaces/ecpg/preproc/ecpg.tokens       |   1 -
 src/interfaces/ecpg/preproc/ecpg.trailer      |  37 +-
 src/interfaces/ecpg/preproc/ecpg.type         |   6 +-
 src/interfaces/ecpg/preproc/parse.pl          |   4 +-
 src/interfaces/ecpg/preproc/parser.c          | 114 +++--
 src/interfaces/ecpg/preproc/pgc.l             | 178 ++++----
 .../ecpg/test/expected/preproc-strings.c      |   2 +-
 .../ecpg/test/expected/preproc-strings.stderr |   2 +-
 src/pl/plpgsql/src/pl_gram.y                  |   2 +-
 src/test/regress/expected/strings.out         |  12 +-
 src/test/regress/sql/strings.sql              |   1 +
 19 files changed, 620 insertions(+), 583 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index ad5be902b0..560a8ee45e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -601,7 +601,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
  * DOT_DOT is unused in the core SQL grammar, and so will always provoke
  * parse errors.  It is needed by PL/pgSQL.
  */
-%token <str>	IDENT FCONST SCONST BCONST XCONST Op
+%token <str>	IDENT UIDENT FCONST SCONST USCONST BCONST XCONST Op
 %token <ival>	ICONST PARAM
 %token			TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
 %token			LESS_EQUALS GREATER_EQUALS NOT_EQUALS
@@ -691,7 +691,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	TREAT TRIGGER TRIM TRUE_P
 	TRUNCATE TRUSTED TYPE_P TYPES_P
 
-	UNBOUNDED UNCOMMITTED UNENCRYPTED UNION UNIQUE UNKNOWN UNLISTEN UNLOGGED
+	UESCAPE UNBOUNDED UNCOMMITTED UNENCRYPTED UNION UNIQUE UNKNOWN UNLISTEN UNLOGGED
 	UNTIL UPDATE USER USING
 
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
@@ -15374,6 +15374,7 @@ unreserved_keyword:
 			| TRUSTED
 			| TYPE_P
 			| TYPES_P
+			| UESCAPE
 			| UNBOUNDED
 			| UNCOMMITTED
 			| UNENCRYPTED
diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index bc3f812da8..ca4158d80e 100644
--- a/src/backend/parser/parser.c
+++ b/src/backend/parser/parser.c
@@ -23,6 +23,12 @@
 
 #include "parser/gramparse.h"
 #include "parser/parser.h"
+#include "parser/scansup.h"
+#include "mb/pg_wchar.h"
+
+static bool check_uescapechar(unsigned char escape);
+static char *str_udeescape(char escape, char *str, int position,
+						   core_yyscan_t yyscanner);
 
 
 /*
@@ -75,6 +81,10 @@ raw_parser(const char *str)
  * scanner backtrack, which would cost more performance than this filter
  * layer does.
  *
+ * We also use this filter to convert UIDENT and USCONST sequences into
+ * plain IDENT and SCONST tokens.  While that could be handled by additional
+ * productions in the main grammar, it's more efficient to do it like this.
+ *
  * The filter also provides a convenient place to translate between
  * the core_YYSTYPE and YYSTYPE representations (which are really the
  * same thing anyway, but notationally they're different).
@@ -104,7 +114,7 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 	 * If this token isn't one that requires lookahead, just return it.  If it
 	 * does, determine the token length.  (We could get that via strlen(), but
 	 * since we have such a small set of possibilities, hardwiring seems
-	 * feasible and more efficient.)
+	 * feasible and more efficient --- at least for the fixed-length cases.)
 	 */
 	switch (cur_token)
 	{
@@ -117,6 +127,10 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 		case WITH:
 			cur_token_length = 4;
 			break;
+		case UIDENT:
+		case USCONST:
+			cur_token_length = strlen(yyextra->core_yy_extra.scanbuf + *llocp);
+			break;
 		default:
 			return cur_token;
 	}
@@ -190,7 +204,270 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 					break;
 			}
 			break;
+
+		case UIDENT:
+		case USCONST:
+			/* Look ahead for UESCAPE */
+			if (next_token == UESCAPE)
+			{
+				/* Yup, so get third token, which had better be SCONST */
+				const char *escstr;
+
+				/* Again save and restore *llocp */
+				cur_yylloc = *llocp;
+
+				/* Un-truncate current token so errors point to third token */
+				*(yyextra->lookahead_end) = yyextra->lookahead_hold_char;
+
+				/* Get third token */
+				next_token = core_yylex(&(yyextra->lookahead_yylval),
+										llocp, yyscanner);
+
+				/* If we throw error here, it will point to third token */
+				if (next_token != SCONST)
+					scanner_yyerror("UESCAPE must be followed by a simple string literal",
+									yyscanner);
+
+				escstr = yyextra->lookahead_yylval.str;
+				if (strlen(escstr) != 1 || !check_uescapechar(escstr[0]))
+					scanner_yyerror("invalid Unicode escape character",
+									yyscanner);
+
+				/* Now restore *llocp; errors will point to first token */
+				*llocp = cur_yylloc;
+
+				/* Apply Unicode conversion */
+				lvalp->core_yystype.str =
+					str_udeescape(escstr[0],
+								  lvalp->core_yystype.str,
+								  *llocp,
+								  yyscanner);
+
+				/*
+				 * We don't need to revert the un-truncation of UESCAPE.  What we
+				 * do want to do is clear have_lookahead, thereby consuming
+				 * all three tokens.
+				 */
+				yyextra->have_lookahead = false;
+			}
+			else
+			{
+				/* No UESCAPE, so convert using default escape character */
+				lvalp->core_yystype.str =
+					str_udeescape('\\',
+								  lvalp->core_yystype.str,
+								  *llocp,
+								  yyscanner);
+			}
+
+			if (cur_token == UIDENT)
+			{
+				/* It's an identifier, so truncate as appropriate */
+				truncate_identifier(lvalp->core_yystype.str,
+									strlen(lvalp->core_yystype.str),
+									true);
+				cur_token = IDENT;
+			}
+			else if (cur_token == USCONST)
+			{
+				cur_token = SCONST;
+			}
+			break;
 	}
 
 	return cur_token;
 }
+
+/* convert hex digit (caller should have verified that) to value */
+static unsigned int
+hexval(unsigned char c)
+{
+	if (c >= '0' && c <= '9')
+		return c - '0';
+	if (c >= 'a' && c <= 'f')
+		return c - 'a' + 0xA;
+	if (c >= 'A' && c <= 'F')
+		return c - 'A' + 0xA;
+	elog(ERROR, "invalid hexadecimal digit");
+	return 0;					/* not reached */
+}
+
+/* is Unicode code point acceptable in database's encoding? */
+static void
+check_unicode_value(pg_wchar c, int pos, core_yyscan_t yyscanner)
+{
+	/* See also addunicode() in scan.l */
+	if (c == 0 || c > 0x10FFFF)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("invalid Unicode escape value"),
+				 scanner_errposition(pos, yyscanner)));
+
+	if (c > 0x7F && GetDatabaseEncoding() != PG_UTF8)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8"),
+				 scanner_errposition(pos, yyscanner)));
+}
+
+/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
+static bool
+check_uescapechar(unsigned char escape)
+{
+	if (isxdigit(escape)
+		|| escape == '+'
+		|| escape == '\''
+		|| escape == '"'
+		|| scanner_isspace(escape))
+		return false;
+	else
+		return true;
+}
+
+/* Process Unicode escapes in "str", producing a palloc'd plain string */
+static char *
+str_udeescape(char escape, char *str, int position,
+			  core_yyscan_t yyscanner)
+{
+	char	   *new,
+			   *in,
+			   *out;
+	int			str_length;
+	pg_wchar	pair_first = 0;
+
+	str_length = strlen(str);
+
+	/*
+	 * This relies on the subtle assumption that a UTF-8 expansion cannot be
+	 * longer than its escaped representation.
+	 */
+	new = palloc(str_length + 1);
+
+	in = str;
+	out = new;
+	while (*in)
+	{
+		if (in[0] == escape)
+		{
+			if (in[1] == escape)
+			{
+				if (pair_first)
+					goto invalid_pair;
+				*out++ = escape;
+				in += 2;
+			}
+			else if (isxdigit((unsigned char) in[1]) &&
+					 isxdigit((unsigned char) in[2]) &&
+					 isxdigit((unsigned char) in[3]) &&
+					 isxdigit((unsigned char) in[4]))
+			{
+				pg_wchar	unicode;
+
+				unicode = (hexval(in[1]) << 12) +
+					(hexval(in[2]) << 8) +
+					(hexval(in[3]) << 4) +
+					hexval(in[4]);
+				check_unicode_value(unicode,
+									position + in - str + 3,	/* 3 for U&" */
+									yyscanner);
+				if (pair_first)
+				{
+					if (is_utf16_surrogate_second(unicode))
+					{
+						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+						pair_first = 0;
+					}
+					else
+						goto invalid_pair;
+				}
+				else if (is_utf16_surrogate_second(unicode))
+					goto invalid_pair;
+
+				if (is_utf16_surrogate_first(unicode))
+					pair_first = unicode;
+				else
+				{
+					unicode_to_utf8(unicode, (unsigned char *) out);
+					out += pg_mblen(out);
+				}
+				in += 5;
+			}
+			else if (in[1] == '+' &&
+					 isxdigit((unsigned char) in[2]) &&
+					 isxdigit((unsigned char) in[3]) &&
+					 isxdigit((unsigned char) in[4]) &&
+					 isxdigit((unsigned char) in[5]) &&
+					 isxdigit((unsigned char) in[6]) &&
+					 isxdigit((unsigned char) in[7]))
+			{
+				pg_wchar	unicode;
+
+				unicode = (hexval(in[2]) << 20) +
+					(hexval(in[3]) << 16) +
+					(hexval(in[4]) << 12) +
+					(hexval(in[5]) << 8) +
+					(hexval(in[6]) << 4) +
+					hexval(in[7]);
+				check_unicode_value(unicode,
+									position + in - str + 3,	/* 3 for U&" */
+									yyscanner);
+				if (pair_first)
+				{
+					if (is_utf16_surrogate_second(unicode))
+					{
+						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+						pair_first = 0;
+					}
+					else
+						goto invalid_pair;
+				}
+				else if (is_utf16_surrogate_second(unicode))
+					goto invalid_pair;
+
+				if (is_utf16_surrogate_first(unicode))
+					pair_first = unicode;
+				else
+				{
+					unicode_to_utf8(unicode, (unsigned char *) out);
+					out += pg_mblen(out);
+				}
+				in += 8;
+			}
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("invalid Unicode escape value"),
+						 scanner_errposition(position + in - str + 3,	/* 3 for U&" */
+											 yyscanner)));
+		}
+		else
+		{
+			if (pair_first)
+				goto invalid_pair;
+
+			*out++ = *in++;
+		}
+	}
+
+	/* unfinished surrogate pair? */
+	if (pair_first)
+		goto invalid_pair;
+
+	*out = '\0';
+
+	/*
+	 * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII
+	 * codes; but it's probably not worth the trouble, since this isn't likely
+	 * to be a performance-critical path.
+	 */
+	pg_verifymbstr(new, out - new, false);
+	return new;
+
+invalid_pair:
+	ereport(ERROR,
+			(errcode(ERRCODE_SYNTAX_ERROR),
+			 errmsg("invalid Unicode surrogate pair"),
+			 scanner_errposition(position + in - str + 3,	/* 3 for U&" */
+								 yyscanner)));
+	return NULL;				/* keep compiler quiet */
+}
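
A note on the "subtle assumption" in str_udeescape() above: the output buffer is sized from the escaped input, so each escape form must be at least as long as the UTF-8 it produces. A `\XXXX` escape is 5 bytes and yields at most 3 bytes of UTF-8, a `\+XXXXXX` escape is 8 bytes and yields at most 4, and a surrogate pair written as two `\XXXX` escapes is 10 bytes and yields 4. The following is only a standalone sketch of that arithmetic, not code from the patch; utf8_len() and combine_surrogates() are hypothetical names, and the latter simply copies the formula of surrogate_pair_to_codepoint().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode a code point in UTF-8 (hypothetical helper). */
static size_t
utf8_len(uint32_t cp)
{
	assert(cp <= 0x10FFFF);		/* same upper bound check_unicode_value() enforces */
	if (cp <= 0x7F)
		return 1;
	if (cp <= 0x7FF)
		return 2;
	if (cp <= 0xFFFF)
		return 3;
	return 4;
}

/* Same formula as surrogate_pair_to_codepoint() in the patch. */
static uint32_t
combine_surrogates(uint32_t first, uint32_t second)
{
	return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
}
```

For instance, the pair D83D/DE00 combines to U+1F600, which needs 4 bytes of UTF-8 against 10 bytes of escaped input.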
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index e25e12e461..2d8350aaad 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -13,8 +13,8 @@
  * in the sense that there is always a rule that can match the input
  * consumed so far (the rule action may internally throw back some input
  * with yyless(), however).  As explained in the flex manual, this makes
- * for a useful speed increase --- about a third faster than a plain -CF
- * lexer, in simple testing.  The extra complexity is mostly in the rules
+ * for a useful speed increase --- several percent faster when measuring
+ * raw parsing (Flex + Bison).  The extra complexity is mostly in the rules
  * for handling float numbers and continued string literals.  If you change
  * the lexical rules, verify that you haven't broken the no-backtrack
  * property by running flex with the "-b" option and checking that the
@@ -110,14 +110,9 @@ const uint16 ScanKeywordTokens[] = {
 static void addlit(char *ytext, int yleng, core_yyscan_t yyscanner);
 static void addlitchar(unsigned char ychar, core_yyscan_t yyscanner);
 static char *litbufdup(core_yyscan_t yyscanner);
-static char *litbuf_udeescape(unsigned char escape, core_yyscan_t yyscanner);
 static unsigned char unescape_single_char(unsigned char c, core_yyscan_t yyscanner);
 static int	process_integer_literal(const char *token, YYSTYPE *lval);
-static bool is_utf16_surrogate_first(pg_wchar c);
-static bool is_utf16_surrogate_second(pg_wchar c);
-static pg_wchar surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second);
 static void addunicode(pg_wchar c, yyscan_t yyscanner);
-static bool check_uescapechar(unsigned char escape);
 
 #define yyerror(msg)  scanner_yyerror(msg, yyscanner)
 
@@ -168,12 +163,11 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
- *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
  *  <xus> quoted string with Unicode escapes
- *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
  *  <xeu> Unicode surrogate pair in extended quoted string
  *
  * Remember to add an <<EOF>> case whenever you add a new exclusive state!
@@ -185,12 +179,11 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
-%x xuiend
 %x xus
-%x xusend
 %x xeu
 
 /*
@@ -231,19 +224,18 @@ special_whitespace		({space}+|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
+quote			'
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue	{whitespace_with_newline}{quote}
+
 /*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
  */
-quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+quotecontinuefail	{whitespace}*"-"?
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -304,21 +296,12 @@ xdstop			{dquote}
 xddouble		{dquote}{dquote}
 xdinside		[^"]+
 
-/* Unicode escapes */
-uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
-/* error rule to avoid backup */
-uescapefail		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
-
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
 
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
 
-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1		{uescapefail}?
-xustop2		{uescape}
-
 /* error rule to avoid backup */
 xufailed		[uU]&
 
@@ -476,21 +459,10 @@ other			.
 					startlit();
 					addlitchar('b', yyscanner);
 				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return BCONST;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					addlit(yytext, yyleng, yyscanner);
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					/* ignore */
-				}
 <xb><<EOF>>		{ yyerror("unterminated bit string literal"); }
 
 {xhstart}		{
@@ -505,13 +477,6 @@ other			.
 					startlit();
 					addlitchar('x', yyscanner);
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					yylval->str = litbufdup(yyscanner);
-					return XCONST;
-				}
 <xh><<EOF>>		{ yyerror("unterminated hexadecimal string literal"); }
 
 {xnstart}		{
@@ -568,53 +533,67 @@ other			.
 					BEGIN(xus);
 					startlit();
 				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
+
+<xb,xh,xq,xe,xus>{quote} {
 					/*
-					 * check that the data remains valid if it might have been
-					 * made invalid by unescaping any chars.
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the end quote was in fact
+					 * the end of the string.  To reduce the lexer table size,
+					 * we use a single "xqs" state to do the lookahead for all
+					 * types of strings.
 					 */
-					if (yyextra->saw_non_ascii)
-						pg_verifymbstr(yyextra->literalbuf,
-									   yyextra->literallen,
-									   false);
-					yylval->str = litbufdup(yyscanner);
-					return SCONST;
-				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
-					/* xusend state looks for possible UESCAPE */
-					BEGIN(xusend);
+					yyextra->state_before_str_stop = YYSTATE;
+					BEGIN(xqs);
 				}
-<xusend>{whitespace} {
-					/* stay in xusend state over whitespace */
+<xqs>{quotecontinue} {
+					/*
+					 * Found a quote continuation, so return to the in-quote
+					 * state and continue scanning the literal.
+					 */
+					BEGIN(yyextra->state_before_str_stop);
 				}
-<xusend><<EOF>> |
-<xusend>{other} |
-<xusend>{xustop1} {
-					/* no UESCAPE after the quote, throw back everything */
+<xqs>{quotecontinuefail} |
+<xqs><<EOF>> |
+<xqs>{other}	{
+					/*
+					 * Failed to see a quote continuation.  Throw back
+					 * everything after the end quote, and handle the string
+					 * according to the state we were in previously.
+					 */
 					yyless(0);
 					BEGIN(INITIAL);
-					yylval->str = litbuf_udeescape('\\', yyscanner);
-					return SCONST;
-				}
-<xusend>{xustop2} {
-					/* found UESCAPE after the end quote */
-					BEGIN(INITIAL);
-					if (!check_uescapechar(yytext[yyleng - 2]))
+
+					switch (yyextra->state_before_str_stop)
 					{
-						SET_YYLLOC();
-						ADVANCE_YYLLOC(yyleng - 2);
-						yyerror("invalid Unicode escape character");
+						case xb:
+							yylval->str = litbufdup(yyscanner);
+							return BCONST;
+						case xh:
+							yylval->str = litbufdup(yyscanner);
+							return XCONST;
+						case xq:
+							/* fallthrough */
+						case xe:
+							/*
+							 * Check that the data remains valid if it
+							 * might have been made invalid by unescaping
+							 * any chars.
+							 */
+							if (yyextra->saw_non_ascii)
+								pg_verifymbstr(yyextra->literalbuf,
+											   yyextra->literallen,
+											   false);
+							yylval->str = litbufdup(yyscanner);
+							return SCONST;
+						case xus:
+							yylval->str = litbufdup(yyscanner);
+							return USCONST;
+						default:
+							yyerror("unhandled previous state in xqs");
 					}
-					yylval->str = litbuf_udeescape(yytext[yyleng - 2],
-												   yyscanner);
-					return SCONST;
 				}
+
 <xq,xe,xus>{xqdouble} {
 					addlitchar('\'', yyscanner);
 				}
@@ -693,9 +672,6 @@ other			.
 					if (c == '\0' || IS_HIGHBIT_SET(c))
 						yyextra->saw_non_ascii = true;
 				}
-<xq,xe,xus>{quotecontinue} {
-					/* ignore */
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					addlitchar(yytext[0], yyscanner);
@@ -769,53 +745,13 @@ other			.
 					yylval->str = ident;
 					return IDENT;
 				}
-<xui>{dquote} {
-					yyless(1);
-					/* xuiend state looks for possible UESCAPE */
-					BEGIN(xuiend);
-				}
-<xuiend>{whitespace} {
-					/* stay in xuiend state over whitespace */
-				}
-<xuiend><<EOF>> |
-<xuiend>{other} |
-<xuiend>{xustop1} {
-					/* no UESCAPE after the quote, throw back everything */
-					char	   *ident;
-					int			identlen;
-
-					yyless(0);
-
-					BEGIN(INITIAL);
+<xui>{dquote}	{
 					if (yyextra->literallen == 0)
 						yyerror("zero-length delimited identifier");
-					ident = litbuf_udeescape('\\', yyscanner);
-					identlen = strlen(ident);
-					if (identlen >= NAMEDATALEN)
-						truncate_identifier(ident, identlen, true);
-					yylval->str = ident;
-					return IDENT;
-				}
-<xuiend>{xustop2}	{
-					/* found UESCAPE after the end quote */
-					char	   *ident;
-					int			identlen;
 
 					BEGIN(INITIAL);
-					if (yyextra->literallen == 0)
-						yyerror("zero-length delimited identifier");
-					if (!check_uescapechar(yytext[yyleng - 2]))
-					{
-						SET_YYLLOC();
-						ADVANCE_YYLLOC(yyleng - 2);
-						yyerror("invalid Unicode escape character");
-					}
-					ident = litbuf_udeescape(yytext[yyleng - 2], yyscanner);
-					identlen = strlen(ident);
-					if (identlen >= NAMEDATALEN)
-						truncate_identifier(ident, identlen, true);
-					yylval->str = ident;
-					return IDENT;
+					yylval->str = litbufdup(yyscanner);
+					return UIDENT;
 				}
 <xd,xui>{xddouble}	{
 					addlitchar('"', yyscanner);
@@ -1288,55 +1224,12 @@ process_integer_literal(const char *token, YYSTYPE *lval)
 	return ICONST;
 }
 
-static unsigned int
-hexval(unsigned char c)
-{
-	if (c >= '0' && c <= '9')
-		return c - '0';
-	if (c >= 'a' && c <= 'f')
-		return c - 'a' + 0xA;
-	if (c >= 'A' && c <= 'F')
-		return c - 'A' + 0xA;
-	elog(ERROR, "invalid hexadecimal digit");
-	return 0;					/* not reached */
-}
-
-static void
-check_unicode_value(pg_wchar c, char *loc, core_yyscan_t yyscanner)
-{
-	if (GetDatabaseEncoding() == PG_UTF8)
-		return;
-
-	if (c > 0x7F)
-	{
-		ADVANCE_YYLLOC(loc - yyextra->literalbuf + 3);	/* 3 for U&" */
-		yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8");
-	}
-}
-
-static bool
-is_utf16_surrogate_first(pg_wchar c)
-{
-	return (c >= 0xD800 && c <= 0xDBFF);
-}
-
-static bool
-is_utf16_surrogate_second(pg_wchar c)
-{
-	return (c >= 0xDC00 && c <= 0xDFFF);
-}
-
-static pg_wchar
-surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second)
-{
-	return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
-}
-
 static void
 addunicode(pg_wchar c, core_yyscan_t yyscanner)
 {
 	char		buf[8];
 
+	/* See also check_unicode_value() in parser.c */
 	if (c == 0 || c > 0x10FFFF)
 		yyerror("invalid Unicode escape value");
 	if (c > 0x7F)
@@ -1349,172 +1242,6 @@ addunicode(pg_wchar c, core_yyscan_t yyscanner)
 	addlit(buf, pg_mblen(buf), yyscanner);
 }
 
-/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
-static bool
-check_uescapechar(unsigned char escape)
-{
-	if (isxdigit(escape)
-		|| escape == '+'
-		|| escape == '\''
-		|| escape == '"'
-		|| scanner_isspace(escape))
-	{
-		return false;
-	}
-	else
-		return true;
-}
-
-/* like litbufdup, but handle unicode escapes */
-static char *
-litbuf_udeescape(unsigned char escape, core_yyscan_t yyscanner)
-{
-	char	   *new;
-	char	   *litbuf,
-			   *in,
-			   *out;
-	pg_wchar	pair_first = 0;
-
-	/* Make literalbuf null-terminated to simplify the scanning loop */
-	litbuf = yyextra->literalbuf;
-	litbuf[yyextra->literallen] = '\0';
-
-	/*
-	 * This relies on the subtle assumption that a UTF-8 expansion cannot be
-	 * longer than its escaped representation.
-	 */
-	new = palloc(yyextra->literallen + 1);
-
-	in = litbuf;
-	out = new;
-	while (*in)
-	{
-		if (in[0] == escape)
-		{
-			if (in[1] == escape)
-			{
-				if (pair_first)
-				{
-					ADVANCE_YYLLOC(in - litbuf + 3);	/* 3 for U&" */
-					yyerror("invalid Unicode surrogate pair");
-				}
-				*out++ = escape;
-				in += 2;
-			}
-			else if (isxdigit((unsigned char) in[1]) &&
-					 isxdigit((unsigned char) in[2]) &&
-					 isxdigit((unsigned char) in[3]) &&
-					 isxdigit((unsigned char) in[4]))
-			{
-				pg_wchar	unicode;
-
-				unicode = (hexval(in[1]) << 12) +
-					(hexval(in[2]) << 8) +
-					(hexval(in[3]) << 4) +
-					hexval(in[4]);
-				check_unicode_value(unicode, in, yyscanner);
-				if (pair_first)
-				{
-					if (is_utf16_surrogate_second(unicode))
-					{
-						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
-						pair_first = 0;
-					}
-					else
-					{
-						ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-						yyerror("invalid Unicode surrogate pair");
-					}
-				}
-				else if (is_utf16_surrogate_second(unicode))
-					yyerror("invalid Unicode surrogate pair");
-
-				if (is_utf16_surrogate_first(unicode))
-					pair_first = unicode;
-				else
-				{
-					unicode_to_utf8(unicode, (unsigned char *) out);
-					out += pg_mblen(out);
-				}
-				in += 5;
-			}
-			else if (in[1] == '+' &&
-					 isxdigit((unsigned char) in[2]) &&
-					 isxdigit((unsigned char) in[3]) &&
-					 isxdigit((unsigned char) in[4]) &&
-					 isxdigit((unsigned char) in[5]) &&
-					 isxdigit((unsigned char) in[6]) &&
-					 isxdigit((unsigned char) in[7]))
-			{
-				pg_wchar	unicode;
-
-				unicode = (hexval(in[2]) << 20) +
-					(hexval(in[3]) << 16) +
-					(hexval(in[4]) << 12) +
-					(hexval(in[5]) << 8) +
-					(hexval(in[6]) << 4) +
-					hexval(in[7]);
-				check_unicode_value(unicode, in, yyscanner);
-				if (pair_first)
-				{
-					if (is_utf16_surrogate_second(unicode))
-					{
-						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
-						pair_first = 0;
-					}
-					else
-					{
-						ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-						yyerror("invalid Unicode surrogate pair");
-					}
-				}
-				else if (is_utf16_surrogate_second(unicode))
-					yyerror("invalid Unicode surrogate pair");
-
-				if (is_utf16_surrogate_first(unicode))
-					pair_first = unicode;
-				else
-				{
-					unicode_to_utf8(unicode, (unsigned char *) out);
-					out += pg_mblen(out);
-				}
-				in += 8;
-			}
-			else
-			{
-				ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-				yyerror("invalid Unicode escape value");
-			}
-		}
-		else
-		{
-			if (pair_first)
-			{
-				ADVANCE_YYLLOC(in - litbuf + 3);		/* 3 for U&" */
-				yyerror("invalid Unicode surrogate pair");
-			}
-			*out++ = *in++;
-		}
-	}
-
-	/* unfinished surrogate pair? */
-	if (pair_first)
-	{
-		ADVANCE_YYLLOC(in - litbuf + 3);				/* 3 for U&" */
-		yyerror("invalid Unicode surrogate pair");
-	}
-
-	*out = '\0';
-
-	/*
-	 * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII
-	 * codes; but it's probably not worth the trouble, since this isn't likely
-	 * to be a performance-critical path.
-	 */
-	pg_verifymbstr(new, out - new, false);
-	return new;
-}
-
 static unsigned char
 unescape_single_char(unsigned char c, core_yyscan_t yyscanner)
 {
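
To illustrate what the new <xqs> state decides after an end quote, here is a hand-written sketch of the {quotecontinue} test: whitespace containing at least one newline, followed by another quote. It is a simplification and an assumption on my part, not generated scanner code: it ignores the `--` comments that {whitespace_with_newline} also accepts, and is_quote_continuation() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Given the input just after an end quote, report whether the string
 * literal continues.  Simplified: comments between the quotes are not
 * handled here, although the real lexer allows them.
 */
static bool
is_quote_continuation(const char *after_quote)
{
	bool		saw_newline = false;
	const char *p = after_quote;

	assert(after_quote != NULL);

	/* skip whitespace, remembering whether a newline was crossed */
	for (; *p == ' ' || *p == '\t' || *p == '\f' || *p == '\n' || *p == '\r'; p++)
	{
		if (*p == '\n' || *p == '\r')
			saw_newline = true;
	}

	/* continuation needs both a newline and a following quote */
	return saw_newline && *p == '\'';
}
```

When this would return false, the lexer rule does yyless(0) and returns the token chosen by state_before_str_stop; when true, it goes back to the saved in-quote state.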
diff --git a/src/fe_utils/psqlscan.l b/src/fe_utils/psqlscan.l
index 02cb356f34..7076503951 100644
--- a/src/fe_utils/psqlscan.l
+++ b/src/fe_utils/psqlscan.l
@@ -114,12 +114,11 @@ extern void psql_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
- *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
  *  <xus> quoted string with Unicode escapes
- *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
  *
  * Note: we intentionally don't mimic the backend's <xeu> state; we have
  * no need to distinguish it from <xe> state, and no good way to get out
@@ -132,12 +131,11 @@ extern void psql_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
-%x xuiend
 %x xus
-%x xusend
 
 /*
  * In order to make the world safe for Windows and Mac clients as well as
@@ -177,19 +175,18 @@ special_whitespace		({space}+|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
+quote			'
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue	{whitespace_with_newline}{quote}
+
 /*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
  */
-quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+quotecontinuefail	{whitespace}*"-"?
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -250,21 +247,12 @@ xdstop			{dquote}
 xddouble		{dquote}{dquote}
 xdinside		[^"]+
 
-/* Unicode escapes */
-uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
-/* error rule to avoid backup */
-uescapefail		[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
-
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
 
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
 
-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1		{uescapefail}?
-xustop2		{uescape}
-
 /* error rule to avoid backup */
 xufailed		[uU]&
 
@@ -438,20 +426,10 @@ other			.
 					BEGIN(xb);
 					ECHO;
 				}
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					ECHO;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					ECHO;
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					ECHO;
-				}
 
 {xhstart}		{
 					/* Hexadecimal bit type.
@@ -463,12 +441,6 @@ other			.
 					BEGIN(xh);
 					ECHO;
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					ECHO;
-				}
 
 {xnstart}		{
 					yyless(1);	/* eat only 'n' this time */
@@ -490,32 +462,38 @@ other			.
 					BEGIN(xus);
 					ECHO;
 				}
-<xq,xe>{quotestop}	|
-<xq,xe>{quotefail} {
-					yyless(1);
-					BEGIN(INITIAL);
-					ECHO;
-				}
-<xus>{quotestop} |
-<xus>{quotefail} {
-					/* throw back all but the quote */
-					yyless(1);
-					BEGIN(xusend);
+
+<xb,xh,xq,xe,xus>{quote} {
+					/*
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the end quote was in fact
+					 * the end of the string.  To reduce the lexer table size,
+					 * we use a single "xqs" state to do the lookahead for all
+					 * types of strings.
+					 */
+					cur_state->state_before_str_stop = YYSTATE;
+					BEGIN(xqs);
 					ECHO;
 				}
-<xusend>{whitespace} {
+<xqs>{quotecontinue} {
+					/*
+					 * Found a quote continuation, so return to the in-quote
+					 * state and continue scanning the literal.
+					 */
+					BEGIN(cur_state->state_before_str_stop);
 					ECHO;
 				}
-<xusend>{other} |
-<xusend>{xustop1} {
+<xqs>{quotecontinuefail} |
+<xqs>{other}	{
+					/*
+					 * Failed to see a quote continuation.  Throw back
+					 * everything after the end quote.
+					 */
 					yyless(0);
 					BEGIN(INITIAL);
-					ECHO;
-				}
-<xusend>{xustop2} {
-					BEGIN(INITIAL);
-					ECHO;
 				}
+
 <xq,xe,xus>{xqdouble} {
 					ECHO;
 				}
@@ -540,9 +518,6 @@ other			.
 <xe>{xehexesc}  {
 					ECHO;
 				}
-<xq,xe,xus>{quotecontinue} {
-					ECHO;
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					ECHO;
@@ -599,21 +574,7 @@ other			.
 					BEGIN(INITIAL);
 					ECHO;
 				}
-<xui>{dquote} {
-					yyless(1);
-					BEGIN(xuiend);
-					ECHO;
-				}
-<xuiend>{whitespace} {
-					ECHO;
-				}
-<xuiend>{other} |
-<xuiend>{xustop1} {
-					yyless(0);
-					BEGIN(INITIAL);
-					ECHO;
-				}
-<xuiend>{xustop2}	{
+<xui>{dquote}	{
 					BEGIN(INITIAL);
 					ECHO;
 				}
@@ -1084,8 +1045,7 @@ psql_scan(PsqlScanState state,
 			switch (state->start_state)
 			{
 				case INITIAL:
-				case xuiend:	/* we treat these like INITIAL */
-				case xusend:
+				case xqs:		/* we treat this like INITIAL */
 					if (state->paren_depth > 0)
 					{
 						result = PSCAN_INCOMPLETE;
@@ -1240,7 +1200,8 @@ psql_scan_reselect_sql_lexer(PsqlScanState state)
 bool
 psql_scan_in_quote(PsqlScanState state)
 {
-	return state->start_state != INITIAL;
+	return state->start_state != INITIAL &&
+			state->start_state != xqs;
 }
 
 /*
diff --git a/src/include/fe_utils/psqlscan_int.h b/src/include/fe_utils/psqlscan_int.h
index 98481e6553..311f80394a 100644
--- a/src/include/fe_utils/psqlscan_int.h
+++ b/src/include/fe_utils/psqlscan_int.h
@@ -110,6 +110,7 @@ typedef struct PsqlScanStateData
 	 * and updated with its finishing state on exit.
 	 */
 	int			start_state;	/* yylex's starting/finishing state */
+	int			state_before_str_stop;	/* start cond. before end quote */
 	int			paren_depth;	/* depth of nesting in parentheses */
 	int			xcdepth;		/* depth of nesting in slash-star comments */
 	char	   *dolqstart;		/* current $foo$ quote start string */
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 07ebc6365b..9a0cfe9a08 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -508,6 +508,27 @@ typedef uint32 (*utf_local_conversion_func) (uint32 code);
 								   (destencoding))
 
 
+/*
+ * Some handy functions for Unicode-specific tests.
+ */
+static inline bool
+is_utf16_surrogate_first(pg_wchar c)
+{
+	return (c >= 0xD800 && c <= 0xDBFF);
+}
+
+static inline bool
+is_utf16_surrogate_second(pg_wchar c)
+{
+	return (c >= 0xDC00 && c <= 0xDFFF);
+}
+
+static inline pg_wchar
+surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second)
+{
+	return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
+}
+
 /*
  * These functions are considered part of libpq's exported API and
  * are also declared in libpq-fe.h.
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 0fe4e6cb20..9097f6748b 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -416,6 +416,7 @@ PG_KEYWORD("truncate", TRUNCATE, UNRESERVED_KEYWORD)
 PG_KEYWORD("trusted", TRUSTED, UNRESERVED_KEYWORD)
 PG_KEYWORD("type", TYPE_P, UNRESERVED_KEYWORD)
 PG_KEYWORD("types", TYPES_P, UNRESERVED_KEYWORD)
+PG_KEYWORD("uescape", UESCAPE, UNRESERVED_KEYWORD)
 PG_KEYWORD("unbounded", UNBOUNDED, UNRESERVED_KEYWORD)
 PG_KEYWORD("uncommitted", UNCOMMITTED, UNRESERVED_KEYWORD)
 PG_KEYWORD("unencrypted", UNENCRYPTED, UNRESERVED_KEYWORD)
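
With UESCAPE now an ordinary keyword, the filter in parser.c has to fold "UIDENT/USCONST [UESCAPE SCONST]" back into a plain IDENT/SCONST. The sketch below models only that folding on an array of tokens. It is not the patch's code: the Tok enum, the Token struct and filter_tokens() are hypothetical, the de-escaping itself is left out, and the error condition (UESCAPE not followed by a one-character string literal) is my assumption about what should be rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum
{
	T_IDENT, T_UIDENT, T_SCONST, T_USCONST, T_UESCAPE, T_OTHER
} Tok;

typedef struct
{
	Tok			type;
	const char *str;
} Token;

/*
 * Copy in[0..n) to out[], folding U& tokens.  esc[i] receives the escape
 * character chosen for out[i], or 0 if the token was not a U& token.
 * Both out and esc must have room for n entries.
 * Returns the number of output tokens, or (size_t) -1 on error.
 */
static size_t
filter_tokens(const Token *in, size_t n, Token *out, char *esc)
{
	size_t		i = 0;
	size_t		m = 0;

	while (i < n)
	{
		Token		t = in[i++];

		esc[m] = 0;
		if (t.type == T_UIDENT || t.type == T_USCONST)
		{
			esc[m] = '\\';		/* default escape character */
			if (i < n && in[i].type == T_UESCAPE)
			{
				/* UESCAPE must be followed by a one-character literal */
				if (i + 1 >= n || in[i + 1].type != T_SCONST ||
					strlen(in[i + 1].str) != 1)
					return (size_t) -1;
				esc[m] = in[i + 1].str[0];
				i += 2;			/* consume all three tokens */
			}
			t.type = (t.type == T_UIDENT) ? T_IDENT : T_SCONST;
		}
		out[m++] = t;
	}
	return m;
}
```

The real filter works on one token of lookahead from the lexer rather than an array, and also has to save and restore yylloc so that errors point at the first token.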
diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h
index da729fc42b..7a0e5e5d98 100644
--- a/src/include/parser/scanner.h
+++ b/src/include/parser/scanner.h
@@ -48,7 +48,7 @@ typedef union core_YYSTYPE
  * However, those are not defined in this file, because bison insists on
  * defining them for itself.  The token codes used by the core scanner are
  * the ASCII characters plus these:
- *	%token <str>	IDENT FCONST SCONST BCONST XCONST Op
+ *	%token <str>	IDENT UIDENT FCONST SCONST USCONST BCONST XCONST Op
  *	%token <ival>	ICONST PARAM
  *	%token			TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
  *	%token			LESS_EQUALS GREATER_EQUALS NOT_EQUALS
@@ -99,6 +99,7 @@ typedef struct core_yy_extra_type
 	int			literallen;		/* actual current string length */
 	int			literalalloc;	/* current allocated buffer size */
 
+	int			state_before_str_stop;	/* start cond. before end quote */
 	int			xcdepth;		/* depth of nesting in slash-star comments */
 	char	   *dolqstart;		/* current $foo$ quote start string */
 
diff --git a/src/interfaces/ecpg/preproc/ecpg.tokens b/src/interfaces/ecpg/preproc/ecpg.tokens
index 1d613af02f..8e0527fdb7 100644
--- a/src/interfaces/ecpg/preproc/ecpg.tokens
+++ b/src/interfaces/ecpg/preproc/ecpg.tokens
@@ -24,4 +24,3 @@
                 S_TYPEDEF
 
 %token CSTRING CVARIABLE CPP_LINE IP
-%token DOLCONST ECONST NCONST UCONST UIDENT
diff --git a/src/interfaces/ecpg/preproc/ecpg.trailer b/src/interfaces/ecpg/preproc/ecpg.trailer
index f58b41e675..784d1d199e 100644
--- a/src/interfaces/ecpg/preproc/ecpg.trailer
+++ b/src/interfaces/ecpg/preproc/ecpg.trailer
@@ -1719,46 +1719,13 @@ ecpg_bconst:	BCONST		{ $$ = make_name(); } ;
 
 ecpg_fconst:	FCONST		{ $$ = make_name(); } ;
 
-ecpg_sconst:
-		SCONST
-		{
-			/* could have been input as '' or $$ */
-			$$ = (char *)mm_alloc(strlen($1) + 3);
-			$$[0]='\'';
-			strcpy($$+1, $1);
-			$$[strlen($1)+1]='\'';
-			$$[strlen($1)+2]='\0';
-			free($1);
-		}
-		| ECONST
-		{
-			$$ = (char *)mm_alloc(strlen($1) + 4);
-			$$[0]='E';
-			$$[1]='\'';
-			strcpy($$+2, $1);
-			$$[strlen($1)+2]='\'';
-			$$[strlen($1)+3]='\0';
-			free($1);
-		}
-		| NCONST
-		{
-			$$ = (char *)mm_alloc(strlen($1) + 4);
-			$$[0]='N';
-			$$[1]='\'';
-			strcpy($$+2, $1);
-			$$[strlen($1)+2]='\'';
-			$$[strlen($1)+3]='\0';
-			free($1);
-		}
-		| UCONST	{ $$ = $1; }
-		| DOLCONST	{ $$ = $1; }
+ecpg_sconst:	SCONST		{ $$ = $1; }
 		;
 
 ecpg_xconst:	XCONST		{ $$ = make_name(); } ;
 
-ecpg_ident:	IDENT		{ $$ = make_name(); }
+ecpg_ident:	IDENT		{ $$ = $1; }
 		| CSTRING	{ $$ = make3_str(mm_strdup("\""), $1, mm_strdup("\"")); }
-		| UIDENT	{ $$ = $1; }
 		;
 
 quoted_ident_stringvar: name
diff --git a/src/interfaces/ecpg/preproc/ecpg.type b/src/interfaces/ecpg/preproc/ecpg.type
index 9497b91b9d..ffafa82af9 100644
--- a/src/interfaces/ecpg/preproc/ecpg.type
+++ b/src/interfaces/ecpg/preproc/ecpg.type
@@ -122,12 +122,8 @@
 %type <str> CSTRING
 %type <str> CPP_LINE
 %type <str> CVARIABLE
-%type <str> DOLCONST
-%type <str> ECONST
-%type <str> NCONST
 %type <str> SCONST
-%type <str> UCONST
-%type <str> UIDENT
+%type <str> IDENT
 
 %type  <struct_union> s_struct_union_symbol
 
diff --git a/src/interfaces/ecpg/preproc/parse.pl b/src/interfaces/ecpg/preproc/parse.pl
index 7d6c70dcf4..1a76b2d326 100644
--- a/src/interfaces/ecpg/preproc/parse.pl
+++ b/src/interfaces/ecpg/preproc/parse.pl
@@ -218,8 +218,8 @@ sub main
 				if ($a eq 'IDENT' && $prior eq '%nonassoc')
 				{
 
-					# add two more tokens to the list
-					$str = $str . "\n%nonassoc CSTRING\n%nonassoc UIDENT";
+					# add more tokens to the list
+					$str = $str . "\n%nonassoc CSTRING";
 				}
 				$prior = $a;
 			}
diff --git a/src/interfaces/ecpg/preproc/parser.c b/src/interfaces/ecpg/preproc/parser.c
index c27de59828..4e071c788f 100644
--- a/src/interfaces/ecpg/preproc/parser.c
+++ b/src/interfaces/ecpg/preproc/parser.c
@@ -6,6 +6,9 @@
  * This should match src/backend/parser/parser.c, except that we do not
  * need to bother with re-entrant interfaces.
  *
+ * Note: ECPG doesn't report error location like the backend does.
+ * This file will need work if we ever want it to.
+ *
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -27,8 +30,9 @@ static int	lookahead_token;	/* one-token lookahead */
 static YYSTYPE lookahead_yylval;	/* yylval for lookahead token */
 static YYLTYPE lookahead_yylloc;	/* yylloc for lookahead token */
 static char *lookahead_yytext;	/* start current token */
-static char *lookahead_end;		/* end of current token */
-static char lookahead_hold_char;	/* to be put back at *lookahead_end */
+
+static bool check_uescapechar(unsigned char escape);
+static bool ecpg_isspace(char ch);
 
 
 /*
@@ -43,13 +47,16 @@ static char lookahead_hold_char;	/* to be put back at *lookahead_end */
  * words.  Furthermore it's not clear how to do that without re-introducing
  * scanner backtrack, which would cost more performance than this filter
  * layer does.
+ *
+ * We also use this filter to convert UIDENT and USCONST sequences into
+ * plain IDENT and SCONST tokens.  While that could be handled by additional
+ * productions in the main grammar, it's more efficient to do it like this.
  */
 int
 filtered_base_yylex(void)
 {
 	int			cur_token;
 	int			next_token;
-	int			cur_token_length;
 	YYSTYPE		cur_yylval;
 	YYLTYPE		cur_yylloc;
 	char	   *cur_yytext;
@@ -61,41 +68,26 @@ filtered_base_yylex(void)
 		base_yylval = lookahead_yylval;
 		base_yylloc = lookahead_yylloc;
 		base_yytext = lookahead_yytext;
-		*lookahead_end = lookahead_hold_char;
 		have_lookahead = false;
 	}
 	else
 		cur_token = base_yylex();
 
 	/*
-	 * If this token isn't one that requires lookahead, just return it.  If it
-	 * does, determine the token length.  (We could get that via strlen(), but
-	 * since we have such a small set of possibilities, hardwiring seems
-	 * feasible and more efficient.)
+	 * If this token isn't one that requires lookahead, just return it.
 	 */
 	switch (cur_token)
 	{
 		case NOT:
-			cur_token_length = 3;
-			break;
 		case NULLS_P:
-			cur_token_length = 5;
-			break;
 		case WITH:
-			cur_token_length = 4;
+		case UIDENT:
+		case USCONST:
 			break;
 		default:
 			return cur_token;
 	}
 
-	/*
-	 * Identify end+1 of current token.  base_yylex() has temporarily stored a
-	 * '\0' here, and will undo that when we call it again.  We need to redo
-	 * it to fully revert the lookahead call for error reporting purposes.
-	 */
-	lookahead_end = base_yytext + cur_token_length;
-	Assert(*lookahead_end == '\0');
-
 	/* Save and restore lexer output variables around the call */
 	cur_yylval = base_yylval;
 	cur_yylloc = base_yylloc;
@@ -113,10 +105,6 @@ filtered_base_yylex(void)
 	base_yylloc = cur_yylloc;
 	base_yytext = cur_yytext;
 
-	/* Now revert the un-truncation of the current token */
-	lookahead_hold_char = *lookahead_end;
-	*lookahead_end = '\0';
-
 	have_lookahead = true;
 
 	/* Replace cur_token if needed, based on lookahead */
@@ -157,7 +145,83 @@ filtered_base_yylex(void)
 					break;
 			}
 			break;
+		case UIDENT:
+		case USCONST:
+			/* Look ahead for UESCAPE */
+			if (next_token == UESCAPE)
+			{
+				/* Yup, so get third token, which had better be SCONST */
+				const char *escstr;
+
+				/* Again save and restore lexer output variables around the call */
+				cur_yylval = base_yylval;
+				cur_yylloc = base_yylloc;
+				cur_yytext = base_yytext;
+
+				/* Get third token */
+				next_token = base_yylex();
+
+				if (next_token != SCONST)
+					mmerror(PARSE_ERROR, ET_ERROR, "UESCAPE must be followed by a simple string literal");
+
+				/* Save and check escape string, which the scanner returns with quotes */
+				escstr = base_yylval.str;
+				if (strlen(escstr) != 3 || !check_uescapechar(escstr[1]))
+					mmerror(PARSE_ERROR, ET_ERROR, "invalid Unicode escape character");
+
+				base_yylval = cur_yylval;
+				base_yylloc = cur_yylloc;
+				base_yytext = cur_yytext;
+
+				/* Combine 3 tokens into 1 */
+				base_yylval.str = psprintf("%s UESCAPE %s", base_yylval.str, escstr);
+
+				/*
+				 * Clear have_lookahead, thereby consuming all three tokens.
+				 */
+				have_lookahead = false;
+			}
+
+			if (cur_token == UIDENT)
+				cur_token = IDENT;
+			else if (cur_token == USCONST)
+				cur_token = SCONST;
+			break;
 	}
 
 	return cur_token;
 }
+
+/*
+ * check_uescapechar() and ecpg_isspace() should match their equivalents
+ * in pgc.l.
+ */
+
+/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
+static bool
+check_uescapechar(unsigned char escape)
+{
+	if (isxdigit(escape)
+		|| escape == '+'
+		|| escape == '\''
+		|| escape == '"'
+		|| ecpg_isspace(escape))
+		return false;
+	else
+		return true;
+}
+
+/*
+ * ecpg_isspace() --- return true if flex scanner considers char whitespace
+ */
+static bool
+ecpg_isspace(char ch)
+{
+	if (ch == ' ' ||
+		ch == '\t' ||
+		ch == '\n' ||
+		ch == '\r' ||
+		ch == '\f')
+		return true;
+	return false;
+}
diff --git a/src/interfaces/ecpg/preproc/pgc.l b/src/interfaces/ecpg/preproc/pgc.l
index 0385fde719..daca8d9dd9 100644
--- a/src/interfaces/ecpg/preproc/pgc.l
+++ b/src/interfaces/ecpg/preproc/pgc.l
@@ -6,6 +6,9 @@
  *
  * This is a modified version of src/backend/parser/scan.l
  *
+ * The ecpg scanner is not backup-free, so the fail rules are
+ * only here to simplify syncing this file with scan.l.
+ *
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -61,7 +64,10 @@ static bool isdefine(void);
 static bool isinformixdefine(void);
 
 char *token_start;
-static int state_before;
+
+/* vars to keep track of start conditions when scanning literals */
+static int state_before_str_start;
+static int state_before_str_stop;
 
 struct _yy_buffer
 {
@@ -112,6 +118,7 @@ static struct _if_value
  *  <xh> hexadecimal numeric string
  *  <xn> national character quoted strings
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xqc> single-quoted strings in C
  *  <xdolq> $foo$ quoted strings
@@ -120,6 +127,9 @@ static struct _if_value
  *  <xcond> condition of an EXEC SQL IFDEF construct
  *  <xskip> skipping the inactive part of an EXEC SQL IFDEF construct
  *
+ * Note: we intentionally don't mimic the backend's <xeu> state; we have
+ * no need to distinguish it from <xe> state.
+ *
  * Remember to add an <<EOF>> case whenever you add a new exclusive state!
  * The default one is probably not the right thing.
  */
@@ -132,6 +142,7 @@ static struct _if_value
 %x xh
 %x xn
 %x xq
+%x xqs
 %x xe
 %x xqc
 %x xdolq
@@ -181,9 +192,17 @@ horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{whitespace}*)
 
 quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue	{whitespace_with_newline}{quote}
+
+/*
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
+ */
+quotecontinuefail	{whitespace}*"-"?
 
 /* Bit string
  */
@@ -237,19 +256,11 @@ xdstop			{dquote}
 xddouble		{dquote}{dquote}
 xdinside		[^"]+
 
-/* Unicode escapes */
-/* (The ecpg scanner is not backup-free, so the fail rules in scan.l are
- * not needed here, but could be added if desired.)
- */
-uescape			[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
-
 /* Quoted identifier with Unicode escapes */
 xuistart		[uU]&{dquote}
-xuistop			{dquote}({whitespace}*{uescape})?
 
 /* Quoted string with Unicode escapes */
 xusstart		[uU]&{quote}
-xusstop			{quote}({whitespace}*{uescape})?
 
 /* special stuff for C strings */
 xdcqq			\\\\
@@ -411,7 +422,7 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 
 {xcstart}		{
 					token_start = yytext;
-					state_before = YYSTATE;
+					state_before_str_start = YYSTATE;
 					xcdepth = 0;
 					BEGIN(xcsql);
 					/* Put back any characters past slash-star; see above */
@@ -422,7 +433,7 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 
 <C>{xcstart}	{
 					token_start = yytext;
-					state_before = YYSTATE;
+					state_before_str_start = YYSTATE;
 					xcdepth = 0;
 					BEGIN(xcc);
 					/* Put back any characters past slash-star; see above */
@@ -440,7 +451,7 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 					if (xcdepth <= 0)
 					{
 						ECHO;
-						BEGIN(state_before);
+						BEGIN(state_before_str_start);
 						token_start = NULL;
 					}
 					else
@@ -451,7 +462,7 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 				}
 <xcc>{xcstop}	{
 					ECHO;
-					BEGIN(state_before);
+					BEGIN(state_before_str_start);
 					token_start = NULL;
 				}
 
@@ -482,23 +493,10 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 				}
 } /* <SQL> */
 
-<xb>{quotestop}	|
-<xb>{quotefail} {
-					yyless(1);
-					BEGIN(SQL);
-					if (literalbuf[strspn(literalbuf, "01") + 1] != '\0')
-						mmerror(PARSE_ERROR, ET_ERROR, "invalid bit string literal");
-					base_yylval.str = mm_strdup(literalbuf);
-					return BCONST;
-				}
 <xh>{xhinside}	|
 <xb>{xbinside}	{
 					addlit(yytext, yyleng);
 				}
-<xh>{quotecontinue}	|
-<xb>{quotecontinue}	{
-					/* ignore */
-				}
 <xb><<EOF>>		{ mmfatal(PARSE_ERROR, "unterminated bit string literal"); }
 
 <SQL>{xhstart}	{
@@ -507,19 +505,11 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 					startlit();
 					addlitchar('x');
 				}
-<xh>{quotestop}	|
-<xh>{quotefail} {
-					yyless(1);
-					BEGIN(SQL);
-					base_yylval.str = mm_strdup(literalbuf);
-					return XCONST;
-				}
-
 <xh><<EOF>>		{ mmfatal(PARSE_ERROR, "unterminated hexadecimal string literal"); }
 
 <C>{xqstart}	{
 					token_start = yytext;
-					state_before = YYSTATE;
+					state_before_str_start = YYSTATE;
 					BEGIN(xqc);
 					startlit();
 				}
@@ -530,59 +520,90 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 					 * Transfer it as-is to the backend.
 					 */
 					token_start = yytext;
-					state_before = YYSTATE;
+					state_before_str_start = YYSTATE;
 					BEGIN(xn);
 					startlit();
 				}
 
 {xqstart}		{
 					token_start = yytext;
-					state_before = YYSTATE;
+					state_before_str_start = YYSTATE;
 					BEGIN(xq);
 					startlit();
 				}
 {xestart}		{
 					token_start = yytext;
-					state_before = YYSTATE;
+					state_before_str_start = YYSTATE;
 					BEGIN(xe);
 					startlit();
 				}
 {xusstart}		{
 					token_start = yytext;
-					state_before = YYSTATE;
+					state_before_str_start = YYSTATE;
 					BEGIN(xus);
 					startlit();
-					addlit(yytext, yyleng);
 				}
 } /* <SQL> */
 
-<xq,xqc>{quotestop} |
-<xq,xqc>{quotefail} {
-					yyless(1);
-					BEGIN(state_before);
-					base_yylval.str = mm_strdup(literalbuf);
-					return SCONST;
-				}
-<xe>{quotestop} |
-<xe>{quotefail} {
-					yyless(1);
-					BEGIN(state_before);
-					base_yylval.str = mm_strdup(literalbuf);
-					return ECONST;
+<xb,xh,xq,xqc,xe,xn,xus>{quote} {
+					/*
+					 * When we are scanning a quoted string and see an end
+					 * quote, we must look ahead for a possible continuation.
+					 * If we don't see one, we know the end quote was in fact
+					 * the end of the string.  To reduce the lexer table size,
+					 * we use a single "xqs" state to do the lookahead for all
+					 * types of strings.
+					 */
+					state_before_str_stop = YYSTATE;
+					BEGIN(xqs);
 				}
-<xn>{quotestop} |
-<xn>{quotefail} {
-					yyless(1);
-					BEGIN(state_before);
-					base_yylval.str = mm_strdup(literalbuf);
-					return NCONST;
+<xqs>{quotecontinue} {
+					/*
+					 * Found a quote continuation, so return to the in-quote
+					 * state and continue scanning the literal.
+					 */
+					BEGIN(state_before_str_stop);
 				}
-<xus>{xusstop} {
-					addlit(yytext, yyleng);
-					BEGIN(state_before);
-					base_yylval.str = mm_strdup(literalbuf);
-					return UCONST;
+<xqs>{quotecontinuefail} |
+<xqs><<EOF>> |
+<xqs>{other}	{
+					/*
+					 * Failed to see a quote continuation.  Throw back
+					 * everything after the end quote, and handle the string
+					 * according to the state we were in previously.
+					 */
+					yyless(0);
+					BEGIN(state_before_str_start);
+
+					switch (state_before_str_stop)
+					{
+						case xb:
+							if (literalbuf[strspn(literalbuf, "01") + 1] != '\0')
+								mmerror(PARSE_ERROR, ET_ERROR, "invalid bit string literal");
+							base_yylval.str = mm_strdup(literalbuf);
+							return BCONST;
+						case xh:
+							base_yylval.str = mm_strdup(literalbuf);
+							return XCONST;
+						case xq:
+							/* fallthrough */
+						case xqc:
+							base_yylval.str = psprintf("'%s'", literalbuf);
+							return SCONST;
+						case xe:
+							base_yylval.str = psprintf("E'%s'", literalbuf);
+							return SCONST;
+						case xn:
+							base_yylval.str = psprintf("N'%s'", literalbuf);
+							return SCONST;
+						case xus:
+							base_yylval.str = psprintf("U&'%s'", literalbuf);
+							return USCONST;
+						default:
+							mmfatal(PARSE_ERROR, "unhandled previous state in xqs\n");
+					}
 				}
+
 <xq,xe,xn,xus>{xqdouble}	{ addlitchar('\''); }
 <xqc>{xqcquote}	{
 					addlitchar('\\');
@@ -604,9 +625,6 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 <xe>{xehexesc}  {
 					addlit(yytext, yyleng);
 				}
-<xq,xqc,xe,xn,xus>{quotecontinue}	{
-					/* ignore */
-				}
 <xe>.			{
 					/* This is only needed for \ just before EOF */
 					addlitchar(yytext[0]);
@@ -639,7 +657,7 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 						dolqstart = NULL;
 						BEGIN(SQL);
 						base_yylval.str = mm_strdup(literalbuf);
-						return DOLCONST;
+						return SCONST;
 					}
 					else
 					{
@@ -666,20 +684,19 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 
 <SQL>{
 {xdstart}		{
-					state_before = YYSTATE;
+					state_before_str_start = YYSTATE;
 					BEGIN(xd);
 					startlit();
 				}
 {xuistart}		{
-					state_before = YYSTATE;
+					state_before_str_start = YYSTATE;
 					BEGIN(xui);
 					startlit();
-					addlit(yytext, yyleng);
 				}
 } /* <SQL> */
 
 <xd>{xdstop}	{
-					BEGIN(state_before);
+					BEGIN(state_before_str_start);
 					if (literallen == 0)
 						mmerror(PARSE_ERROR, ET_ERROR, "zero-length delimited identifier");
 					/* The backend will truncate the identifier here. We do not as it does not change the result. */
@@ -687,17 +704,16 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 					return CSTRING;
 				}
 <xdc>{xdstop}	{
-					BEGIN(state_before);
+					BEGIN(state_before_str_start);
 					base_yylval.str = mm_strdup(literalbuf);
 					return CSTRING;
 				}
-<xui>{xuistop}	{
-					BEGIN(state_before);
+<xui>{dquote}	{
+					BEGIN(state_before_str_start);
 					if (literallen == 2) /* "U&" */
 						mmerror(PARSE_ERROR, ET_ERROR, "zero-length delimited identifier");
 					/* The backend will truncate the identifier here. We do not as it does not change the result. */
-					addlit(yytext, yyleng);
-					base_yylval.str = mm_strdup(literalbuf);
+					base_yylval.str = psprintf("U&\"%s\"", literalbuf);
 					return UIDENT;
 				}
 <xd,xui>{xddouble}	{
@@ -708,7 +724,7 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 				}
 <xd,xui><<EOF>>	{ mmfatal(PARSE_ERROR, "unterminated quoted identifier"); }
 <C>{xdstart}	{
-					state_before = YYSTATE;
+					state_before_str_start = YYSTATE;
 					BEGIN(xdc);
 					startlit();
 				}
diff --git a/src/interfaces/ecpg/test/expected/preproc-strings.c b/src/interfaces/ecpg/test/expected/preproc-strings.c
index 2053443e81..e695007b13 100644
--- a/src/interfaces/ecpg/test/expected/preproc-strings.c
+++ b/src/interfaces/ecpg/test/expected/preproc-strings.c
@@ -45,7 +45,7 @@ int main(void)
 #line 13 "strings.pgc"
 
 
-  { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "select 'abcdef' , N'abcdef' as foo , E'abc\\bdef' as \"foo\" , U&'d\\0061t\\0061' as U&\"foo\" , U&'d!+000061t!+000061' uescape '!' , $foo$abc$def$foo$", ECPGt_EOIT, 
+  { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "select 'abcdef' , N'abcdef' as foo , E'abc\\bdef' as \"foo\" , U&'d\\0061t\\0061' as U&\"foo\" , U&'d!+000061t!+000061' UESCAPE '!' , $foo$abc$def$foo$", ECPGt_EOIT, 
 	ECPGt_char,&(s1),(long)0,(long)1,(1)*sizeof(char), 
 	ECPGt_NO_INDICATOR, NULL , 0L, 0L, 0L, 
 	ECPGt_char,&(s2),(long)0,(long)1,(1)*sizeof(char), 
diff --git a/src/interfaces/ecpg/test/expected/preproc-strings.stderr b/src/interfaces/ecpg/test/expected/preproc-strings.stderr
index 0478fd84ae..dbc9e5c0b8 100644
--- a/src/interfaces/ecpg/test/expected/preproc-strings.stderr
+++ b/src/interfaces/ecpg/test/expected/preproc-strings.stderr
@@ -8,7 +8,7 @@
 [NO_PID]: sqlca: code: 0, state: 00000
 [NO_PID]: ecpg_process_output on line 13: OK: SET
 [NO_PID]: sqlca: code: 0, state: 00000
-[NO_PID]: ecpg_execute on line 15: query: select 'abcdef' , N'abcdef' as foo , E'abc\bdef' as "foo" , U&'d\0061t\0061' as U&"foo" , U&'d!+000061t!+000061' uescape '!' , $foo$abc$def$foo$; with 0 parameter(s) on connection ecpg1_regression
+[NO_PID]: ecpg_execute on line 15: query: select 'abcdef' , N'abcdef' as foo , E'abc\bdef' as "foo" , U&'d\0061t\0061' as U&"foo" , U&'d!+000061t!+000061' UESCAPE '!' , $foo$abc$def$foo$; with 0 parameter(s) on connection ecpg1_regression
 [NO_PID]: sqlca: code: 0, state: 00000
 [NO_PID]: ecpg_execute on line 15: using PQexec
 [NO_PID]: sqlca: code: 0, state: 00000
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index ef0a5d5d16..6778d0e771 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -232,7 +232,7 @@ static	void			check_raise_parameters(PLpgSQL_stmt_raise *stmt);
  * Some of these are not directly referenced in this file, but they must be
  * here anyway.
  */
-%token <str>	IDENT FCONST SCONST BCONST XCONST Op
+%token <str>	IDENT UIDENT FCONST SCONST USCONST BCONST XCONST Op
 %token <ival>	ICONST PARAM
 %token			TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
 %token			LESS_EQUALS GREATER_EQUALS NOT_EQUALS
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 6d96843e5b..60cb86193c 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -48,17 +48,21 @@ SELECT 'tricky' AS U&"\" UESCAPE '!';
 (1 row)
 
 SELECT U&'wrong: \061';
-ERROR:  invalid Unicode escape value at or near "\061'"
+ERROR:  invalid Unicode escape value
 LINE 1: SELECT U&'wrong: \061';
                          ^
 SELECT U&'wrong: \+0061';
-ERROR:  invalid Unicode escape value at or near "\+0061'"
+ERROR:  invalid Unicode escape value
 LINE 1: SELECT U&'wrong: \+0061';
                          ^
+SELECT U&'wrong: +0061' UESCAPE +;
+ERROR:  UESCAPE must be followed by a simple string literal at or near "+"
+LINE 1: SELECT U&'wrong: +0061' UESCAPE +;
+                                        ^
 SELECT U&'wrong: +0061' UESCAPE '+';
-ERROR:  invalid Unicode escape character at or near "+'"
+ERROR:  invalid Unicode escape character at or near "'+'"
 LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
-                                         ^
+                                        ^
 SET standard_conforming_strings TO off;
 SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
 ERROR:  unsafe use of string constant with Unicode escapes
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index 0afb94964b..c5cd15142a 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -27,6 +27,7 @@ SELECT 'tricky' AS U&"\" UESCAPE '!';
 
 SELECT U&'wrong: \061';
 SELECT U&'wrong: \+0061';
+SELECT U&'wrong: +0061' UESCAPE +;
 SELECT U&'wrong: +0061' UESCAPE '+';
 
 SET standard_conforming_strings TO off;
-- 
2.22.0

Attachment: v11-0002-Merge-ECPG-scanner-states-regarding-C-comments.patch (application/octet-stream)

From 92c1a5195149da907ec5aab9b9e1f0518e440bc6 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Mon, 13 Jan 2020 17:57:25 +0800
Subject: [PATCH v11 2/2] Merge ECPG scanner states regarding C comments.

Previously, there were different start conditions for C-style comments
used in SQL and in C, since those have different rules regarding nested
comments. Since we already have the ability to keep track of the previous
start condition, use this to handle the different cases within a single
start condition. This matches the core scanner more closely.
---
 src/interfaces/ecpg/preproc/pgc.l | 74 ++++++++++++++++---------------
 1 file changed, 38 insertions(+), 36 deletions(-)

diff --git a/src/interfaces/ecpg/preproc/pgc.l b/src/interfaces/ecpg/preproc/pgc.l
index daca8d9dd9..208ccd6c94 100644
--- a/src/interfaces/ecpg/preproc/pgc.l
+++ b/src/interfaces/ecpg/preproc/pgc.l
@@ -111,8 +111,7 @@ static struct _if_value
  * and to eliminate parsing troubles for numeric strings.
  * Exclusive states:
  *  <xb> bit string literal
- *  <xcc> extended C-style comments in C
- *  <xcsql> extended C-style comments in SQL
+ *  <xc> extended C-style comments
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xdc> double-quoted strings in C
  *  <xh> hexadecimal numeric string
@@ -135,8 +134,7 @@ static struct _if_value
  */
 
 %x xb
-%x xcc
-%x xcsql
+%x xc
 %x xd
 %x xdc
 %x xh
@@ -419,54 +417,58 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 {whitespace}	{
 					/* ignore */
 				}
+} /* <SQL> */
 
+<C,SQL>{
 {xcstart}		{
 					token_start = yytext;
 					state_before_str_start = YYSTATE;
 					xcdepth = 0;
-					BEGIN(xcsql);
+					BEGIN(xc);
 					/* Put back any characters past slash-star; see above */
 					yyless(2);
 					fputs("/*", yyout);
 				}
-} /* <SQL> */
+} /* <C,SQL> */
 
-<C>{xcstart}	{
-					token_start = yytext;
-					state_before_str_start = YYSTATE;
-					xcdepth = 0;
-					BEGIN(xcc);
-					/* Put back any characters past slash-star; see above */
-					yyless(2);
-					fputs("/*", yyout);
-				}
-<xcc>{xcstart}	{ ECHO; }
-<xcsql>{xcstart}	{
-					xcdepth++;
-					/* Put back any characters past slash-star; see above */
-					yyless(2);
-					fputs("/_*", yyout);
-				}
-<xcsql>{xcstop}	{
-					if (xcdepth <= 0)
+<xc>{
+{xcstart}		{
+					if (state_before_str_start == SQL)
 					{
-						ECHO;
-						BEGIN(state_before_str_start);
-						token_start = NULL;
+						xcdepth++;
+						/* Put back any characters past slash-star; see above */
+						yyless(2);
+						fputs("/_*", yyout);
 					}
-					else
+					else if (state_before_str_start == C)
 					{
-						xcdepth--;
-						fputs("*_/", yyout);
+						ECHO;
 					}
 				}
-<xcc>{xcstop}	{
-					ECHO;
-					BEGIN(state_before_str_start);
-					token_start = NULL;
+
+{xcstop}		{
+					if (state_before_str_start == SQL)
+					{
+						if (xcdepth <= 0)
+						{
+							ECHO;
+							BEGIN(SQL);
+							token_start = NULL;
+						}
+						else
+						{
+							xcdepth--;
+							fputs("*_/", yyout);
+						}
+					}
+					else if (state_before_str_start == C)
+					{
+						ECHO;
+						BEGIN(C);
+						token_start = NULL;
+					}
 				}
 
-<xcc,xcsql>{
 {xcinside}		{
 					ECHO;
 				}
@@ -482,7 +484,7 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
 <<EOF>>			{
 					mmfatal(PARSE_ERROR, "unterminated /* comment");
 				}
-} /* <xcc,xcsql> */
+} /* <xc> */
 
 <SQL>{
 {xbstart}		{
-- 
2.22.0

#29

Tom Lane

tgl@sss.pgh.pa.us

almost 6 years ago

In reply to: John Naylor (#28)

Re: benchmarking Flex practices

John Naylor <john.naylor@2ndquadrant.com> writes:

> [ v11 patch ]

I pushed this with some small cosmetic adjustments.

One non-cosmetic adjustment I experimented with was to change
str_udeescape() to overwrite the source string in-place, since
we know that's modifiable storage and de-escaping can't make
the string longer. I reasoned that saving a palloc() might help
reduce the extra cost of UESCAPE processing. It didn't seem to
move the needle much though, so I didn't commit it that way.
A positive reason to keep the API as it stands is that if we
do something about the idea of allowing Unicode strings in
non-UTF8 backend encodings, that'd likely break the assumption
about how the string can't get longer.
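
To illustrate the in-place idea, here is a minimal sketch. It is not the real str_udeescape(): the name udeescape_inplace() and its helpers are hypothetical, it assumes UTF-8 output, and it ignores surrogate pairs. It relies on each escape (5 or 8 source bytes) producing at most 3 or 4 output bytes, so the write pointer never passes the read pointer:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: read n hex digits at p into *cp; -1 on failure. */
static int
read_hex(const char *p, int n, unsigned long *cp)
{
    unsigned long v = 0;

    for (int i = 0; i < n; i++)
    {
        unsigned char c = (unsigned char) p[i];
        int d;

        if (c >= '0' && c <= '9')
            d = c - '0';
        else if (c >= 'a' && c <= 'f')
            d = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F')
            d = c - 'A' + 10;
        else
            return -1;          /* also stops at a premature '\0' */
        v = (v << 4) | (unsigned long) d;
    }
    *cp = v;
    return 0;
}

/* Hypothetical helper: encode cp as UTF-8 at dst; returns bytes written. */
static size_t
put_utf8(char *dst, unsigned long cp)
{
    if (cp < 0x80)
    {
        dst[0] = (char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        dst[0] = (char) (0xC0 | (cp >> 6));
        dst[1] = (char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        dst[0] = (char) (0xE0 | (cp >> 12));
        dst[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (char) (0x80 | (cp & 0x3F));
        return 3;
    }
    dst[0] = (char) (0xF0 | (cp >> 18));
    dst[1] = (char) (0x80 | ((cp >> 12) & 0x3F));
    dst[2] = (char) (0x80 | ((cp >> 6) & 0x3F));
    dst[3] = (char) (0x80 | (cp & 0x3F));
    return 4;
}

/*
 * De-escape s in place, using esc as the escape character.
 * Accepts escXXXX, esc+XXXXXX and a doubled esc.  Returns 0 on success,
 * -1 on an invalid escape (s is then left partially rewritten).
 */
int
udeescape_inplace(char *s, char esc)
{
    char *in = s;
    char *out = s;

    while (*in)
    {
        unsigned long cp;

        if (*in != esc)
        {
            *out++ = *in++;
            continue;
        }
        if (in[1] == esc)       /* doubled escape: one literal char */
        {
            *out++ = esc;
            in += 2;
            continue;
        }
        if (in[1] == '+')
        {
            if (read_hex(in + 2, 6, &cp) != 0 || cp > 0x10FFFF)
                return -1;
            in += 8;
        }
        else
        {
            if (read_hex(in + 1, 4, &cp) != 0)
                return -1;
            in += 5;
        }
        if (cp == 0)            /* would embed a terminator; reject */
            return -1;
        out += put_utf8(out, cp);
    }
    *out = '\0';
    return 0;
}
```

If the target encoding could expand an escape beyond the bytes it occupies in the source, the "out never passes in" invariant would no longer hold, which is the concern about non-UTF8 encodings raised above.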

I'm about to go off and look at the non-UTF8 idea, btw.

regards, tom lane

#30

John Naylor

john.naylor@2ndquadrant.com

almost 6 years ago

In reply to: Tom Lane (#29)

Re: benchmarking Flex practices

On Tue, Jan 14, 2020 at 4:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> John Naylor <john.naylor@2ndquadrant.com> writes:
> > [ v11 patch ]
>
> I pushed this with some small cosmetic adjustments.

Thanks for your help hacking on the token filter.
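
For anyone reading the thread later, the token filter approach can be sketched roughly as below. This is a toy model rather than the committed code: the token enum, struct filter, raw_next() and filtered_next() are all hypothetical names, and the input is a plain array terminated by T_EOF instead of a Flex scanner. It shows a UIDENT or USCONST token being collapsed with a following UESCAPE and SCONST into a plain IDENT or SCONST:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds, loosely modeled on the names in the patch. */
enum tok { T_EOF, T_IDENT, T_UIDENT, T_SCONST, T_USCONST, T_UESCAPE, T_ERROR };

struct token
{
    enum tok    type;
    const char *text;
};

struct filter
{
    const struct token *input;  /* array terminated by a T_EOF entry */
    size_t      pos;            /* index of the next unread token */
    char        esc;            /* escape char of the last U& token */
};

/* Return the next raw token; never advances past T_EOF. */
static struct token
raw_next(struct filter *f)
{
    struct token t = f->input[f->pos];

    if (t.type != T_EOF)
        f->pos++;
    return t;
}

/* Same acceptance rule as check_uescapechar() in the patch. */
static int
valid_escape(const char *s)
{
    unsigned char c;

    if (s == NULL || strlen(s) != 1)
        return 0;
    c = (unsigned char) s[0];
    return !(isxdigit(c) || c == '+' || c == '\'' || c == '"' ||
             c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f');
}

/* Filter layer: the grammar only ever sees IDENT and SCONST. */
struct token
filtered_next(struct filter *f)
{
    struct token cur = raw_next(f);

    if (cur.type != T_UIDENT && cur.type != T_USCONST)
        return cur;             /* no lookahead needed */

    f->esc = '\\';              /* default escape character */

    /* Peek at the following token without consuming it. */
    if (f->input[f->pos].type == T_UESCAPE)
    {
        struct token third;

        f->pos++;               /* consume UESCAPE */
        third = raw_next(f);    /* must be a one-character SCONST */
        if (third.type != T_SCONST || !valid_escape(third.text))
        {
            cur.type = T_ERROR;
            return cur;
        }
        f->esc = third.text[0];
    }

    cur.type = (cur.type == T_UIDENT) ? T_IDENT : T_SCONST;
    return cur;
}
```

The real filter also has to save and restore the scanner's output variables around each lookahead call, which this array-based sketch does not need to do.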

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services