Invalid "trailing junk" error message when non-English letters are used
Hi hackers,
When error "trailing junk after numeric literal" occurs at a number
followed by a symbol that is presented by more than one byte, that symbol
in the error message is not displayed correctly. Instead of that symbol
there is only its first byte. That makes the error message an invalid
UTF-8 (or whatever encoding is set). The whole log file where this error
message goes also becomes invalid. That could lead to problems with
reading logs. You can see an invalid message by trying "SELECT 123ä;".
Rejecting trailing junk after numeric literals was introduced in commit
2549f066 to prevent scanning a number immediately followed by an
identifier without whitespace as number and identifier. All the tokens
that made to catch such cases match a numeric literal and the next byte,
and that is where the problem comes from. I thought that it could be fixed
just by using tokens that match a numeric literal immediately followed by
an identifier, not only one byte. This also improves error messages in
cases with English letters. After these changes, for "SELECT 123abc;" the
error message will say that the error appeared at or near "123abc" instead
of "123a".
I've attached the patch. Are there any pitfalls I can't see? It just keeps
bothering me why wasn't it done from the beginning. Matching the whole
identifier after a numeric literal just seems more obvious to me than
matching its first byte.
Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/
Attachments:
v1-0001-Improve-error-message-for-rejecting-trailing-junk.patchtext/x-patch; charset=UTF-8; name=v1-0001-Improve-error-message-for-rejecting-trailing-junk.patchDownload
From 281303312471b4ef831f57d124208ce1699aed54 Mon Sep 17 00:00:00 2001
From: Karina Litskevich <litskevichkarina@gmail.com>
Date: Tue, 27 Aug 2024 17:01:49 +0300
Subject: [PATCH v1] Improve error message for rejecting trailing junk after
numeric literals
Rejecting trailing junk after numeric literals was introduced in
commit 2549f066 to prevent scanning a number immediately followed
by an identifier without whitespace as number and identifier.
Unfortunately, all the tokens made to catch such numeric literals
followed by non-digits match a numeric literal and the next byte. The
lexemes found by these tokens are broken in case the next symbol after
a numeric literal is presented by several bytes as only the first byte
of the symbol gets to the lexeme. When this lexeme is then printed as
a part of an error message that message became broken too along with
the whole log file where it goes.
This commit fixes the problem by using tokens that match a numeric
literal immediately followed by an identifier, not only one byte. This
also improves error messages in cases with English letters. For 123abc
the error message now will say that the error appeared at or near
"123abc" instead of "123a".
---
src/backend/parser/scan.l | 14 +++++++-------
src/fe_utils/psqlscan.l | 14 +++++++-------
src/interfaces/ecpg/preproc/pgc.l | 14 +++++++-------
src/test/regress/expected/numerology.out | 14 +++++++++-----
src/test/regress/sql/numerology.sql | 1 +
5 files changed, 31 insertions(+), 26 deletions(-)
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index f74059e7b0..448f9b5afe 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -412,16 +412,16 @@ numericfail {decinteger}\.\.
real ({decinteger}|{numeric})[Ee][-+]?{decinteger}
realfail ({decinteger}|{numeric})[Ee][-+]
-decinteger_junk {decinteger}{ident_start}
-hexinteger_junk {hexinteger}{ident_start}
-octinteger_junk {octinteger}{ident_start}
-bininteger_junk {bininteger}{ident_start}
-numeric_junk {numeric}{ident_start}
-real_junk {real}{ident_start}
+decinteger_junk {decinteger}{identifier}
+hexinteger_junk {hexinteger}{identifier}
+octinteger_junk {octinteger}{identifier}
+bininteger_junk {bininteger}{identifier}
+numeric_junk {numeric}{identifier}
+real_junk {real}{identifier}
/* Positional parameters don't accept underscores. */
param \${decdigit}+
-param_junk \${decdigit}+{ident_start}
+param_junk \${decdigit}+{identifier}
other .
diff --git a/src/fe_utils/psqlscan.l b/src/fe_utils/psqlscan.l
index ddc4658b92..282f117b81 100644
--- a/src/fe_utils/psqlscan.l
+++ b/src/fe_utils/psqlscan.l
@@ -348,16 +348,16 @@ numericfail {decinteger}\.\.
real ({decinteger}|{numeric})[Ee][-+]?{decinteger}
realfail ({decinteger}|{numeric})[Ee][-+]
-decinteger_junk {decinteger}{ident_start}
-hexinteger_junk {hexinteger}{ident_start}
-octinteger_junk {octinteger}{ident_start}
-bininteger_junk {bininteger}{ident_start}
-numeric_junk {numeric}{ident_start}
-real_junk {real}{ident_start}
+decinteger_junk {decinteger}{identifier}
+hexinteger_junk {hexinteger}{identifier}
+octinteger_junk {octinteger}{identifier}
+bininteger_junk {bininteger}{identifier}
+numeric_junk {numeric}{identifier}
+real_junk {real}{identifier}
/* Positional parameters don't accept underscores. */
param \${decdigit}+
-param_junk \${decdigit}+{ident_start}
+param_junk \${decdigit}+{identifier}
/* psql-specific: characters allowed in variable names */
variable_char [A-Za-z\200-\377_0-9]
diff --git a/src/interfaces/ecpg/preproc/pgc.l b/src/interfaces/ecpg/preproc/pgc.l
index f363a34659..7887773277 100644
--- a/src/interfaces/ecpg/preproc/pgc.l
+++ b/src/interfaces/ecpg/preproc/pgc.l
@@ -381,16 +381,16 @@ numericfail {decinteger}\.\.
real ({decinteger}|{numeric})[Ee][-+]?{decinteger}
realfail ({decinteger}|{numeric})[Ee][-+]
-decinteger_junk {decinteger}{ident_start}
-hexinteger_junk {hexinteger}{ident_start}
-octinteger_junk {octinteger}{ident_start}
-bininteger_junk {bininteger}{ident_start}
-numeric_junk {numeric}{ident_start}
-real_junk {real}{ident_start}
+decinteger_junk {decinteger}{identifier}
+hexinteger_junk {hexinteger}{identifier}
+octinteger_junk {octinteger}{identifier}
+bininteger_junk {bininteger}{identifier}
+numeric_junk {numeric}{identifier}
+real_junk {real}{identifier}
/* Positional parameters don't accept underscores. */
param \${decdigit}+
-param_junk \${decdigit}+{ident_start}
+param_junk \${decdigit}+{identifier}
/* special characters for other dbms */
/* we have to react differently in compat mode */
diff --git a/src/test/regress/expected/numerology.out b/src/test/regress/expected/numerology.out
index 717a237df9..f4bd6fbfee 100644
--- a/src/test/regress/expected/numerology.out
+++ b/src/test/regress/expected/numerology.out
@@ -171,9 +171,13 @@ SELECT -0x8000000000000001;
-- error cases
SELECT 123abc;
-ERROR: trailing junk after numeric literal at or near "123a"
+ERROR: trailing junk after numeric literal at or near "123abc"
LINE 1: SELECT 123abc;
^
+SELECT 1ä;
+ERROR: trailing junk after numeric literal at or near "1ä"
+LINE 1: SELECT 1ä;
+ ^
SELECT 0x0o;
ERROR: trailing junk after numeric literal at or near "0x0o"
LINE 1: SELECT 0x0o;
@@ -322,7 +326,7 @@ ERROR: trailing junk after numeric literal at or near "100_"
LINE 1: SELECT 100_;
^
SELECT 100__000;
-ERROR: trailing junk after numeric literal at or near "100_"
+ERROR: trailing junk after numeric literal at or near "100__000"
LINE 1: SELECT 100__000;
^
SELECT _1_000.5;
@@ -334,7 +338,7 @@ ERROR: trailing junk after numeric literal at or near "1_000_"
LINE 1: SELECT 1_000_.5;
^
SELECT 1_000._5;
-ERROR: trailing junk after numeric literal at or near "1_000._"
+ERROR: trailing junk after numeric literal at or near "1_000._5"
LINE 1: SELECT 1_000._5;
^
SELECT 1_000.5_;
@@ -342,11 +346,11 @@ ERROR: trailing junk after numeric literal at or near "1_000.5_"
LINE 1: SELECT 1_000.5_;
^
SELECT 1_000.5e_1;
-ERROR: trailing junk after numeric literal at or near "1_000.5e"
+ERROR: trailing junk after numeric literal at or near "1_000.5e_1"
LINE 1: SELECT 1_000.5e_1;
^
PREPARE p1 AS SELECT $0_1;
-ERROR: trailing junk after parameter at or near "$0_"
+ERROR: trailing junk after parameter at or near "$0_1"
LINE 1: PREPARE p1 AS SELECT $0_1;
^
--
diff --git a/src/test/regress/sql/numerology.sql b/src/test/regress/sql/numerology.sql
index 3ae491cc98..d1a42b046c 100644
--- a/src/test/regress/sql/numerology.sql
+++ b/src/test/regress/sql/numerology.sql
@@ -44,6 +44,7 @@ SELECT -0x8000000000000001;
-- error cases
SELECT 123abc;
+SELECT 1ä;
SELECT 0x0o;
SELECT 0.a;
SELECT 0.0a;
--
2.34.1
Hi, Karina!
On Tue, 27 Aug 2024 at 19:06, Karina Litskevich <litskevichkarina@gmail.com>
wrote:
Hi hackers,
When error "trailing junk after numeric literal" occurs at a number
followed by a symbol that is presented by more than one byte, that symbol
in the error message is not displayed correctly. Instead of that symbol
there is only its first byte. That makes the error message an invalid
UTF-8 (or whatever encoding is set). The whole log file where this error
message goes also becomes invalid. That could lead to problems with
reading logs. You can see an invalid message by trying "SELECT 123ä;".Rejecting trailing junk after numeric literals was introduced in commit
2549f066 to prevent scanning a number immediately followed by an
identifier without whitespace as number and identifier. All the tokens
that made to catch such cases match a numeric literal and the next byte,
and that is where the problem comes from. I thought that it could be fixed
just by using tokens that match a numeric literal immediately followed by
an identifier, not only one byte. This also improves error messages in
cases with English letters. After these changes, for "SELECT 123abc;" the
error message will say that the error appeared at or near "123abc" instead
of "123a".I've attached the patch. Are there any pitfalls I can't see? It just keeps
bothering me why wasn't it done from the beginning. Matching the whole
identifier after a numeric literal just seems more obvious to me than
matching its first byte.
I see the following compile time warnings:
scan.l:1062: warning, rule cannot be matched
scan.l:1066: warning, rule cannot be matched
scan.l:1070: warning, rule cannot be matched
pgc.l:1030: warning, rule cannot be matched
pgc.l:1033: warning, rule cannot be matched
pgc.l:1036: warning, rule cannot be matched
psqlscan.l:905: warning, rule cannot be matched
psqlscan.l:908: warning, rule cannot be matched
psqlscan.l:911: warning, rule cannot be matched
FWIW output of the whole string in the error message doesnt' look nice to
me, but other places of code do this anyway e.g:
select ('1'||repeat('p',1000000))::integer;
This may be worth fixing.
Regards,
Pavel Borisov
Supabase
On Wed, Aug 28, 2024 at 12:06 AM Pavel Borisov <pashkin.elfe@gmail.com>
wrote:
I see the following compile time warnings:
scan.l:1062: warning, rule cannot be matched
scan.l:1066: warning, rule cannot be matched
scan.l:1070: warning, rule cannot be matched
pgc.l:1030: warning, rule cannot be matched
pgc.l:1033: warning, rule cannot be matched
pgc.l:1036: warning, rule cannot be matched
psqlscan.l:905: warning, rule cannot be matched
psqlscan.l:908: warning, rule cannot be matched
psqlscan.l:911: warning, rule cannot be matched
Thanks for the feedback!
I somehow missed these warnings, my bad. The problem is with queries like
"select 0x12junk;". In master "0x" matches decinteger_junk token and
"0x12j" matches hexinteger_junk token and flex chooses the longest match,
no conflict. But with the patch "0x12junk" matches both decinteger_junk
(decinteger "0" + identifier "x12junk") and hexinteger_junk (hexinteger
"0x12" + identifier "junk"). Since any match to hexinteger_junk also
matches decinteger_junk, and the rule for hexinteger_junk is below the
rule for decinteger_junk, it's never reached.
I see the two solutions here: either move the rule for decinteger_junk
below the rules for hexinteger_junk, octinteger_junk and bininteger_junk,
or just use a single rule decinteger_junk for all these cases, since the
error message is the same anyway. I implemented the latter in the second
version of the patch, also renamed this common rule to integer_junk.
Additionally, I noticed that this patch is going to change error messages
in some cases, though I don't think it's a big deal. Example:
Without patch:
postgres=# select 0xyz;
ERROR: invalid hexadecimal integer at or near "0x"
With patch:
postgres=# select 0xyz;
ERROR: trailing junk after numeric literal at or near "0xyz"
FWIW output of the whole string in the error message doesnt' look nice to
me, but other places of code do this anyway e.g:
select ('1'||repeat('p',1000000))::integer;
This may be worth fixing.
That's interesting. I didn't know we could do that to create a long error
message. At first I thought that it's not a problem for error messages
from the scanner, since its "at or near" string cannot be longer than the
query typed in psql or written in a script file so it won't be enormously
big. But that's just not true, because we can send a generated query.
Something like that:
With patch:
postgres=# select 'select '||'1'||repeat('p',1000000) \gexec
ERROR: trailing junk after numeric literal at or near "1ppp<lots of p>"
And another query that leads to this without patch:
postgres=# select 'select 1'||repeat('@',1000000)||'1' \gexec
ERROR: operator too long at or near "@@@<lots of @>"
It would be nice to prevent such long strings in error messages. Maybe a
GUC variable to set the maximum length for such strings could work. But
how do we determine all places where it is needed?
Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/
Attachments:
v2-0001-Improve-error-message-for-rejecting-trailing-junk.patchtext/x-patch; charset=UTF-8; name=v2-0001-Improve-error-message-for-rejecting-trailing-junk.patchDownload
From e0d49c4e81054e5c66be1f78cb57ad2ae87fa02d Mon Sep 17 00:00:00 2001
From: Karina Litskevich <litskevichkarina@gmail.com>
Date: Wed, 28 Aug 2024 11:52:25 +0300
Subject: [PATCH v2] Improve error message for rejecting trailing junk after
numeric literals
Rejecting trailing junk after numeric literals was introduced in
commit 2549f066 to prevent scanning a number immediately followed
by an identifier without whitespace as number and identifier.
Unfortunately, all the tokens made to catch such numeric literals
followed by non-digits match a numeric literal and the next byte. The
lexemes found by these tokens are broken in case the next symbol after
a numeric literal is presented by several bytes as only the first byte
of the symbol gets to the lexeme. When this lexeme is then printed as
a part of an error message that message became broken too along with
the whole log file where it goes.
This commit fixes the problem by using tokens that match a numeric
literal immediately followed by an identifier, not only one byte. This
also improves error messages in cases with English letters. For 123abc
the error message now will say that the error appeared at or near
"123abc" instead of "123a".
Since anything that matches hexinteger, octinteger or bininteger
followed by identifier also matches decinteger followed by identifier,
use one common token for that.
---
src/backend/parser/scan.l | 25 +++++-------------------
src/fe_utils/psqlscan.l | 22 +++++----------------
src/interfaces/ecpg/preproc/pgc.l | 22 +++++----------------
src/test/regress/expected/numerology.out | 14 ++++++++-----
src/test/regress/sql/numerology.sql | 1 +
5 files changed, 25 insertions(+), 59 deletions(-)
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index f74059e7b0..028f938789 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -412,16 +412,13 @@ numericfail {decinteger}\.\.
real ({decinteger}|{numeric})[Ee][-+]?{decinteger}
realfail ({decinteger}|{numeric})[Ee][-+]
-decinteger_junk {decinteger}{ident_start}
-hexinteger_junk {hexinteger}{ident_start}
-octinteger_junk {octinteger}{ident_start}
-bininteger_junk {bininteger}{ident_start}
-numeric_junk {numeric}{ident_start}
-real_junk {real}{ident_start}
+integer_junk {decinteger}{identifier}
+numeric_junk {numeric}{identifier}
+real_junk {real}{identifier}
/* Positional parameters don't accept underscores. */
param \${decdigit}+
-param_junk \${decdigit}+{ident_start}
+param_junk \${decdigit}+{identifier}
other .
@@ -1055,19 +1052,7 @@ other .
SET_YYLLOC();
yyerror("trailing junk after numeric literal");
}
-{decinteger_junk} {
- SET_YYLLOC();
- yyerror("trailing junk after numeric literal");
- }
-{hexinteger_junk} {
- SET_YYLLOC();
- yyerror("trailing junk after numeric literal");
- }
-{octinteger_junk} {
- SET_YYLLOC();
- yyerror("trailing junk after numeric literal");
- }
-{bininteger_junk} {
+{integer_junk} {
SET_YYLLOC();
yyerror("trailing junk after numeric literal");
}
diff --git a/src/fe_utils/psqlscan.l b/src/fe_utils/psqlscan.l
index ddc4658b92..70b6561548 100644
--- a/src/fe_utils/psqlscan.l
+++ b/src/fe_utils/psqlscan.l
@@ -348,16 +348,13 @@ numericfail {decinteger}\.\.
real ({decinteger}|{numeric})[Ee][-+]?{decinteger}
realfail ({decinteger}|{numeric})[Ee][-+]
-decinteger_junk {decinteger}{ident_start}
-hexinteger_junk {hexinteger}{ident_start}
-octinteger_junk {octinteger}{ident_start}
-bininteger_junk {bininteger}{ident_start}
-numeric_junk {numeric}{ident_start}
-real_junk {real}{ident_start}
+integer_junk {decinteger}{identifier}
+numeric_junk {numeric}{identifier}
+real_junk {real}{identifier}
/* Positional parameters don't accept underscores. */
param \${decdigit}+
-param_junk \${decdigit}+{ident_start}
+param_junk \${decdigit}+{identifier}
/* psql-specific: characters allowed in variable names */
variable_char [A-Za-z\200-\377_0-9]
@@ -898,16 +895,7 @@ other .
{realfail} {
ECHO;
}
-{decinteger_junk} {
- ECHO;
- }
-{hexinteger_junk} {
- ECHO;
- }
-{octinteger_junk} {
- ECHO;
- }
-{bininteger_junk} {
+{integer_junk} {
ECHO;
}
{numeric_junk} {
diff --git a/src/interfaces/ecpg/preproc/pgc.l b/src/interfaces/ecpg/preproc/pgc.l
index f363a34659..064af29b4d 100644
--- a/src/interfaces/ecpg/preproc/pgc.l
+++ b/src/interfaces/ecpg/preproc/pgc.l
@@ -381,16 +381,13 @@ numericfail {decinteger}\.\.
real ({decinteger}|{numeric})[Ee][-+]?{decinteger}
realfail ({decinteger}|{numeric})[Ee][-+]
-decinteger_junk {decinteger}{ident_start}
-hexinteger_junk {hexinteger}{ident_start}
-octinteger_junk {octinteger}{ident_start}
-bininteger_junk {bininteger}{ident_start}
-numeric_junk {numeric}{ident_start}
-real_junk {real}{ident_start}
+integer_junk {decinteger}{identifier}
+numeric_junk {numeric}{identifier}
+real_junk {real}{identifier}
/* Positional parameters don't accept underscores. */
param \${decdigit}+
-param_junk \${decdigit}+{ident_start}
+param_junk \${decdigit}+{identifier}
/* special characters for other dbms */
/* we have to react differently in compat mode */
@@ -1023,16 +1020,7 @@ cppline {space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
* Note that some trailing junk is valid in C (such as 100LL), so we
* contain this to SQL mode.
*/
-{decinteger_junk} {
- mmfatal(PARSE_ERROR, "trailing junk after numeric literal");
- }
-{hexinteger_junk} {
- mmfatal(PARSE_ERROR, "trailing junk after numeric literal");
- }
-{octinteger_junk} {
- mmfatal(PARSE_ERROR, "trailing junk after numeric literal");
- }
-{bininteger_junk} {
+{integer_junk} {
mmfatal(PARSE_ERROR, "trailing junk after numeric literal");
}
{numeric_junk} {
diff --git a/src/test/regress/expected/numerology.out b/src/test/regress/expected/numerology.out
index 717a237df9..f4bd6fbfee 100644
--- a/src/test/regress/expected/numerology.out
+++ b/src/test/regress/expected/numerology.out
@@ -171,9 +171,13 @@ SELECT -0x8000000000000001;
-- error cases
SELECT 123abc;
-ERROR: trailing junk after numeric literal at or near "123a"
+ERROR: trailing junk after numeric literal at or near "123abc"
LINE 1: SELECT 123abc;
^
+SELECT 1ä;
+ERROR: trailing junk after numeric literal at or near "1ä"
+LINE 1: SELECT 1ä;
+ ^
SELECT 0x0o;
ERROR: trailing junk after numeric literal at or near "0x0o"
LINE 1: SELECT 0x0o;
@@ -322,7 +326,7 @@ ERROR: trailing junk after numeric literal at or near "100_"
LINE 1: SELECT 100_;
^
SELECT 100__000;
-ERROR: trailing junk after numeric literal at or near "100_"
+ERROR: trailing junk after numeric literal at or near "100__000"
LINE 1: SELECT 100__000;
^
SELECT _1_000.5;
@@ -334,7 +338,7 @@ ERROR: trailing junk after numeric literal at or near "1_000_"
LINE 1: SELECT 1_000_.5;
^
SELECT 1_000._5;
-ERROR: trailing junk after numeric literal at or near "1_000._"
+ERROR: trailing junk after numeric literal at or near "1_000._5"
LINE 1: SELECT 1_000._5;
^
SELECT 1_000.5_;
@@ -342,11 +346,11 @@ ERROR: trailing junk after numeric literal at or near "1_000.5_"
LINE 1: SELECT 1_000.5_;
^
SELECT 1_000.5e_1;
-ERROR: trailing junk after numeric literal at or near "1_000.5e"
+ERROR: trailing junk after numeric literal at or near "1_000.5e_1"
LINE 1: SELECT 1_000.5e_1;
^
PREPARE p1 AS SELECT $0_1;
-ERROR: trailing junk after parameter at or near "$0_"
+ERROR: trailing junk after parameter at or near "$0_1"
LINE 1: PREPARE p1 AS SELECT $0_1;
^
--
diff --git a/src/test/regress/sql/numerology.sql b/src/test/regress/sql/numerology.sql
index 3ae491cc98..d1a42b046c 100644
--- a/src/test/regress/sql/numerology.sql
+++ b/src/test/regress/sql/numerology.sql
@@ -44,6 +44,7 @@ SELECT -0x8000000000000001;
-- error cases
SELECT 123abc;
+SELECT 1ä;
SELECT 0x0o;
SELECT 0.a;
SELECT 0.0a;
--
2.34.1
Karina Litskevich <litskevichkarina@gmail.com> writes:
I see the two solutions here: either move the rule for decinteger_junk
below the rules for hexinteger_junk, octinteger_junk and bininteger_junk,
or just use a single rule decinteger_junk for all these cases, since the
error message is the same anyway. I implemented the latter in the second
version of the patch, also renamed this common rule to integer_junk.
That seems reasonable, but IMO this code was unacceptably
undercommented before and what you've done has made it worse.
We really need a comment block associated with the flex macros,
perhaps along the lines of
/*
* An identifier immediately following a numeric literal is disallowed
* because in some cases it's ambiguous what is meant: for example,
* 0x1234 could be either a hexinteger or a decinteger "0" and an
* identifier "x1234". We can detect such problems by seeing if
* integer_junk matches a longer substring than any of the XXXinteger
* patterns. (One "junk" pattern is sufficient because this will match
* all the same strings we'd match with {hexinteger}{identifier} etc.)
* Note that the rule for integer_junk must appear after the ones for
* XXXinteger to make this work correctly.
*/
(Hmm, actually, is that last sentence true? My flex is a bit rusty.)
param_junk really needs a similar comment, or maybe we could put
all the XXX_junk macros together and use one comment for all.
Additionally, I noticed that this patch is going to change error messages
in some cases, though I don't think it's a big deal. Example:
Without patch:
postgres=# select 0xyz;
ERROR: invalid hexadecimal integer at or near "0x"
With patch:
postgres=# select 0xyz;
ERROR: trailing junk after numeric literal at or near "0xyz"
That's sort of annoying, but I don't really see a better way,
or at least not one that's worth the effort.
FWIW output of the whole string in the error message doesnt' look nice to
me, but other places of code do this anyway e.g:
select ('1'||repeat('p',1000000))::integer;
This may be worth fixing.
I think this is nonsense: we are already in the habit of repeating the
whole failing query string in the STATEMENT field. In any case it's
not something for this patch to worry about.
regards, tom lane
On Thu, Sep 5, 2024 at 1:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Karina Litskevich <litskevichkarina@gmail.com> writes:
I see the two solutions here: either move the rule for decinteger_junk
below the rules for hexinteger_junk, octinteger_junk and
bininteger_junk,
or just use a single rule decinteger_junk for all these cases, since the
error message is the same anyway. I implemented the latter in the second
version of the patch, also renamed this common rule to integer_junk.That seems reasonable, but IMO this code was unacceptably
undercommented before and what you've done has made it worse.
We really need a comment block associated with the flex macros,
perhaps along the lines of/*
* An identifier immediately following a numeric literal is disallowed
* because in some cases it's ambiguous what is meant: for example,
* 0x1234 could be either a hexinteger or a decinteger "0" and an
* identifier "x1234". We can detect such problems by seeing if
* integer_junk matches a longer substring than any of the XXXinteger
* patterns. (One "junk" pattern is sufficient because this will match
* all the same strings we'd match with {hexinteger}{identifier} etc.)
* Note that the rule for integer_junk must appear after the ones for
* XXXinteger to make this work correctly.
*/
Thank you, this piece of code definitely needs a comment.
(Hmm, actually, is that last sentence true? My flex is a bit rusty.)
Yes, the rule for integer_junk must appear after the ones for XXXinteger.
Here is a quote from
https://ftp.gnu.org/old-gnu/Manuals/flex-2.5.4/html_mono/flex.html
"If it finds more than one match, it takes the one matching the most text
(...). If it finds two or more matches of the same length, the rule listed
first in the flex input file is chosen."
For example, 0x123 is matched by both integer_junk and hexinteger, and we
want the rule for hexinteger to be chosen, so we should place it before
the rule for integer_junk.
param_junk really needs a similar comment, or maybe we could put
all the XXX_junk macros together and use one comment for all.
In v3 of the patch I grouped all the *_junk rules together and included
the suggested comment with a little added something.
Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/
On Thu, Sep 5, 2024 at 6:07 PM Karina Litskevich
<litskevichkarina@gmail.com> wrote:
In v3 of the patch I grouped all the *_junk rules together and included
the suggested comment with a little added something.
Oops, I forgot to attach the patch, here it is.
Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/
Attachments:
v3-0001-Improve-error-message-for-rejecting-trailing-junk.patchtext/x-patch; charset=UTF-8; name=v3-0001-Improve-error-message-for-rejecting-trailing-junk.patchDownload
From b993d45d34c0bff676bf439e75dfd8f58244f8ed Mon Sep 17 00:00:00 2001
From: Karina Litskevich <litskevichkarina@gmail.com>
Date: Wed, 28 Aug 2024 11:52:25 +0300
Subject: [PATCH v3] Improve error message for rejecting trailing junk after
numeric literals
Rejecting trailing junk after numeric literals was introduced in
commit 2549f066 to prevent scanning a number immediately followed
by an identifier without whitespace as number and identifier.
Unfortunately, all the tokens made to catch such numeric literals
followed by non-digits match a numeric literal and the next byte. The
lexemes found by these tokens are broken in case the next symbol after
a numeric literal is presented by several bytes as only the first byte
of the symbol gets to the lexeme. When this lexeme is then printed as
a part of an error message that message became broken too along with
the whole log file where it goes.
This commit fixes the problem by using tokens that match a numeric
literal immediately followed by an identifier, not only one byte. This
also improves error messages in cases with English letters. For 123abc
the error message now will say that the error appeared at or near
"123abc" instead of "123a".
Since anything that matches hexinteger, octinteger or bininteger
followed by identifier also matches decinteger followed by identifier,
use one common rule for that.
---
src/backend/parser/scan.l | 43 ++++++++++++------------
src/fe_utils/psqlscan.l | 40 ++++++++++++----------
src/interfaces/ecpg/preproc/pgc.l | 40 ++++++++++++----------
src/test/regress/expected/numerology.out | 14 +++++---
src/test/regress/sql/numerology.sql | 1 +
5 files changed, 76 insertions(+), 62 deletions(-)
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index f74059e7b0..1a7c9b9d8f 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -412,16 +412,29 @@ numericfail {decinteger}\.\.
real ({decinteger}|{numeric})[Ee][-+]?{decinteger}
realfail ({decinteger}|{numeric})[Ee][-+]
-decinteger_junk {decinteger}{ident_start}
-hexinteger_junk {hexinteger}{ident_start}
-octinteger_junk {octinteger}{ident_start}
-bininteger_junk {bininteger}{ident_start}
-numeric_junk {numeric}{ident_start}
-real_junk {real}{ident_start}
-
/* Positional parameters don't accept underscores. */
param \${decdigit}+
-param_junk \${decdigit}+{ident_start}
+
+/*
+ * An identifier immediately following a numeric literal is disallowed because
+ * in some cases it's ambiguous what is meant: for example, 0x1234 could be
+ * either a hexinteger or a decinteger "0" and an identifier "x1234". We can
+ * detect such problems by seeing if integer_junk matches a longer substring
+ * than any of the XXXinteger patterns (decinteger, hexinteger, octinteger,
+ * and bininteger). One "junk" pattern is sufficient because
+ * {decinteger}{identifier} will match all the same strings we'd match with
+ * {hexinteger}{identifier} etc.
+ *
+ * Note that the rule for integer_junk must appear after the ones for
+ * XXXinteger to make this work correctly.
+ *
+ * Also disallow strings matched by numeric_junk, real_junk and param_junk for
+ * consistency.
+ */
+integer_junk {decinteger}{identifier}
+numeric_junk {numeric}{identifier}
+real_junk {real}{identifier}
+param_junk \${decdigit}+{identifier}
other .
@@ -1055,19 +1068,7 @@ other .
SET_YYLLOC();
yyerror("trailing junk after numeric literal");
}
-{decinteger_junk} {
- SET_YYLLOC();
- yyerror("trailing junk after numeric literal");
- }
-{hexinteger_junk} {
- SET_YYLLOC();
- yyerror("trailing junk after numeric literal");
- }
-{octinteger_junk} {
- SET_YYLLOC();
- yyerror("trailing junk after numeric literal");
- }
-{bininteger_junk} {
+{integer_junk} {
SET_YYLLOC();
yyerror("trailing junk after numeric literal");
}
diff --git a/src/fe_utils/psqlscan.l b/src/fe_utils/psqlscan.l
index ddc4658b92..f1c24d505b 100644
--- a/src/fe_utils/psqlscan.l
+++ b/src/fe_utils/psqlscan.l
@@ -348,16 +348,29 @@ numericfail {decinteger}\.\.
real ({decinteger}|{numeric})[Ee][-+]?{decinteger}
realfail ({decinteger}|{numeric})[Ee][-+]
-decinteger_junk {decinteger}{ident_start}
-hexinteger_junk {hexinteger}{ident_start}
-octinteger_junk {octinteger}{ident_start}
-bininteger_junk {bininteger}{ident_start}
-numeric_junk {numeric}{ident_start}
-real_junk {real}{ident_start}
-
/* Positional parameters don't accept underscores. */
param \${decdigit}+
-param_junk \${decdigit}+{ident_start}
+
+/*
+ * An identifier immediately following a numeric literal is disallowed because
+ * in some cases it's ambiguous what is meant: for example, 0x1234 could be
+ * either a hexinteger or a decinteger "0" and an identifier "x1234". We can
+ * detect such problems by seeing if integer_junk matches a longer substring
+ * than any of the XXXinteger patterns (decinteger, hexinteger, octinteger,
+ * and bininteger). One "junk" pattern is sufficient because
+ * {decinteger}{identifier} will match all the same strings we'd match with
+ * {hexinteger}{identifier} etc.
+ *
+ * Note that the rule for integer_junk must appear after the ones for
+ * XXXinteger to make this work correctly.
+ *
+ * Also disallow strings matched by numeric_junk, real_junk and param_junk for
+ * consistency.
+ */
+integer_junk {decinteger}{identifier}
+numeric_junk {numeric}{identifier}
+real_junk {real}{identifier}
+param_junk \${decdigit}+{identifier}
/* psql-specific: characters allowed in variable names */
variable_char [A-Za-z\200-\377_0-9]
@@ -898,16 +911,7 @@ other .
{realfail} {
ECHO;
}
-{decinteger_junk} {
- ECHO;
- }
-{hexinteger_junk} {
- ECHO;
- }
-{octinteger_junk} {
- ECHO;
- }
-{bininteger_junk} {
+{integer_junk} {
ECHO;
}
{numeric_junk} {
diff --git a/src/interfaces/ecpg/preproc/pgc.l b/src/interfaces/ecpg/preproc/pgc.l
index f363a34659..54e7571ef2 100644
--- a/src/interfaces/ecpg/preproc/pgc.l
+++ b/src/interfaces/ecpg/preproc/pgc.l
@@ -381,16 +381,29 @@ numericfail {decinteger}\.\.
real ({decinteger}|{numeric})[Ee][-+]?{decinteger}
realfail ({decinteger}|{numeric})[Ee][-+]
-decinteger_junk {decinteger}{ident_start}
-hexinteger_junk {hexinteger}{ident_start}
-octinteger_junk {octinteger}{ident_start}
-bininteger_junk {bininteger}{ident_start}
-numeric_junk {numeric}{ident_start}
-real_junk {real}{ident_start}
-
/* Positional parameters don't accept underscores. */
param \${decdigit}+
-param_junk \${decdigit}+{ident_start}
+
+/*
+ * An identifier immediately following a numeric literal is disallowed because
+ * in some cases it's ambiguous what is meant: for example, 0x1234 could be
+ * either a hexinteger or a decinteger "0" and an identifier "x1234". We can
+ * detect such problems by seeing if integer_junk matches a longer substring
+ * than any of the XXXinteger patterns (decinteger, hexinteger, octinteger,
+ * and bininteger). One "junk" pattern is sufficient because
+ * {decinteger}{identifier} will match all the same strings we'd match with
+ * {hexinteger}{identifier} etc.
+ *
+ * Note that the rule for integer_junk must appear after the ones for
+ * XXXinteger to make this work correctly.
+ *
+ * Also disallow strings matched by numeric_junk, real_junk and param_junk for
+ * consistency.
+ */
+integer_junk {decinteger}{identifier}
+numeric_junk {numeric}{identifier}
+real_junk {real}{identifier}
+param_junk \${decdigit}+{identifier}
/* special characters for other dbms */
/* we have to react differently in compat mode */
@@ -1023,16 +1036,7 @@ cppline {space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})((\/\*[^*/]*\*+
* Note that some trailing junk is valid in C (such as 100LL), so we
* contain this to SQL mode.
*/
-{decinteger_junk} {
- mmfatal(PARSE_ERROR, "trailing junk after numeric literal");
- }
-{hexinteger_junk} {
- mmfatal(PARSE_ERROR, "trailing junk after numeric literal");
- }
-{octinteger_junk} {
- mmfatal(PARSE_ERROR, "trailing junk after numeric literal");
- }
-{bininteger_junk} {
+{integer_junk} {
mmfatal(PARSE_ERROR, "trailing junk after numeric literal");
}
{numeric_junk} {
diff --git a/src/test/regress/expected/numerology.out b/src/test/regress/expected/numerology.out
index 717a237df9..f4bd6fbfee 100644
--- a/src/test/regress/expected/numerology.out
+++ b/src/test/regress/expected/numerology.out
@@ -171,9 +171,13 @@ SELECT -0x8000000000000001;
-- error cases
SELECT 123abc;
-ERROR: trailing junk after numeric literal at or near "123a"
+ERROR: trailing junk after numeric literal at or near "123abc"
LINE 1: SELECT 123abc;
^
+SELECT 1ä;
+ERROR: trailing junk after numeric literal at or near "1ä"
+LINE 1: SELECT 1ä;
+ ^
SELECT 0x0o;
ERROR: trailing junk after numeric literal at or near "0x0o"
LINE 1: SELECT 0x0o;
@@ -322,7 +326,7 @@ ERROR: trailing junk after numeric literal at or near "100_"
LINE 1: SELECT 100_;
^
SELECT 100__000;
-ERROR: trailing junk after numeric literal at or near "100_"
+ERROR: trailing junk after numeric literal at or near "100__000"
LINE 1: SELECT 100__000;
^
SELECT _1_000.5;
@@ -334,7 +338,7 @@ ERROR: trailing junk after numeric literal at or near "1_000_"
LINE 1: SELECT 1_000_.5;
^
SELECT 1_000._5;
-ERROR: trailing junk after numeric literal at or near "1_000._"
+ERROR: trailing junk after numeric literal at or near "1_000._5"
LINE 1: SELECT 1_000._5;
^
SELECT 1_000.5_;
@@ -342,11 +346,11 @@ ERROR: trailing junk after numeric literal at or near "1_000.5_"
LINE 1: SELECT 1_000.5_;
^
SELECT 1_000.5e_1;
-ERROR: trailing junk after numeric literal at or near "1_000.5e"
+ERROR: trailing junk after numeric literal at or near "1_000.5e_1"
LINE 1: SELECT 1_000.5e_1;
^
PREPARE p1 AS SELECT $0_1;
-ERROR: trailing junk after parameter at or near "$0_"
+ERROR: trailing junk after parameter at or near "$0_1"
LINE 1: PREPARE p1 AS SELECT $0_1;
^
--
diff --git a/src/test/regress/sql/numerology.sql b/src/test/regress/sql/numerology.sql
index 3ae491cc98..d1a42b046c 100644
--- a/src/test/regress/sql/numerology.sql
+++ b/src/test/regress/sql/numerology.sql
@@ -44,6 +44,7 @@ SELECT -0x8000000000000001;
-- error cases
SELECT 123abc;
+SELECT 1ä;
SELECT 0x0o;
SELECT 0.a;
SELECT 0.0a;
--
2.34.1
Karina Litskevich <litskevichkarina@gmail.com> writes:
On Thu, Sep 5, 2024 at 6:07 PM Karina Litskevich
<litskevichkarina@gmail.com> wrote:In v3 of the patch I grouped all the *_junk rules together and included
the suggested comment with a little added something.
Oops, I forgot to attach the patch, here it is.
Pushed with a bit of further wordsmithing on the comment.
I left out the proposed new test case "SELECT 1ä;". The trouble
with that is it'd introduce an encoding dependency into the test.
For example, it'd likely fail with some other error message in
a server encoding that lacks an equivalent to UTF8 "ä". While
we have methods for coping with such cases, it requires some
pushups, and I didn't see the value. The changes in existing
test case results are sufficient to show the patch does what
we want.
Also, while the bug exists in v15, the patch didn't apply at all.
I got lazy and just did the minimal s/ident_start/identifier/ change
in that branch, instead of back-patching the cosmetic aspects.
regards, tom lane
On Thu, 5 Sept 2024 at 20:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Karina Litskevich <litskevichkarina@gmail.com> writes:
On Thu, Sep 5, 2024 at 6:07 PM Karina Litskevich
<litskevichkarina@gmail.com> wrote:In v3 of the patch I grouped all the *_junk rules together and included
the suggested comment with a little added something.Oops, I forgot to attach the patch, here it is.
Pushed with a bit of further wordsmithing on the comment.
I left out the proposed new test case "SELECT 1ä;". The trouble
with that is it'd introduce an encoding dependency into the test.
For example, it'd likely fail with some other error message in
a server encoding that lacks an equivalent to UTF8 "ä". While
we have methods for coping with such cases, it requires some
pushups, and I didn't see the value. The changes in existing
test case results are sufficient to show the patch does what
we want.Also, while the bug exists in v15, the patch didn't apply at all.
I got lazy and just did the minimal s/ident_start/identifier/ change
in that branch, instead of back-patching the cosmetic aspects.
Good! Thank you!
Pavel