Invalid "trailing junk" error message when non-English letters are used

Started by Karina Litskevichalmost 2 years ago8 messageshackers

litskevichkarina@gmail.com

almost 2 years ago

Hi hackers,

When error "trailing junk after numeric literal" occurs at a number
followed by a symbol that is presented by more than one byte, that symbol
in the error message is not displayed correctly. Instead of that symbol
there is only its first byte. That makes the error message an invalid
UTF-8 (or whatever encoding is set). The whole log file where this error
message goes also becomes invalid. That could lead to problems with
reading logs. You can see an invalid message by trying "SELECT 123ä;".

Rejecting trailing junk after numeric literals was introduced in commit
2549f066 to prevent scanning a number immediately followed by an
identifier without whitespace as number and identifier. All the tokens
that made to catch such cases match a numeric literal and the next byte,
and that is where the problem comes from. I thought that it could be fixed
just by using tokens that match a numeric literal immediately followed by
an identifier, not only one byte. This also improves error messages in
cases with English letters. After these changes, for "SELECT 123abc;" the
error message will say that the error appeared at or near "123abc" instead
of "123a".

I've attached the patch. Are there any pitfalls I can't see? It just keeps
bothering me why wasn't it done from the beginning. Matching the whole
identifier after a numeric literal just seems more obvious to me than
matching its first byte.

Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/

Pavel Borisov

pashkin.elfe@gmail.com

almost 2 years ago

In reply to: Karina Litskevich (#1)

Re: Invalid "trailing junk" error message when non-English letters are used

Hi, Karina!

On Tue, 27 Aug 2024 at 19:06, Karina Litskevich <litskevichkarina@gmail.com>
wrote:

Hi hackers,

When error "trailing junk after numeric literal" occurs at a number
followed by a symbol that is presented by more than one byte, that symbol
in the error message is not displayed correctly. Instead of that symbol
there is only its first byte. That makes the error message an invalid
UTF-8 (or whatever encoding is set). The whole log file where this error
message goes also becomes invalid. That could lead to problems with
reading logs. You can see an invalid message by trying "SELECT 123ä;".

Rejecting trailing junk after numeric literals was introduced in commit
2549f066 to prevent scanning a number immediately followed by an
identifier without whitespace as number and identifier. All the tokens
that made to catch such cases match a numeric literal and the next byte,
and that is where the problem comes from. I thought that it could be fixed
just by using tokens that match a numeric literal immediately followed by
an identifier, not only one byte. This also improves error messages in
cases with English letters. After these changes, for "SELECT 123abc;" the
error message will say that the error appeared at or near "123abc" instead
of "123a".

I've attached the patch. Are there any pitfalls I can't see? It just keeps
bothering me why wasn't it done from the beginning. Matching the whole
identifier after a numeric literal just seems more obvious to me than
matching its first byte.

I see the following compile time warnings:
scan.l:1062: warning, rule cannot be matched
scan.l:1066: warning, rule cannot be matched
scan.l:1070: warning, rule cannot be matched
pgc.l:1030: warning, rule cannot be matched
pgc.l:1033: warning, rule cannot be matched
pgc.l:1036: warning, rule cannot be matched
psqlscan.l:905: warning, rule cannot be matched
psqlscan.l:908: warning, rule cannot be matched
psqlscan.l:911: warning, rule cannot be matched

FWIW output of the whole string in the error message doesnt' look nice to
me, but other places of code do this anyway e.g:
select ('1'||repeat('p',1000000))::integer;
This may be worth fixing.

Regards,
Pavel Borisov
Supabase

Karina Litskevich

litskevichkarina@gmail.com

almost 2 years ago

In reply to: Pavel Borisov (#2)

Re: Invalid "trailing junk" error message when non-English letters are used

On Wed, Aug 28, 2024 at 12:06 AM Pavel Borisov <pashkin.elfe@gmail.com>
wrote:

I see the following compile time warnings:
scan.l:1062: warning, rule cannot be matched
scan.l:1066: warning, rule cannot be matched
scan.l:1070: warning, rule cannot be matched
pgc.l:1030: warning, rule cannot be matched
pgc.l:1033: warning, rule cannot be matched
pgc.l:1036: warning, rule cannot be matched
psqlscan.l:905: warning, rule cannot be matched
psqlscan.l:908: warning, rule cannot be matched
psqlscan.l:911: warning, rule cannot be matched

Thanks for the feedback!

I somehow missed these warnings, my bad. The problem is with queries like
"select 0x12junk;". In master "0x" matches decinteger_junk token and
"0x12j" matches hexinteger_junk token and flex chooses the longest match,
no conflict. But with the patch "0x12junk" matches both decinteger_junk
(decinteger "0" + identifier "x12junk") and hexinteger_junk (hexinteger
"0x12" + identifier "junk"). Since any match to hexinteger_junk also
matches decinteger_junk, and the rule for hexinteger_junk is below the
rule for decinteger_junk, it's never reached.

I see the two solutions here: either move the rule for decinteger_junk
below the rules for hexinteger_junk, octinteger_junk and bininteger_junk,
or just use a single rule decinteger_junk for all these cases, since the
error message is the same anyway. I implemented the latter in the second
version of the patch, also renamed this common rule to integer_junk.

Additionally, I noticed that this patch is going to change error messages
in some cases, though I don't think it's a big deal. Example:

Without patch:
postgres=# select 0xyz;
ERROR: invalid hexadecimal integer at or near "0x"

With patch:
postgres=# select 0xyz;
ERROR: trailing junk after numeric literal at or near "0xyz"

FWIW output of the whole string in the error message doesnt' look nice to
me, but other places of code do this anyway e.g:
select ('1'||repeat('p',1000000))::integer;
This may be worth fixing.

That's interesting. I didn't know we could do that to create a long error
message. At first I thought that it's not a problem for error messages
from the scanner, since its "at or near" string cannot be longer than the
query typed in psql or written in a script file so it won't be enormously
big. But that's just not true, because we can send a generated query.
Something like that:

With patch:
postgres=# select 'select '||'1'||repeat('p',1000000) \gexec
ERROR: trailing junk after numeric literal at or near "1ppp<lots of p>"

And another query that leads to this without patch:
postgres=# select 'select 1'||repeat('@',1000000)||'1' \gexec
ERROR: operator too long at or near "@@@<lots of @>"

It would be nice to prevent such long strings in error messages. Maybe a
GUC variable to set the maximum length for such strings could work. But
how do we determine all places where it is needed?

Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/

Tom Lane

tgl@sss.pgh.pa.us

almost 2 years ago

In reply to: Karina Litskevich (#3)

Re: Invalid "trailing junk" error message when non-English letters are used

Karina Litskevich <litskevichkarina@gmail.com> writes:

I see the two solutions here: either move the rule for decinteger_junk
below the rules for hexinteger_junk, octinteger_junk and bininteger_junk,
or just use a single rule decinteger_junk for all these cases, since the
error message is the same anyway. I implemented the latter in the second
version of the patch, also renamed this common rule to integer_junk.

That seems reasonable, but IMO this code was unacceptably
undercommented before and what you've done has made it worse.
We really need a comment block associated with the flex macros,
perhaps along the lines of

/*
* An identifier immediately following a numeric literal is disallowed
* because in some cases it's ambiguous what is meant: for example,
* 0x1234 could be either a hexinteger or a decinteger "0" and an
* identifier "x1234". We can detect such problems by seeing if
* integer_junk matches a longer substring than any of the XXXinteger
* patterns. (One "junk" pattern is sufficient because this will match
* all the same strings we'd match with {hexinteger}{identifier} etc.)
* Note that the rule for integer_junk must appear after the ones for
* XXXinteger to make this work correctly.
*/

(Hmm, actually, is that last sentence true? My flex is a bit rusty.)

param_junk really needs a similar comment, or maybe we could put
all the XXX_junk macros together and use one comment for all.

Additionally, I noticed that this patch is going to change error messages
in some cases, though I don't think it's a big deal. Example:
Without patch:
postgres=# select 0xyz;
ERROR: invalid hexadecimal integer at or near "0x"
With patch:
postgres=# select 0xyz;
ERROR: trailing junk after numeric literal at or near "0xyz"

That's sort of annoying, but I don't really see a better way,
or at least not one that's worth the effort.

FWIW output of the whole string in the error message doesnt' look nice to
me, but other places of code do this anyway e.g:
select ('1'||repeat('p',1000000))::integer;
This may be worth fixing.

I think this is nonsense: we are already in the habit of repeating the
whole failing query string in the STATEMENT field. In any case it's
not something for this patch to worry about.

regards, tom lane

Karina Litskevich

litskevichkarina@gmail.com

almost 2 years ago

In reply to: Tom Lane (#4)

Re: Invalid "trailing junk" error message when non-English letters are used

On Thu, Sep 5, 2024 at 1:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Karina Litskevich <litskevichkarina@gmail.com> writes:

I see the two solutions here: either move the rule for decinteger_junk
below the rules for hexinteger_junk, octinteger_junk and

bininteger_junk,

or just use a single rule decinteger_junk for all these cases, since the
error message is the same anyway. I implemented the latter in the second
version of the patch, also renamed this common rule to integer_junk.

That seems reasonable, but IMO this code was unacceptably
undercommented before and what you've done has made it worse.
We really need a comment block associated with the flex macros,
perhaps along the lines of

/*
* An identifier immediately following a numeric literal is disallowed
* because in some cases it's ambiguous what is meant: for example,
* 0x1234 could be either a hexinteger or a decinteger "0" and an
* identifier "x1234". We can detect such problems by seeing if
* integer_junk matches a longer substring than any of the XXXinteger
* patterns. (One "junk" pattern is sufficient because this will match
* all the same strings we'd match with {hexinteger}{identifier} etc.)
* Note that the rule for integer_junk must appear after the ones for
* XXXinteger to make this work correctly.
*/

Thank you, this piece of code definitely needs a comment.

(Hmm, actually, is that last sentence true? My flex is a bit rusty.)

Yes, the rule for integer_junk must appear after the ones for XXXinteger.
Here is a quote from
https://ftp.gnu.org/old-gnu/Manuals/flex-2.5.4/html_mono/flex.html
"If it finds more than one match, it takes the one matching the most text
(...). If it finds two or more matches of the same length, the rule listed
first in the flex input file is chosen."
For example, 0x123 is matched by both integer_junk and hexinteger, and we
want the rule for hexinteger to be chosen, so we should place it before
the rule for integer_junk.

param_junk really needs a similar comment, or maybe we could put
all the XXX_junk macros together and use one comment for all.

In v3 of the patch I grouped all the *_junk rules together and included
the suggested comment with a little added something.

Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/

Karina Litskevich

litskevichkarina@gmail.com

almost 2 years ago

In reply to: Karina Litskevich (#5)

Re: Invalid "trailing junk" error message when non-English letters are used

On Thu, Sep 5, 2024 at 6:07 PM Karina Litskevich
<litskevichkarina@gmail.com> wrote:

In v3 of the patch I grouped all the *_junk rules together and included
the suggested comment with a little added something.

Oops, I forgot to attach the patch, here it is.

Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/

Tom Lane

tgl@sss.pgh.pa.us

almost 2 years ago

In reply to: Karina Litskevich (#6)

Re: Invalid "trailing junk" error message when non-English letters are used

Karina Litskevich <litskevichkarina@gmail.com> writes:

On Thu, Sep 5, 2024 at 6:07 PM Karina Litskevich
<litskevichkarina@gmail.com> wrote:

In v3 of the patch I grouped all the *_junk rules together and included
the suggested comment with a little added something.

Oops, I forgot to attach the patch, here it is.

Pushed with a bit of further wordsmithing on the comment.

I left out the proposed new test case "SELECT 1ä;". The trouble
with that is it'd introduce an encoding dependency into the test.
For example, it'd likely fail with some other error message in
a server encoding that lacks an equivalent to UTF8 "ä". While
we have methods for coping with such cases, it requires some
pushups, and I didn't see the value. The changes in existing
test case results are sufficient to show the patch does what
we want.

Also, while the bug exists in v15, the patch didn't apply at all.
I got lazy and just did the minimal s/ident_start/identifier/ change
in that branch, instead of back-patching the cosmetic aspects.

regards, tom lane

Pavel Borisov

pashkin.elfe@gmail.com

almost 2 years ago

In reply to: Tom Lane (#7)

Re: Invalid "trailing junk" error message when non-English letters are used

On Thu, 5 Sept 2024 at 20:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Karina Litskevich <litskevichkarina@gmail.com> writes:

On Thu, Sep 5, 2024 at 6:07 PM Karina Litskevich
<litskevichkarina@gmail.com> wrote:

In v3 of the patch I grouped all the *_junk rules together and included
the suggested comment with a little added something.

Oops, I forgot to attach the patch, here it is.

Pushed with a bit of further wordsmithing on the comment.

I left out the proposed new test case "SELECT 1ä;". The trouble
with that is it'd introduce an encoding dependency into the test.
For example, it'd likely fail with some other error message in
a server encoding that lacks an equivalent to UTF8 "ä". While
we have methods for coping with such cases, it requires some
pushups, and I didn't see the value. The changes in existing
test case results are sufficient to show the patch does what
we want.

Also, while the bug exists in v15, the patch didn't apply at all.
I got lazy and just did the minimal s/ident_start/identifier/ change
in that branch, instead of back-patching the cosmetic aspects.

Good! Thank you!
Pavel

Invalid "trailing junk" error message when non-English letters are used

Attachments:

Attachments:

Attachments: