A really subtle lexer bug

Started by Andrew Gierthalmost 8 years ago8 messageshackers

andrew@tao11.riddles.org.uk

almost 8 years ago

Currently in PG, the precedence of = and <> is supposed to be equal, and
the precedence of unary - is very high.

So, (true=1 between 1 and 1) parses as ((true)=(1 between 1 and 1)),
and (true=-1 between 1 and 1) parses as ((true)=((-1) between 1 and 1)).
All good so far.

(true<>-1 between 1 and 1) parses as ((true<>(-1)) between 1 and 1). ???

The fault here is in the lexer. The input "<>-" is being lexed as an Op
followed by '-' rather than as NOT_EQUAL followed by '-' because it
looks like a match for a multi-character operator, with the - being
thrown back into the input afterwards. So the precedence of <> gets
inflated to that of Op, which is higher than BETWEEN.

More seriously, this also breaks named arguments:

create function f(a integer) returns integer language sql
as $$ select a; $$;

select f(a => 1); -- works
select f(a => -1); -- works
select f(a =>-1); -- ERROR: column "a" does not exist

I guess the fix is to extend the existing special case code that checks
for one character left after removing trailing [+-] and also check for
the two-character ops "<>" ">=" "<=" "=>" "!=".

--
Andrew (irc:RhodiumToad)

Andrew Gierth

andrew@tao11.riddles.org.uk

almost 8 years ago

In reply to: Andrew Gierth (#1)

Re: A really subtle lexer bug

"Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

Andrew> select f(a =>-1); -- ERROR: column "a" does not exist

Andrew> I guess the fix is to extend the existing special case code
Andrew> that checks for one character left after removing trailing [+-]
Andrew> and also check for the two-character ops "<>" ">=" "<=" "=>"
Andrew> "!=".

Patch attached.

This fixes two bugs: first the mis-lexing of two-char ops as mentioned
originally; second, the O(N^3) lexing time of strings of - or +
characters is reduced to O(N^2) (in practice it's better than O(N^2)
once N gets large because the bison stack gets blown out, ending the
loop early).

--
Andrew (irc:RhodiumToad)

Andrew Gierth

andrew@tao11.riddles.org.uk

almost 8 years ago

In reply to: Andrew Gierth (#2)

Re: A really subtle lexer bug

"Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

Andrew> Patch attached.

Andrew> This fixes two bugs:

I'm going to split this into two patches, since the O(N^3) fix can be
backpatched further than the operator token fix; the latter bug was
introduced in 9.5 when operator precedences were corrected, while the
former has been there forever.

(But don't wait for me to post those before commenting on either issue)

--
Andrew (irc:RhodiumToad)

Tom Lane

tgl@sss.pgh.pa.us

almost 8 years ago

In reply to: Andrew Gierth (#2)

Re: A really subtle lexer bug

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

"Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
Andrew> I guess the fix is to extend the existing special case code
Andrew> that checks for one character left after removing trailing [+-]
Andrew> and also check for the two-character ops "<>" ">=" "<=" "=>"
Andrew> "!=".

Patch attached.
This fixes two bugs: first the mis-lexing of two-char ops as mentioned
originally; second, the O(N^3) lexing time of strings of - or +
characters is reduced to O(N^2) (in practice it's better than O(N^2)
once N gets large because the bison stack gets blown out, ending the
loop early).

Looks reasonable offhand (didn't test). A couple of thoughts:

* Some regression tests exercising these code paths might be a good thing.

* There should likely be a comment near where EQUALS_GREATER and
friends are defined, pointing out that if we add any more multi-character
operators with special precedences, this code has to be taught about them.

regards, tom lane

Andrew Gierth

andrew@tao11.riddles.org.uk

almost 8 years ago

In reply to: Tom Lane (#4)

Re: A really subtle lexer bug

"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Patch attached.
This fixes two bugs: first the mis-lexing of two-char ops as mentioned
originally; second, the O(N^3) lexing time of strings of - or +
characters is reduced to O(N^2) (in practice it's better than O(N^2)
once N gets large because the bison stack gets blown out, ending the
loop early).

Tom> Looks reasonable offhand (didn't test). A couple of thoughts:

Tom> * Some regression tests exercising these code paths might be a
Tom> good thing.

Agreed. Any preferences where they should go?

Tom> * There should likely be a comment near where EQUALS_GREATER and
Tom> friends are defined, pointing out that if we add any more
Tom> multi-character operators with special precedences, this code has
Tom> to be taught about them.

Agreed; will do this.

--
Andrew (irc:RhodiumToad)

Tom Lane

tgl@sss.pgh.pa.us

almost 8 years ago

In reply to: Andrew Gierth (#5)

Re: A really subtle lexer bug

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> * Some regression tests exercising these code paths might be a
Tom> good thing.

Agreed. Any preferences where they should go?

There's not really a "lexer/grammar" test script. Maybe you could
drop it into create_operator.sql, especially if you need to make any
funny operators to test with?

regards, tom lane

Andrew Gierth

andrew@tao11.riddles.org.uk

almost 8 years ago

In reply to: Tom Lane (#6)

Re: A really subtle lexer bug

Here's what I will push unless there's something important I missed.

I split the regression tests between create_operator.sql and
polymorphism.sql, because => is really "named argument syntax" rather
than an operator as such, and polymorphism.sql is where the existing
tests were for that.

I tried to keep the three lexers as closely matched as possible, even
though psqlscan.l doesn't actually need some of the changes.

--
Andrew (irc:RhodiumToad)

Tom Lane

tgl@sss.pgh.pa.us

almost 8 years ago

In reply to: Andrew Gierth (#7)

Re: A really subtle lexer bug

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

Here's what I will push unless there's something important I missed.

Stylistic nitpick: I don't especially like "continue" as the body of
a for-loop. How about instead of this:

for (nchars--;
nchars > 1 &&
(yytext[nchars - 1] == '+' ||
yytext[nchars - 1] == '-');
nchars--)
continue;

do this:

do {
nchars--;
} while (nchars > 1 &&
(yytext[nchars - 1] == '+' ||
yytext[nchars - 1] == '-'));

That's a clearer separation between loop action and loop test, and
it makes it more obvious that you're relying on the loop condition
to be true at the start.

Also, I'm not entirely convinced that replacing the strchr() with
a handmade equivalent is a good idea. Some versions of strchr()
are pretty quick.

No objections beyond that.

regards, tom lane

A really subtle lexer bug

Attachments:

Attachments: