Postgres' lexer

Started by Leon, over 26 years ago (16 messages)
#1 Leon
leon@udmnet.ru

Hi!

I'm currently fooling around with Postgres's parser, and I must admit
some things puzzle me completely. Please tell me what these things in
the lexer stand for:

{operator}/-[\.0-9] {
    yylval.str = pstrdup((char*)yytext);
    return Op;
}
Is it an operator followed by a mandatory '-' and (dot or digit)?

And what does this stand for:

{identifier}/{space}*-{number}

What's the meaning of all these?

--
Leon.
---------
"This may seem a bit weird, but that's okay, because it is weird." -
Perl manpage.

#2 Ansley, Michael
Michael.Ansley@intec.co.za
In reply to: Leon (#1)
RE: [HACKERS] Postgres' lexer

> Hi!

Hi, Leon

> I'm currently fooling around with Postgres's parser, and I must admit
> some things puzzle me completely. Please tell me what these things in
> the lexer stand for:
>
> {operator}/-[\.0-9] {
>     yylval.str = pstrdup((char*)yytext);
>     return Op;
> }
> Is it an operator followed by a mandatory '-' and (dot or digit)?

I think this is used to recognize an operator followed by a minus or any
single character (the period is escaped, the character can be used to denote
the base of the number) or a single digit.
But check this, I'm not totally sure.

> And what does this stand for:
>
> {identifier}/{space}*-{number}

An identifier followed by any number of spaces, and then a minus, or a
number. Again, double-check this with a reference of some sort.

> What's the meaning of all these?

You really should get a reference that deals with regular expressions. My
understanding is (anybody feel free to comment here) that flex uses normal
regular expressions to generate scanners.

Cheers...

MikeA

#3 Ansley, Michael
Michael.Ansley@intec.co.za
In reply to: Ansley, Michael (#2)
RE: [HACKERS] Postgres' lexer

>> Is it an operator followed by a mandatory '-' and (dot or digit)?
>
> I think this is used to recognize an operator followed by a minus or any
> single character (the period is escaped, the character can be used to
> denote the base of the number) or a single digit.

Sorry, make that an operator followed by a minus AND then any single
character or a single digit.
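
A note on the notation, as far as I read the flex manual: r/s is a
trailing-context pattern. It matches r only when s follows, and the text
matched by s stays in the input for the next rule. Inside the character
class, \. is a literal dot rather than "any character", so Leon's first
reading (an operator, then a mandatory '-', then a dot or a digit) looks
like the right one. The C sketch below imitates that one rule by hand.
The function names are made up for illustration and are not part of
scan.l; the operator characters are copied from the op_and_self
definition in that file.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Characters listed in scan.l's op_and_self class. */
static int is_op_char(char c)
{
    return c != '\0' && strchr("~!@#^&|`?$:+-*/%<>=", c) != NULL;
}

/*
 * Hand-written imitation of the rule {operator}/-[\.0-9].
 * Returns the length of the operator token, or 0 if the rule does not
 * apply. The '-' and the dot or digit are inspected but not consumed,
 * which is what "trailing context" means.
 */
size_t match_op_before_negative(const char *s)
{
    size_t run = 0;

    assert(s != NULL);
    while (is_op_char(s[run]))
        run++;

    /* At least one operator character, then '-', must be in the run. */
    if (run < 2 || s[run - 1] != '-')
        return 0;
    /* The '-' must be followed by a dot or a digit. */
    if (s[run] != '.' && !isdigit((unsigned char)s[run]))
        return 0;

    return run - 1;   /* everything before the '-' is the operator */
}
```

For the input "<-2" the function reports an operator of length 1 ("<")
and leaves "-2" unread. For "<2" or "-2" it reports no match, so an
ordinary rule would have to pick those up.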

MikeA

#4 Leon
leon@udmnet.ru
In reply to: Ansley, Michael (#2)
Re: [HACKERS] Postgres' lexer

Ansley, Michael wrote:
...

>> And what does this stand for:
>>
>> {identifier}/{space}*-{number}
>
> An identifier followed by any number of spaces, and then a minus, or a
> number. Again, double-check this with a reference of some sort.

Well, I studied the flex manpage from top to bottom, and almost everything
in Postgres's lexer makes sense. But these "followed by spaces and a
queer minused number" patterns do not. Can someone tell me what these
minus-prefixed single-digit numbers stand for?

--
Leon.
---------
"This may seem a bit weird, but that's okay, because it is weird." -
Perl manpage.

#5 Ansley, Michael
Michael.Ansley@intec.co.za
In reply to: Leon (#4)
RE: [HACKERS] Postgres' lexer

Leon, I see that you have been running into the vltc (variable-length
trailing context) problem ;-) I just ran flex -p and went to line 314.

MikeA

#6 Ansley, Michael
Michael.Ansley@intec.co.za
In reply to: Ansley, Michael (#5)
RE: [HACKERS] Postgres' lexer

I figured it out. The first thing the code does is set the state to xm
(BEGIN (xm)). If you look in the comments at the top, Tom Lane put these in
to deal with numeric strings with embedded minuses. Tom, can you give us a
run-down of the problem that required this stuff? Perhaps if we
can find another way around it, we can reduce the vltc's.

Thanks...

MikeA

#7 Ansley, Michael
Michael.Ansley@intec.co.za
In reply to: Ansley, Michael (#6)
RE: [HACKERS] Postgres' lexer

Sorry, Tom, I saw the tgl initials, and assumed it was you, before realising
that there are a couple of people who could be identified by those initials.
tgl, please stand up ;-)

MikeA

#8 Leon
leon@udmnet.ru
In reply to: Ansley, Michael (#5)
Re: [HACKERS] Postgres' lexer

Ansley, Michael wrote:

> Leon, I see that you have been running into the vltc problem ;-) I just ran
> flex -p and went to line 314.

I got it. It is done to prevent the minus from sticking to the number in
expressions like 'a -2'. Dirty, but it works.
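
To make "sticking" concrete, here is a toy scanner in C. It is not taken
from scan.l; toy_scan and its one-letter token codes are assumptions made
for this sketch. With sticky_minus set it mimics a number pattern of the
form [\-]?{digit}+ with no lookahead rules, so 'a -2' comes out as
identifier, number and the parser never sees a minus operator. With it
cleared, the minus is always a token of its own.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Toy scanner: writes one letter per token into out ('I' identifier,
 * 'N' number, '-' minus), NUL-terminates it, and returns the token count.
 * If sticky_minus is nonzero, a '-' directly before a digit is swallowed
 * into the number token.
 */
size_t toy_scan(const char *s, int sticky_minus, char *out, size_t cap)
{
    size_t n = 0;

    assert(s != NULL && out != NULL && cap > 0);
    while (*s != '\0' && n + 1 < cap) {
        if (isspace((unsigned char)*s)) {
            s++;                          /* skip whitespace */
        } else if (isalpha((unsigned char)*s)) {
            while (isalnum((unsigned char)*s))
                s++;
            out[n++] = 'I';
        } else if (isdigit((unsigned char)*s) ||
                   (sticky_minus && *s == '-' &&
                    isdigit((unsigned char)s[1]))) {
            s++;                          /* first digit, or the swallowed '-' */
            while (isdigit((unsigned char)*s))
                s++;
            out[n++] = 'N';
        } else if (*s == '-') {
            s++;
            out[n++] = '-';
        } else {
            break;                        /* anything else is outside this sketch */
        }
    }
    out[n] = '\0';
    return n;
}
```

Scanning "a -2" with sticky_minus set yields two tokens, with it cleared
three. The lookahead rules in scan.l appear to exist to get the
three-token result while still allowing a leading minus in numbers.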

--
Leon.
---------
"This may seem a bit weird, but that's okay, because it is weird." -
Perl manpage.

#9 Ansley, Michael
Michael.Ansley@intec.co.za
In reply to: Leon (#8)
RE: [HACKERS] Postgres' lexer

>> Leon, I see that you have been running into the vltc problem ;-) I just ran
>> flex -p and went to line 314.
>
> I got it. It is done to prevent the minus from sticking to the number in
> expressions like 'a -2'. Dirty, but it works.

Dirty, but it also breaks the scanner.

MikeA

#10 Ansley, Michael
Michael.Ansley@intec.co.za
In reply to: Ansley, Michael (#9)
RE: [HACKERS] Postgres' lexer

Leon, if you manage to find a replacement for this, please let me know.
I'll probably only pick it up after the weekend.

I think that we need to find another way to tokenise the minus. First of
all, though, how is the parser supposed to tell whether this:
    a -2
means this:
    (a - 2)
or this:
    a (-2)

i.e., does the unary - operator take precedence over the binary - operator
or not? Is there even a difference? If the parser runs into 'a -2',
perhaps we could replace it with 'a + (-2)' instead.

How does a C compiler tokenize this? Or some other standard SQL parser?

MikeA

#11 Brook Milligan
brook@biology.nmsu.edu
In reply to: Ansley, Michael (#10)
Re: [HACKERS] Postgres' lexer

> I think that we need to find another way to tokenise the minus. First of
> all, though, how is the parser supposed to tell whether this:
>     a -2
> means this:
>     (a - 2)
> or this:
>     a (-2)
>
> i.e., does the unary - operator take precedence over the binary - operator
> or not? Is there even a difference? If the parser runs into 'a -2',
> perhaps we could replace it with 'a + (-2)' instead.
>
> How does a C compiler tokenize this? Or some other standard SQL parser?

For the C compiler a -2 can only mean (a - 2); a (-2) must explicitly
be a function call and isn't generated by the compiler from a -2.
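
A minimal C illustration of that point (the function names are invented
for the example): the compiler's scanner hands '-' to the parser as its
own token, and an identifier followed by '-' and a constant is reduced as
binary subtraction, whatever the spacing.

```c
#include <assert.h>

/* 'a -2' is scanned as identifier, '-', constant: plain subtraction. */
int minus_two(int a)
{
    return a -2;   /* same as a - 2 */
}

/* Small self-check showing the result for a sample value. */
void check_minus_two(void)
{
    assert(minus_two(5) == 3);
    assert(minus_two(5) == 5 - 2);
}
```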

I think the question for SQL is, does the language allow an ambiguity
here? If not, wouldn't it be much smarter to keep the minus sign as
its own token and deal with the semantics in the parser?

Cheers,
Brook

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Brook Milligan (#11)
Re: [HACKERS] Postgres' lexer

"Ansley, Michael" <Michael.Ansley@intec.co.za> writes:

> Sorry, Tom, I saw the tgl initials, and assumed it was you, before realising
> that there are a couple of people who could be identified by those initials.

All of those are Lockhart. I recall having done something with the
string-constant lexing, but I have no idea what this <xm> is all about.

regards, tom lane

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#12)
Re: [HACKERS] Postgres' lexer

Brook Milligan <brook@biology.nmsu.edu> writes:

> I think the question for SQL is, does the language allow an ambiguity
> here? If not, wouldn't it be much smarter to keep the minus sign as
> its own token and deal with the semantics in the parser?

I don't see a good reason to tokenize the '-' as part of the number
either. I think that someone may have hacked the lexer to try to
merge unary minus into numeric constants, so that in an expression like
WHERE field < -2
the -2 would be treated as a constant rather than an expression
involving application of unary minus --- which is important because
the optimizer is too dumb to optimize the query if it looks like an
expression.

However, trying to make that happen at lex time is just silly.
The lexer doesn't have enough context to handle all the cases
anyway. We've currently got code in the grammar to do the same
reduction. (IMHO that's still too early, and it ought to be done
post-rewriter as part of a general-purpose constant expression
reducer; will get around to that someday ;-).)

So it seems to me that we should just rip *all* this cruft out of the
lexer, and always return '-' as a separate token, never as part of
a number. (*) Then we wouldn't need this lookahead feature.

But it'd be good to get an opinion from the other tgl first ;-).
I'm just a kibitzer when it comes to the lex/yacc stuff.

regards, tom lane

(*) not counting float exponents, eg "1.234e-56" of course.
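
A hedged sketch of what the footnote implies for the scanner: a numeric
token never starts with '-', but a sign directly after the exponent
marker still belongs to the number. number_len is a hypothetical helper
written for this illustration, not code from PostgreSQL; it follows the
shape of scan.l's integer, decimal and real definitions with the leading
[\-]? dropped.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Returns how many characters at the start of s form an unsigned numeric
 * literal (digits, optional fraction, optional exponent), or 0 if s does
 * not start with one. A leading '-' is never part of the literal.
 */
size_t number_len(const char *s)
{
    size_t i = 0, int_digits = 0, frac_digits = 0;

    assert(s != NULL);
    while (isdigit((unsigned char)s[i])) {
        i++;
        int_digits++;
    }
    if (s[i] == '.') {
        size_t j = i + 1;
        while (isdigit((unsigned char)s[j])) {
            j++;
            frac_digits++;
        }
        if (int_digits + frac_digits > 0)
            i = j;                 /* accepts "1.", ".5" and "1.5" */
    }
    if (int_digits + frac_digits == 0)
        return 0;                  /* no digits at all: not a number */

    if (s[i] == 'e' || s[i] == 'E') {
        size_t j = i + 1;
        if (s[j] == '+' || s[j] == '-')
            j++;                   /* the sign is part of the exponent */
        if (isdigit((unsigned char)s[j])) {
            while (isdigit((unsigned char)s[j]))
                j++;
            i = j;                 /* only take the exponent if it has digits */
        }
    }
    return i;
}
```

With this shape, "1.234e-56" is one nine-character token, while "-2"
yields no number at all at the minus and "2-1" stops after the first
digit, leaving the '-' for the operator rules.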

#14 Leon
leon@udmnet.ru
In reply to: Ansley, Michael (#10)
1 attachment(s)
Re: [HACKERS] Postgres' lexer

Ansley, Michael wrote:

> Leon, if you manage to find a replacement for this, please let me know.
> I'll probably only pick it up after the weekend.
>
> I think that we need to find another way to tokenise the minus. First of
> all, though, how is the parser supposed to tell whether this:
>     a -2
> means this:
>     (a - 2)
> or this:
>     a (-2)

I think that the current behavior is ok - it is what we would expect
from expressions like 'a -2'.

I have produced a patch to clean up the code. It works because
unary minus gets processed in doNegate() in the parser anyway,
and it is in no way the lexer's job to do grammatical parsing, i.e.
deciding whether an operator is to be treated as binary or unary.
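
The idea can be sketched in a few lines of C. This is not the real
doNegate(), whose code is not shown in this thread; the Node type and
fold_unary_minus below are hypothetical stand-ins. The point is only
that the grammar action for unary minus can fold the sign into a numeric
constant, so 'field < -2' still reaches the optimizer as a comparison
against a constant even though the lexer returned '-' and '2' separately.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parse-tree node, reduced to what the sketch needs. */
typedef enum { NODE_ICONST, NODE_FCONST, NODE_OTHER } NodeKind;

typedef struct {
    NodeKind kind;
    long     ival;   /* valid when kind == NODE_ICONST */
    double   dval;   /* valid when kind == NODE_FCONST */
} Node;

/*
 * Called by the (imagined) grammar action for unary minus.
 * Returns 1 if the operand was a constant and has been negated in place,
 * 0 if the caller must build a real unary-minus expression instead.
 * The lexer only produces non-negative constants, so negating ival
 * cannot overflow here.
 */
int fold_unary_minus(Node *operand)
{
    assert(operand != NULL);
    switch (operand->kind) {
    case NODE_ICONST:
        operand->ival = -operand->ival;
        return 1;
    case NODE_FCONST:
        operand->dval = -operand->dval;
        return 1;
    default:
        return 0;
    }
}
```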

I ran the regression tests, and everything seems to be OK. It is my first
diff/patch experience in *NIX, so take it with mercy :) But it
seems to be correct. It is to be applied against 6.5.0 (I have
not upgraded to 6.5.1 yet, but hope the lexer hasn't changed since
then.) The patch mainly contains nuked code. The only thing added
is my short comment :)

Have I done the right thing? :)

--
Leon.
---------
"This may seem a bit weird, but that's okay, because it is weird." -
Perl manpage.

Attachments:

patch (application/octet-stream; name=patch)
--- scan.ll	Fri Aug 20 21:18:26 1999
+++ scan.l	Fri Aug 20 23:04:53 1999
@@ -84,7 +84,6 @@
  *  <xc> extended C-style comments - tgl 1997-07-12
  *  <xd> delimited identifiers (double-quoted identifiers) - tgl 1997-10-27
  *  <xh> hexadecimal numeric string - thomas 1997-11-16
- *  <xm> numeric strings with embedded minus sign - tgl 1997-09-05
  *  <xq> quoted strings - tgl 1997-07-30
  *
  * The "extended comment" syntax closely resembles allowable operator syntax.
@@ -98,7 +97,6 @@
 %x xc
 %x xd
 %x xh
-%x xm
 %x xq
 
 /* Binary number
@@ -147,7 +145,6 @@
 xcstar			[^/]
 
 digit			[0-9]
-number			[-+.0-9Ee]
 letter			[\200-\377_A-Za-z]
 letter_or_digit	[\200-\377_A-Za-z0-9]
 
@@ -159,13 +156,16 @@
 op_and_self		[\~\!\@\#\^\&\|\`\?\$\:\+\-\*\/\%\<\>\=]
 operator		{op_and_self}+
 
-xmstop			-
+/* we do not allow unary minus in numbers. 
+ * instead we pass it verbatim to parser. there it gets
+ * coerced via doNegate() -- Leon aug 20 1999 
+ */
 
-integer			[\-]?{digit}+
-decimal			[\-]?(({digit}*\.{digit}+)|({digit}+\.{digit}*))
-real			[\-]?((({digit}*\.{digit}+)|({digit}+\.{digit}*)|({digit}+))([Ee][-+]?{digit}+))
+integer			{digit}+
+decimal			(({digit}*\.{digit}+)|({digit}+\.{digit}*))
+real				((({digit}*\.{digit}+)|({digit}+\.{digit}*)|({digit}+))([Ee][-+]?{digit}+))
 /*
-real			[\-]?(((({digit}*\.{digit}+)|({digit}+\.{digit}*))([Ee][-+]?{digit}+)?)|({digit}+[Ee][-+]?{digit}+))
+real				(((({digit}*\.{digit}+)|({digit}+\.{digit}*))([Ee][-+]?{digit}+)?)|({digit}+[Ee][-+]?{digit}+))
 */
 
 param			\${integer}
@@ -281,26 +281,10 @@
 					llen += yyleng;
 				}
 
-
-<xm>{space}*	{ /* ignore */ }
-<xm>{xmstop}	{
-					BEGIN(INITIAL);
-					return yytext[0];
-				}
-
-
 {typecast}		{ return TYPECAST; }
 
-{self}/{space}*-[\.0-9]	{
-					BEGIN(xm);
-					return yytext[0];
-				}
-{self}			{ 	return yytext[0]; }
 {self}			{ 	return yytext[0]; }
-{operator}/-[\.0-9]	{
-					yylval.str = pstrdup((char*)yytext);
-					return Op;
-				}
+
 {operator}		{
 					if (strcmp((char*)yytext,"!=") == 0)
 						yylval.str = pstrdup("<>"); /* compatability */
@@ -314,76 +298,6 @@
 				}
 
 
-{identifier}/{space}*-{number}	{
-					int i;
-					ScanKeyword		*keyword;
-					BEGIN(xm);
-					for(i = 0; yytext[i]; i++)
-						if (isascii((unsigned char)yytext[i]) &&
-							isupper(yytext[i]))
-							yytext[i] = tolower(yytext[i]);
-					if (i >= NAMEDATALEN)
-						yytext[NAMEDATALEN-1] = '\0';
-
-					keyword = ScanKeywordLookup((char*)yytext);
-					if (keyword != NULL) {
-						return keyword->value;
-					}
-					else
-					{
-						yylval.str = pstrdup((char*)yytext);
-						return IDENT;
-					}
-				}
-{integer}/{space}*-{number}	{
-					char* endptr;
-
-					BEGIN(xm);
-					errno = 0;
-					yylval.ival = strtol((char *)yytext,&endptr,10);
-					if (*endptr != '\0' || errno == ERANGE)
-					{
-						errno = 0;
-#if 0
-						yylval.dval = strtod(((char *)yytext),&endptr);
-						if (*endptr != '\0' || errno == ERANGE)
-							elog(ERROR,"Bad integer input '%s'",yytext);
-						CheckFloat8Val(yylval.dval);
-						elog(NOTICE,"Integer input '%s' is out of range; promoted to float", yytext);
-						return FCONST;
-#endif
-						yylval.str = pstrdup((char*)yytext);
-						return SCONST;
-					}
-					return ICONST;
-				}
-{decimal}/{space}*-{number} {
-					char* endptr;
-
-					BEGIN(xm);
-					if (strlen((char *)yytext) <= 17)
-					{
-						errno = 0;
-						yylval.dval = strtod(((char *)yytext),&endptr);
-						if (*endptr != '\0' || errno == ERANGE)
-							elog(ERROR,"Bad float8 input '%s'",yytext);
-						CheckFloat8Val(yylval.dval);
-						return FCONST;
-					}
-					yylval.str = pstrdup((char*)yytext);
-					return SCONST;
-				}
-{real}/{space}*-{number} {
-					char* endptr;
-
-					BEGIN(xm);
-					errno = 0;
-					yylval.dval = strtod(((char *)yytext),&endptr);
-					if (*endptr != '\0' || errno == ERANGE)
-						elog(ERROR,"Bad float8 input '%s'",yytext);
-					CheckFloat8Val(yylval.dval);
-					return FCONST;
-				}
 {integer}		{
 					char* endptr;
 
#15 Leon
leon@udmnet.ru
In reply to: Tom Lane (#12)
Re: [HACKERS] Postgres' lexer

Tom Lane wrote:

> "Ansley, Michael" <Michael.Ansley@intec.co.za> writes:
>
>> Sorry, Tom, I saw the tgl initials, and assumed it was you, before realising
>> that there are a couple of people who could be identified by those initials.
>
> All of those are Lockhart. I recall having done something with the
> string-constant lexing, but I have no idea what this <xm> is all about.

BTW, one more stu-u-u-upid question: why does unary minus need high
precedence? It seems that everything works well without any specified
precedence for uminus ;) - it is only a remark.

--
Leon.
---------
"This may seem a bit weird, but that's okay, because it is weird." -
Perl manpage.

#16 Ansley, Michael
Michael.Ansley@intec.co.za
In reply to: Leon (#15)
RE: [HACKERS] Postgres' lexer

I've just grabbed it now, I'll get back to you Monday.
