Lex and things...

Started by Ansley, Michael over 26 years ago · 12 messages
#1Ansley, Michael
Michael.Ansley@intec.co.za

Hi,

Shot, Leon. The patch removes the #define YY_USES_REJECT from scan.c, which
means we now have expandable tokens. Of course, it also removes the
scanning of "embedded minuses", which apparently causes the optimizer to
unoptimize a little. However, the next step is attacking the limit on the
size of string literals. That limit seems to be wired to YY_BUF_SIZE, or
something similar. Is there any reason for this?

MikeA

#2Leon
leon@udmnet.ru
In reply to: Ansley, Michael (#1)
Re: [HACKERS] Lex and things...

Ansley, Michael wrote:

Hi,

Shot, Leon. The patch removes the #define YY_USES_REJECT from scan.c, which
means we now have expandable tokens. Of course, it also removes the
scanning of "embedded minuses", which apparently causes the optimizer to
unoptimize a little.

Oh, no. The unary minus gets to the grammar parser and is recognized there
as such. Then, for numeric constants, it becomes an *embedded* minus in the
function doNegate. So after the parser, a unary minus in a numeric constant
is still an embedded minus, just as it was before the patch. In other words,
I can see no change in the representation of the grammar after patching.
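
To illustrate (a simplified sketch; the node layout and makeUnaryMinus()
are illustrative names of mine, not the actual parser structures in
gram.y):

/* Sketch of the doNegate idea (simplified; the real function lives in
 * the grammar).  Node layout and makeUnaryMinus() are illustrative
 * assumptions, not the actual parser structures. */
#include <stdlib.h>

typedef enum { T_Integer, T_Float, T_UnaryMinus } NodeTag;

typedef struct Node
{
    NodeTag      tag;
    long         ival;   /* used when tag == T_Integer */
    double       fval;   /* used when tag == T_Float */
    struct Node *child;  /* used when tag == T_UnaryMinus */
} Node;

/* Hypothetical constructor for a genuine "-expr" node. */
static Node *
makeUnaryMinus(Node *operand)
{
    Node *n = calloc(1, sizeof(Node));

    n->tag = T_UnaryMinus;
    n->child = operand;
    return n;
}

/* The grammar calls this when it reduces "- a_expr": for numeric
 * constants the minus is folded into ("embedded" in) the constant
 * itself, so nothing is lost by not scanning "-123" as one token. */
static Node *
doNegate(Node *n)
{
    if (n->tag == T_Integer)
    {
        n->ival = -n->ival;
        return n;
    }
    if (n->tag == T_Float)
    {
        n->fval = -n->fval;
        return n;
    }
    return makeUnaryMinus(n);
}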

However, the next step is attacking the limit on the size of string
literals. That limit seems to be wired to YY_BUF_SIZE, or something
similar. Is there any reason for this?

Hmm. There is an effort going on to remove fixed-length limits entirely;
maybe someone is already doing something to the lexer in that respect? If
not, I could look at what can be done there.

--
Leon.

#3Ansley, Michael
Michael.Ansley@intec.co.za
In reply to: Leon (#2)
RE: [HACKERS] Lex and things...

Shot, Leon. The patch removes the #define YY_USES_REJECT from scan.c, which
means we now have expandable tokens. Of course, it also removes the
scanning of "embedded minuses", which apparently causes the optimizer to
unoptimize a little.

Oh, no. The unary minus gets to the grammar parser and is recognized there
as such. Then, for numeric constants, it becomes an *embedded* minus in the
function doNegate. So after the parser, a unary minus in a numeric constant
is still an embedded minus, just as it was before the patch. In other words,
I can see no change in the representation of the grammar after patching.

Great.

However, the next step is attacking the limit on the size of string
literals. That limit seems to be wired to YY_BUF_SIZE, or something
similar. Is there any reason for this?

Hmm. There is an effort going on to remove fixed-length limits entirely;
maybe someone is already doing something to the lexer in that respect? If
not, I could look at what can be done there.

Yes, me. I've removed the query string limit from psql, libpq, and as much
of the backend as I can see. I have done some (very) preliminary testing,
and managed to get a 95kB query to execute. However, the two remaining
problems that I have run into so far are token size (which you have just
removed, many thanks ;-), and string literals, which seem to be limited to
YY_BUF_SIZE (I think).
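
The approach is basically to grow the buffer on demand instead of using a
fixed array. A rough sketch of the idea (the names here are illustrative,
not the actual psql/libpq code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of an expandable query buffer: double the allocation
 * instead of failing at a fixed size.  Illustrative only; the
 * real patch touches psql, libpq and the backend. */
typedef struct
{
    char   *data;
    size_t  len;    /* bytes used, excluding the '\0' */
    size_t  cap;    /* bytes allocated */
} QueryBuf;

static void
qb_append(QueryBuf *qb, const char *s)
{
    size_t slen = strlen(s);

    if (qb->len + slen + 1 > qb->cap)
    {
        size_t newcap = qb->cap ? qb->cap : 1024;

        while (qb->len + slen + 1 > newcap)
            newcap *= 2;
        qb->data = realloc(qb->data, newcap);
        if (qb->data == NULL)
        {
            fprintf(stderr, "out of memory\n");
            exit(1);
        }
        qb->cap = newcap;
    }
    memcpy(qb->data + qb->len, s, slen + 1);
    qb->len += slen;
}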

You see, if I can get the query string limit removed, perhaps someone who
knows a bit more than I do will do something like, hmmm, say, remove the
block size limit from tuple size... hint, hint... anybody...

MikeA


#4Ansley, Michael
Michael.Ansley@intec.co.za
In reply to: Ansley, Michael (#3)
RE: [HACKERS] Lex and things...

Sorry, I forgot to mention in the previous mail: I sent patches to the
patches mailing list (available from the web server), which patch psql,
libpq, and scan.l (except for your patch). They were sent at the beginning
of this month, so maybe get them and see how they work for you.


#5Leon
leon@udmnet.ru
In reply to: Ansley, Michael (#4)
Re: [HACKERS] Lex and things...

Ansley, Michael wrote:

Sorry, I forgot to mention in the previous mail: I sent patches to the
patches mailing list (available from the web server), which patch psql,
libpq, and scan.l (except for your patch). They were sent at the beginning
of this month, so maybe get them and see how they work for you.

Hmm. This is beta-testing? I'm afraid I don't have many resources for it
(time, experience, etc.). What I can do now is make li-i-ittle changes
(improvements, I hope) to the code :)
--
Leon.

#6Leon
leon@udmnet.ru
In reply to: Ansley, Michael (#3)
Re: [HACKERS] Lex and things...

Ansley, Michael wrote:

Hmm. There is an effort going on to remove fixed-length limits entirely;
maybe someone is already doing something to the lexer in that respect? If
not, I could look at what can be done there.

Yes, me. I've removed the query string limit from psql, libpq, and as much
of the backend as I can see. I have done some (very) preliminary testing,
and managed to get a 95kB query to execute. However, the two remaining
problems that I have run into so far are token size (which you have just
removed, many thanks ;-),

I'm afraid not. There is an arbitrary limit (named NAMEDATALEN) in the
lexer. If an identifier exceeds it, it gets a '\0' at that limit, so it is
effectively truncated. Strings are also limited by MAX_PARSE_BUFFER, which
ultimately is something like QUERY_BUF_SIZE = 8k*2.
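
Roughly what the lexer does (a from-memory sketch, not the exact scan.l
code):

/* From-memory sketch, not the exact scan.l code: identifiers
 * longer than NAMEDATALEN - 1 are silently cut off. */
#include <string.h>

#define NAMEDATALEN 32          /* the historical default */

static void
truncate_identifier(char *ident)
{
    if (strlen(ident) >= NAMEDATALEN)
        ident[NAMEDATALEN - 1] = '\0';  /* effectively truncated */
}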

It seems that string literals are the primary target, because that is a
real-life constraint right now; the same is not true of hypothetical huge
identifiers. Should I work on it, or will you do it yourself?

and string literals, which seem to be limited to YY_BUF_SIZE (I think).

--
Leon.

#7Ansley, Michael
Michael.Ansley@intec.co.za
In reply to: Leon (#6)
RE: [HACKERS] Lex and things...

As far as I understand it, the MAX_PARSE_BUFFER limit only applies if char
parsestring[] is used, not if char *parsestring is used. This is the whole
reason for using flex. And scan.l is set up to compile using char
*parsestring, not char parsestring[].
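
To spell out the distinction, a minimal sketch of flex's two token-storage
modes (a made-up rule, not the real scan.l):

%{
/* Minimal sketch of flex's token-storage modes; not the real scan.l.
 * %pointer (the default) makes yytext a char * into a buffer that
 * flex can grow, so token length is not fixed.  %array makes yytext
 * a fixed char[YYLMAX] and caps token size.  REJECT (YY_USES_REJECT)
 * prevents the buffer from growing dynamically, which is why removing
 * it gives us expandable tokens again. */
#include <stdio.h>
%}
%pointer

%%
[A-Za-z_][A-Za-z0-9_]*  { printf("identifier: %s\n", yytext); }
.|\n                    { /* ignore everything else */ }
%%

int yywrap(void) { return 1; }
int main(void) { return yylex(); }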

The NAMEDATALEN limit is imposed by the database structure, and is the
limit on the length of an identifier. Because this is not actual data, I'm
not too concerned with
this at the moment. As long as we can get pretty much unlimited data into
the tuples, I don't care what I have to call my tables, views, procedures,
etc.

Ansley, Michael wrote:

Hmm. There is an effort going on to remove fixed-length limits entirely;
maybe someone is already doing something to the lexer in that respect? If
not, I could look at what can be done there.

Yes, me. I've removed the query string limit from psql, libpq, and as much
of the backend as I can see. I have done some (very) preliminary testing,
and managed to get a 95kB query to execute. However, the two remaining
problems that I have run into so far are token size (which you have just
removed, many thanks ;-),

I'm afraid not. There is an arbitrary limit (named NAMEDATALEN) in the
lexer. If an identifier exceeds it, it gets a '\0' at that limit, so it is
effectively truncated. Strings are also limited by MAX_PARSE_BUFFER, which
ultimately is something like QUERY_BUF_SIZE = 8k*2.

It seems that string literals are the primary target, because that is a
real-life constraint right now; the same is not true of hypothetical huge
identifiers. Should I work on it, or will you do it yourself?

and string literals, which seem to be limited to YY_BUF_SIZE (I think).

--
Leon.

#8Adriaan Joubert
a.joubert@albourne.com
In reply to: Ansley, Michael (#3)
Re: [HACKERS] Lex and things...

I'm afraid not. There is an arbitrary limit (named NAMEDATALEN) in the
lexer. If an identifier exceeds it, it gets a '\0' at that limit, so it is
effectively truncated. Strings are also limited by MAX_PARSE_BUFFER, which
ultimately is something like QUERY_BUF_SIZE = 8k*2.

I think NAMEDATALEN refers to the size of a NAME field in the database,
which is used to store attribute names etc. So you cannot exceed
NAMEDATALEN, or the identifier won't fit into the system tables.
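
From memory, the type is a fixed-size struct along these lines
(paraphrased, not the exact header):

/* Paraphrased from memory, not the exact header: NAME is a
 * fixed-size type, so an identifier stored in a system table
 * (e.g. pg_class.relname) can never exceed NAMEDATALEN - 1 chars. */
#define NAMEDATALEN 32          /* the historical default */

typedef struct nameData
{
    char data[NAMEDATALEN];
} NameData;

typedef NameData *Name;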

Adriaan

#9Leon
leon@udmnet.ru
In reply to: Ansley, Michael (#7)
Re: [HACKERS] Lex and things...

Ansley, Michael wrote:

As far as I understand it, the MAX_PARSE_BUFFER limit only applies if char
parsestring[] is used, not if char *parsestring is used. This is the whole
reason for using flex. And scan.l is set up to compile using char
*parsestring, not char parsestring[].

Here is what is defined explicitly:

#ifdef YY_READ_BUF_SIZE
#undef YY_READ_BUF_SIZE
#endif
#define YY_READ_BUF_SIZE MAX_PARSE_BUFFER

(these lines are repeated twice :)

...
char literal[MAX_PARSE_BUFFER];

...
<xq>{xqliteral} {
        if ((llen + yyleng) > (MAX_PARSE_BUFFER - 1))
            elog(ERROR, "quoted string parse buffer of %d chars exceeded",
                 MAX_PARSE_BUFFER);
        memcpy(literal + llen, yytext, yyleng + 1);
        llen += yyleng;
    }

Seems that limits are everywhere ;)

--
Leon.

#10Leon
leon@udmnet.ru
In reply to: Ansley, Michael (#3)
Re: [HACKERS] Lex and things...

Adriaan Joubert wrote:

I think NAMEDATALEN refers to the size of a NAME field in the database,
which is used to store attribute names etc. So you cannot exceed
NAMEDATALEN, or the identifier won't fit into the system tables.

Ok. Let's leave identifiers alone.

--
Leon.

#11Ansley, Michael
Michael.Ansley@intec.co.za
In reply to: Leon (#10)
RE: [HACKERS] Lex and things...

Yes, I'll go with that.


#12Ansley, Michael
Michael.Ansley@intec.co.za
In reply to: Ansley, Michael (#11)
RE: [HACKERS] Lex and things...

Ansley, Michael wrote:

As far as I understand it, the MAX_PARSE_BUFFER limit only applies if char
parsestring[] is used, not if char *parsestring is used. This is the whole
reason for using flex. And scan.l is set up to compile using char
*parsestring, not char parsestring[].

Here is what is defined explicitly:

#ifdef YY_READ_BUF_SIZE
#undef YY_READ_BUF_SIZE
#endif
#define YY_READ_BUF_SIZE MAX_PARSE_BUFFER

(these lines are repeated twice :)

I noticed that, but hey, who am I to argue.

...
char literal[MAX_PARSE_BUFFER];

...
<xq>{xqliteral} {
        if ((llen + yyleng) > (MAX_PARSE_BUFFER - 1))
            elog(ERROR, "quoted string parse buffer of %d chars exceeded",
                 MAX_PARSE_BUFFER);
        memcpy(literal + llen, yytext, yyleng + 1);
        llen += yyleng;
    }

Seems that limits are everywhere ;)

--
Leon.

I think we can turn literal into a char *, if we change the code for
<xq>{xqliteral}. This doesn't look like it will be too much of a mission,
but the outer limit is going to be close to the block size, because tuples
can't expand past the end of a block. I think that it would be wise to
leave this limit in place until such time as the tuple size limit is fixed.
Then we can remove it.
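
For when that day comes, a rough sketch of the kind of change I have in
mind (illustrative only: addlit() is a hypothetical helper, not current
source):

/* Rough sketch, illustrative only: addlit() is a hypothetical
 * helper, not current source.  The fixed array
 *     char literal[MAX_PARSE_BUFFER];
 * becomes a pointer that is grown on demand. */
#include <stdlib.h>
#include <string.h>

static char   *literal = NULL;
static size_t  litalloc = 0;    /* bytes allocated */
static size_t  llen = 0;        /* bytes used */

static void
addlit(const char *text, size_t len)
{
    if (llen + len + 1 > litalloc)
    {
        litalloc = litalloc ? litalloc : 1024;
        while (llen + len + 1 > litalloc)
            litalloc *= 2;
        literal = realloc(literal, litalloc);
        if (literal == NULL)
            abort();            /* the real code would elog(ERROR) */
    }
    memcpy(literal + llen, text, len);
    llen += len;
    literal[llen] = '\0';
}

The <xq>{xqliteral} action would then shrink to a single
addlit(yytext, yyleng) call, with no fixed limit to check.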

So, for the moment, I think we can consider the job pretty much done, apart
from bug-fixes. We can revisit the MAX_PARSE_BUFFER limit when tuple size
is delinked from block size. My aim with this work was to remove the
general limit on the length of a query string, and that has basically been
achieved. We have, as a result of the work, come across other limits, but
those have dependencies, and will have to wait.

Cheers...

MikeA