Replacing plpgsql's lexer
Whichever way the current discussion about Unicode literals turns out,
it's clear that plpgsql is not up to speed on matching the core lexer's
behavior --- it's wrong anyway with respect to
standard_conforming_strings.
I had earlier speculated semi-facetiously about ripping out the plpgsql
lexer altogether, but the more I think about it the less silly the idea
looks. Suppose that we change the core lexer so that the keyword lookup
table it's supposed to use is passed to scanner_init() rather than being
hard-wired in. Then make plpgsql call the core lexer using its own
keyword table. Everything else would match core lexical behavior
automatically. The special behavior that we do want, such as being
able to construct a string representing a desired subrange of the input,
could all be handled in plpgsql-specific wrapper code.
I've just spent a few minutes looking for trouble spots in this theory,
and so far the only real ugliness I can see is that plpgsql treats
":=" and ".." as single tokens whereas the core would parse them as two
tokens. We could hack the core lexer to have an additional switch that
controls that. Or maybe just make it always return them as single
tokens --- AFAICS, neither combination is legal in core SQL anyway,
so this would only result in a small change in the exact syntax error
you get if you write such a thing in core SQL.
Another trouble spot is the #option syntax, but that could be handled
by a special-purpose prescan, or just dropped altogether; it's not like
we've ever used that for anything but debugging.
It looks like this might take about a day's worth of work (IOW two
or three days real time) to get done.
Normally I'd only consider doing such a thing during development phase,
but since we're staring at at least one and maybe two bugs that are
going to be hard to fix in any materially-less-intrusive way, I'm
thinking about doing it now. Theoretically this change shouldn't break
any working code, so letting it hit the streets in 8.4beta2 doesn't seem
totally unreasonable.
Comments, objections, better ideas?
regards, tom lane
On Tue, Apr 14, 2009 at 4:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Whichever way the current discussion about Unicode literals turns out,
it's clear that plpgsql is not up to speed on matching the core lexer's
behavior --- it's wrong anyway with respect to
standard_conforming_strings.I had earlier speculated semi-facetiously about ripping out the plpgsql
lexer altogether, but the more I think about it the less silly the idea
looks. Suppose that we change the core lexer so that the keyword lookup
table it's supposed to use is passed to scanner_init() rather than being
hard-wired in. Then make plpgsql call the core lexer using its own
keyword table. Everything else would match core lexical behavior
automatically. The special behavior that we do want, such as being
able to construct a string representing a desired subrange of the input,
could all be handled in plpgsql-specific wrapper code.I've just spent a few minutes looking for trouble spots in this theory,
and so far the only real ugliness I can see is that plpgsql treats
":=" and ".." as single tokens whereas the core would parse them as two
tokens. We could hack the core lexer to have an additional switch that
controls that. Or maybe just make it always return them as single
tokens --- AFAICS, neither combination is legal in core SQL anyway,
so this would only result in a small change in the exact syntax error
you get if you write such a thing in core SQL.Another trouble spot is the #option syntax, but that could be handled
by a special-purpose prescan, or just dropped altogether; it's not like
we've ever used that for anything but debugging.It looks like this might take about a day's worth of work (IOW two
or three days real time) to get done.Normally I'd only consider doing such a thing during development phase,
but since we're staring at at least one and maybe two bugs that are
going to be hard to fix in any materially-less-intrusive way, I'm
thinking about doing it now. Theoretically this change shouldn't break
any working code, so letting it hit the streets in 8.4beta2 doesn't seem
totally unreasonable.Comments, objections, better ideas?
All this sounds good. As for how to handle := and .., I think making
them lex the same way in PL/pgsql and core SQL would be a good thing.
...Robert
On Tue, 2009-04-14 at 16:37 -0400, Tom Lane wrote:
Comments, objections, better ideas?
Please, if you do this, make it optional.
Potentially changing the behaviour of thousands of functions just to fix
a rare bug will not endear us to our users. The bug may be something
that people are relying on in some subtle way, ugly as that sounds.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Robert Haas wrote:
All this sounds good. As for how to handle := and .., I think making
them lex the same way in PL/pgsql and core SQL would be a good thing.
They don't have any significance in core SQL. What would we do with the
lexeme?
ISTR we've used some hacks in the past to split lexemes into pieces, and
presumably we'd have to do something similar with these.
The only thing that makes me nervous about this is that we're very close
to Beta. OTOH, this is one area the regression suite should give a
fairly good workout to.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Robert Haas wrote:
All this sounds good. As for how to handle := and .., I think making
them lex the same way in PL/pgsql and core SQL would be a good thing.
They don't have any significance in core SQL. What would we do with the
lexeme?
It would just fail --- the core grammar will have no production that can
accept it. Right offhand I think the only difference is that instead of
regression=# select a .. 2;
ERROR: syntax error at or near "."
LINE 1: select a .. 2;
^
you'd see
regression=# select a .. 2;
ERROR: syntax error at or near ".."
LINE 1: select a .. 2;
^
ie it acts like one token not two in the error message.
This solution would become problematic if the core grammar ever had a
meaning for := or .. that required treating them as two tokens (eg,
the grammar allowed this sequence with whitespace between). I don't
think that's very likely though; and if it did happen we could fix it
with the aforementioned control switch.
The only thing that makes me nervous about this is that we're very close
to Beta. OTOH, this is one area the regression suite should give a
fairly good workout to.
Yeah, I'd rather have done it before beta1, but too late. The other
solution still entails massive changes to the plpgsql lexer, so it
doesn't really look like much lower risk. AFAICS the practical
alternatives are a reimplementation in beta2, or no fix until 8.5.
regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes:
On Tue, 2009-04-14 at 16:37 -0400, Tom Lane wrote:
Comments, objections, better ideas?
Please, if you do this, make it optional.
I don't think making the plpgsql lexer pluggable is realistic.
Potentially changing the behaviour of thousands of functions just to fix
a rare bug will not endear us to our users. The bug may be something
that people are relying on in some subtle way, ugly as that sounds.
That's why I don't want to change it in a minor release. In a major
release, however, it's fair game.
regards, tom lane
Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
On Tue, 2009-04-14 at 16:37 -0400, Tom Lane wrote:
Comments, objections, better ideas?
Please, if you do this, make it optional.
I don't think making the plpgsql lexer pluggable is realistic.
Potentially changing the behaviour of thousands of functions just to fix
a rare bug will not endear us to our users. The bug may be something
that people are relying on in some subtle way, ugly as that sounds.That's why I don't want to change it in a minor release. In a major
release, however, it's fair game.
Well, this bug has existed long before 8.4 so we could just leave it for
8.5, and it is not like we have had tons of complaints; the only
complaint I saw was one from March, 2008.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
Well, this bug has existed long before 8.4 so we could just leave it for
8.5, and it is not like we have had tons of complaints; the only
complaint I saw was one from March, 2008.
We had one last week, which is what prompted me to start looking at the
plpgsql lexer situation in the first place. Also, if the unicode
literal situation doesn't change, that's going to be problematic as
well.
regards, tom lane
* Bruce Momjian (bruce@momjian.us) wrote:
Well, this bug has existed long before 8.4 so we could just leave it for
8.5, and it is not like we have had tons of complaints; the only
complaint I saw was one from March, 2008.
I think it's a good thing to do in general. I'm also concerned about
if it will impact the plpgsql functions we have (which are pretty
numerous..) but in the end I'd rather have it fixed in 8.4 than possibly
delayed indefinitely (after all, if it's in 8.4, why fix it for 8.5?).
Thanks,
Stephen
On Tue, 2009-04-14 at 18:29 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
On Tue, 2009-04-14 at 16:37 -0400, Tom Lane wrote:
Comments, objections, better ideas?
Please, if you do this, make it optional.
I don't think making the plpgsql lexer pluggable is realistic.
Doesn't sound easy, no. (I didn't suggest pluggable, just optional).
Potentially changing the behaviour of thousands of functions just to fix
a rare bug will not endear us to our users. The bug may be something
that people are relying on in some subtle way, ugly as that sounds.That's why I don't want to change it in a minor release. In a major
release, however, it's fair game.
If we want to make easy upgrades a reality, this is the type of issue we
must consider. Not much point having perfect binary upgrades if all your
functions start behaving differently after upgrade and then you discover
there isn't a binary downgrade path...
Rather than come up with specific solutions, let me just ask the
question: Is there a workaround for people caught by these changes?
Let's plan that alongside the change itself, so we have a reserve
'chute.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
On Tue, 2009-04-14 at 18:29 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
Potentially changing the behaviour of thousands of functions just to fix
a rare bug will not endear us to our users. The bug may be something
that people are relying on in some subtle way, ugly as that sounds.That's why I don't want to change it in a minor release. In a major
release, however, it's fair game.If we want to make easy upgrades a reality, this is the type of issue we
must consider. Not much point having perfect binary upgrades if all your
functions start behaving differently after upgrade and then you discover
there isn't a binary downgrade path...Rather than come up with specific solutions, let me just ask the
question: Is there a workaround for people caught by these changes?
Let's plan that alongside the change itself, so we have a reserve
'chute.
Extract the source of the offending plpgsql function using e.g pg_dump,
modify it so that it works again, and restore the function. There's your
workaround.
I haven't been following what the issues we have with the current
plpgsql lexer are, so I'm not sure what I think of the plan as a whole.
Sharing the main lexer seems like a good idea, but it also seems like
it's way too late in the release cycle for such changes. But then again,
if we have issues that need to be fixed anyway, it might well be the
best way to fix them.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, 2009-04-15 at 11:36 +0300, Heikki Linnakangas wrote:
Extract the source of the offending plpgsql function using e.g
pg_dump, modify it so that it works again, and restore the function.
There's your workaround.
Forcing manual re-editing of an unknown number of lines of code is not a
useful workaround, its just the default.
How do you know which is the offending function? If we force a full
application retest we put in place a significant barrier to upgrade.
That isn't useful for us as developers, nor is it useful for users.
I'm happy for Tom to make changes now; delay has no advantage. If we
have to add some lines of code/complexity to help our users then it
seems a reasonable thing to do rather than keeping our code pure and
demanding everybody else make changes.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Wed, Apr 15, 2009 at 10:12 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
How do you know which is the offending function? If we force a full
application retest we put in place a significant barrier to upgrade.
That isn't useful for us as developers, nor is it useful for users.
This is a fundamental conflict, not one that has a single simple answer.
However this seems like a strange place to pick your battle. Something
as low-level as the lexer is very costly to provide multiple
interfaces to. It's basically impossible short of simply providing two
different plpgsql languages -- something which won't scale at all if
we have to do it every time we make a syntax change to the language.
I'm actually concerned that we've become *too* conservative. Pretty
much any change that doesn't have Tom's full support and credibility
standing behind it ends up being criticized on the basis that we don't
know precisely what effects it will have in every possible scenario.
One of free software's big advantages over commercial software is that
it moves so much more quickly. Oracle, AIX, Windows, etc are burdened
by hundreds of layers of backwards-compatibility which take up a huge
portion of their development and Q/A effort. The reason Linux,
Postgres, and others have been able to come up so quickly and overtake
them is partly because we don't worry about such things.
As far as I'm concerned commercial support companies can put effort
into developing backwards-compatibility modules which add no long-term
value for their paying customers who need it today while the free
software developers can keep improving the software for new users.
--
greg
Simon Riggs wrote:
On Wed, 2009-04-15 at 11:36 +0300, Heikki Linnakangas wrote:
Extract the source of the offending plpgsql function using e.g
pg_dump, modify it so that it works again, and restore the function.
There's your workaround.Forcing manual re-editing of an unknown number of lines of code is not a
useful workaround, its just the default.How do you know which is the offending function? If we force a full
application retest we put in place a significant barrier to upgrade.
That isn't useful for us as developers, nor is it useful for users.
If I understood correctly, the proposed change is not supposed to have
any user-visible effects. It doesn't force a full application retest any
more than any of the other changes that have gone into 8.4.
We're talking about what we'll do or tell the users to do if we missed
something. By definition we don't know what we've missed, so I don't
think we can come up with a more specific solution than that.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, 2009-04-15 at 10:56 +0100, Greg Stark wrote:
On Wed, Apr 15, 2009 at 10:12 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
How do you know which is the offending function? If we force a full
application retest we put in place a significant barrier to upgrade.
That isn't useful for us as developers, nor is it useful for users.This is a fundamental conflict, not one that has a single simple answer.
However this seems like a strange place to pick your battle.
I think you are right that you perceive a fundamental conflict and most
things I say become battles. That is not my choice and I will withdraw
from further discussion. My point has been made clearly and has not been
made to cause conflict. I've better things to do with my time than that,
though it's a shame you think that of me.
As far as I'm concerned commercial support companies can put effort
into developing backwards-compatibility modules which add no long-term
value for their paying customers who need it today while the free
software developers can keep improving the software for new users.
We will all doubtless make money from difficult upgrades, though that is
not my choice, nor that of my customers.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Wed, Apr 15, 2009 at 11:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
This is a fundamental conflict, not one that has a single simple answer.
However this seems like a strange place to pick your battle.
I think you are right that you perceive a fundamental conflict and most
things I say become battles. That is not my choice and I will withdraw
from further discussion. My point has been made clearly and has not been
made to cause conflict. I've better things to do with my time than that,
though it's a shame you think that of me.
Uhm, I didn't intend this as criticism at all, except inasmuch as the
judgement about whether the plpgsql lexer was a good choice of place
to make this stand. The use of "battle" was only because of the idiom
"pick your battle".
I think we are in general too conservative about making changes and
you are concerned that we're not giving enough thought to the upgrade
pain and should be more conservative. We can talk about general
policies but ultimately we'll have to debate each change on its
merits.
In this case it would help if we described the specific kinds of code
and consequences users. I'm not sure we're all on the same page.
I think changing the lexer to match the SQL lexer will only affect
string constants and only if standards_conforming_strings is enabled,
and only those instances which are handled internally to plpgsql and
not passed to the SQL engine. So the fix will pretty much always be
local to the behaviour change. It's possible for an escaped string to
need an E'' and for the backslash to migrate to other parts of the
code before triggering a bug (or possibly even get stored in the
database and cause a problem in other parts of the application). But
it should still be pretty straightforward to find the original source
of the string and also pretty easy to recognize string constants
throughout the source code.
As it currently stands a programmer sometimes has to use E'\x' and
sometimes has to use '\x' depending on whether the plpgsql is lexing
the string or is passing it to the SQL engine unlexed. It's not
obvious which parts get handled in which way to a user since some
constructs are handled as SQL which don't appear to be SQL and vice
versa -- at least it's not obvious to me even having read the source
in the past.
If I understand things correctly I think the change improves the
language for future users by far more than it imposes maintenance
costs on existing users, especially considering that anyone depending
on '\x' strings with standards_conforming_strings enabled is only
probably getting it wrong in some places without realizing it anyways
.
--
greg
On Wed, Apr 15, 2009 at 5:56 AM, Greg Stark <stark@enterprisedb.com> wrote:
On Wed, Apr 15, 2009 at 10:12 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
How do you know which is the offending function? If we force a full
application retest we put in place a significant barrier to upgrade.
That isn't useful for us as developers, nor is it useful for users.This is a fundamental conflict, not one that has a single simple answer.
However this seems like a strange place to pick your battle. Something
as low-level as the lexer is very costly to provide multiple
interfaces to. It's basically impossible short of simply providing two
different plpgsql languages -- something which won't scale at all if
we have to do it every time we make a syntax change to the language.
Completely agreed.
I'm actually concerned that we've become *too* conservative. Pretty
much any change that doesn't have Tom's full support and credibility
standing behind it ends up being criticized on the basis that we don't
know precisely what effects it will have in every possible scenario.
I think we've become too conservative in some areas and not
conservative enough in others. In particular, we're not very
conservative AT ALL about changes to the on-disk format - which is
like unto a bullet through the head for in-place upgrade. And we
sometimes make behavior changes that have potentially catastrophic
user consequences (like that one to TRUNCATE... which one, you ask?
ah, well, you'd better not use TRUNCATE in 8.4 until you RTFM then),
but then we'll have an argument about whether it's OK to make some
change where it's difficult to image the user impact being all that
severe, like:
- this one
- removing the special case for %% in the log_filename
- forward-compatible, backward-compatible improvements to CREATE OR REPLACE VIEW
- lots of others
So it seems that there is no consistent heuristic (other than, as you
say, Tom's approval or lack thereof) applied to these changes.
...Robert
Simon Riggs wrote:
How do you know which is the offending function? If we force a full
application retest we put in place a significant barrier to upgrade.
That isn't useful for us as developers, nor is it useful for users.
We support back branches for a long time for a reason. Nobody in their
right mind should upgrade to a new version without without first
extensively testing (and if necessary adjusting) their application on
it. This is true regardless of this issue and will be true for every
release.
cheers
andrew
On Wed, Apr 15, 2009 at 5:56 AM, Greg Stark <stark@enterprisedb.com> wrote:
On Wed, Apr 15, 2009 at 10:12 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
How do you know which is the offending function? If we force a full
application retest we put in place a significant barrier to upgrade.
That isn't useful for us as developers, nor is it useful for users.This is a fundamental conflict, not one that has a single simple answer.
However this seems like a strange place to pick your battle. Something
as low-level as the lexer is very costly to provide multiple
interfaces to. It's basically impossible short of simply providing two
different plpgsql languages -- something which won't scale at all if
we have to do it every time we make a syntax change to the language.
Completely agreed.
I'm actually concerned that we've become *too* conservative. Pretty
much any change that doesn't have Tom's full support and credibility
standing behind it ends up being criticized on the basis that we don't
know precisely what effects it will have in every possible scenario.
I think we've become too conservative in some areas and not
conservative enough in others. In particular, we're not very
conservative AT ALL about changes to the on-disk format - which is
like unto a bullet through the head for in-place upgrade. And we
sometimes make behavior changes that have potentially catastrophic
user consequences (like that one to TRUNCATE... which one, you ask?
ah, well, you'd better not use TRUNCATE in 8.4 until you RTFM then),
but then we'll have an argument about whether it's OK to make some
change where it's difficult to image the user impact being all that
severe - like this one, for example (or removing the special case for
no %-escapes in log_filename).
...Robert
Andrew Dunstan <andrew@dunslane.net> writes:
We support back branches for a long time for a reason.
I think that's really the bottom line here. If we insist on new major
releases always being bug-compatible with prior releases, our ability to
improve the software will go to zero. The solution we have opted for
instead is to support back branches for a long time and to avoid making
that type of change in a back branch.
regards, tom lane