what's up with IDENTIFIER_LOOKUP_EXPR?

Started by Robert Haasalmost 9 years ago6 messageshackers
Jump to latest
#1Robert Haas
robertmhaas@gmail.com

plpgsql has an enum called IdentifierLookup which includes a value
IDENTIFIER_LOOKUP_EXPR which is declared like this:

IDENTIFIER_LOOKUP_EXPR /* In SQL expression --- special case */

It regrettably does not explain what exactly is special about it, and
AFAICT, neither does any other comment. If I replace every usage of
IDENTIFIER_LOOKUP_EXPR with IDENTIFIER_LOOKUP_NORMAL, the regression
tests pass nonetheless. It was introduced by
01f7d29902cb27fb698e5078d72cb837398e074c, committed by Tom in 2010:

commit 01f7d29902cb27fb698e5078d72cb837398e074c
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun Jan 10 17:15:18 2010 +0000

Improve plpgsql's handling of record field references by forcing
all potential
field references in SQL expressions to have RECFIELD datum-array entries at
parse time. If it turns out that the reference is actually to a SQL column,
the RECFIELD entry is useless, but it costs little. This allows
us to get rid
of the previous use of FieldSelect applied to a whole-row Param
for the record
variable; which was not only slower than a direct RECFIELD reference, but
failed for references to system columns of a trigger's NEW or OLD record.
Per report and fix suggestion from Dean Rasheed.

The rule, as far as I can tell from reading the code, is that
IDENTIFIER_LOOKUP_NORMAL looks up words, double-words (e.g. x.y), and
triple-words (e.g x.y.z), IDENTIFIER_LOOKUP_EXPR looks up only
double-words and triple-words, and IDENTIFIER_LOOKUP_DECLARE looks up
nothing. But it's not clear to me exactly what the motivation for
that is. plpgsql_parse_word says:

/*
* We should do nothing in DECLARE sections. In SQL expressions, there's
* no need to do anything either --- lookup will happen when the
* expression is compiled.
*/

...but that doesn't really explain why we go out of our way to have
this third state, because the wording implies that it doesn't
particularly matter one way or the other.

Any help appreciated.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#1)
Re: what's up with IDENTIFIER_LOOKUP_EXPR?

Robert Haas <robertmhaas@gmail.com> writes:

plpgsql has an enum called IdentifierLookup which includes a value
IDENTIFIER_LOOKUP_EXPR which is declared like this:
IDENTIFIER_LOOKUP_EXPR /* In SQL expression --- special case */
It regrettably does not explain what exactly is special about it, and
AFAICT, neither does any other comment. If I replace every usage of
IDENTIFIER_LOOKUP_EXPR with IDENTIFIER_LOOKUP_NORMAL, the regression
tests pass nonetheless.

AFAICS, the only place where IDENTIFIER_LOOKUP_EXPR acts differently from
IDENTIFIER_LOOKUP_NORMAL is plpgsql_parse_word(), and what it does is to
prevent a lookup in plpgsql's namestack from occurring, so that an
identifier is reported as "not recognized" even if there is a matching
variable. In the two places that currently select IDENTIFIER_LOOKUP_EXPR,
that seems to be only an optimization, because they will act the same
anyway for all the token types that plpgsql_parse_word could return.
I don't remember if there originally was a bigger reason than that.
But it seems reasonable to be able to signal the lookup machinery that
we're in a SQL expression rather than someplace else; in principle that
could have larger implications than it does right now.

The rule, as far as I can tell from reading the code, is that
IDENTIFIER_LOOKUP_NORMAL looks up words, double-words (e.g. x.y), and
triple-words (e.g x.y.z), IDENTIFIER_LOOKUP_EXPR looks up only
double-words and triple-words, and IDENTIFIER_LOOKUP_DECLARE looks up
nothing. But it's not clear to me exactly what the motivation for
that is. plpgsql_parse_word says:

The doubleword and tripleword cases are slightly different: lookup can't
be turned off completely, because we need to create matching RECFIELD
datums for use later. It's still true that we don't especially care what
token is returned, but the side-effect of creating a RECFIELD entry is
important. Possibly we could shave a few more cycles by making the code
know that explicitly, but it didn't seem worth the complexity at the time.

Feel free to improve the comments if this bugs you ...

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#2)
Re: what's up with IDENTIFIER_LOOKUP_EXPR?

On Thu, May 4, 2017 at 4:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

plpgsql has an enum called IdentifierLookup which includes a value
IDENTIFIER_LOOKUP_EXPR which is declared like this:
IDENTIFIER_LOOKUP_EXPR /* In SQL expression --- special case */
It regrettably does not explain what exactly is special about it, and
AFAICT, neither does any other comment. If I replace every usage of
IDENTIFIER_LOOKUP_EXPR with IDENTIFIER_LOOKUP_NORMAL, the regression
tests pass nonetheless.

AFAICS, the only place where IDENTIFIER_LOOKUP_EXPR acts differently from
IDENTIFIER_LOOKUP_NORMAL is plpgsql_parse_word(), and what it does is to
prevent a lookup in plpgsql's namestack from occurring, so that an
identifier is reported as "not recognized" even if there is a matching
variable. In the two places that currently select IDENTIFIER_LOOKUP_EXPR,
that seems to be only an optimization, because they will act the same
anyway for all the token types that plpgsql_parse_word could return.
I don't remember if there originally was a bigger reason than that.
But it seems reasonable to be able to signal the lookup machinery that
we're in a SQL expression rather than someplace else; in principle that
could have larger implications than it does right now.

Thanks for the explanation.

The rule, as far as I can tell from reading the code, is that
IDENTIFIER_LOOKUP_NORMAL looks up words, double-words (e.g. x.y), and
triple-words (e.g x.y.z), IDENTIFIER_LOOKUP_EXPR looks up only
double-words and triple-words, and IDENTIFIER_LOOKUP_DECLARE looks up
nothing. But it's not clear to me exactly what the motivation for
that is. plpgsql_parse_word says:

The doubleword and tripleword cases are slightly different: lookup can't
be turned off completely, because we need to create matching RECFIELD
datums for use later. It's still true that we don't especially care what
token is returned, but the side-effect of creating a RECFIELD entry is
important. Possibly we could shave a few more cycles by making the code
know that explicitly, but it didn't seem worth the complexity at the time.

The PLPGSQL_DTYPE_* constants are another thing that's not really
documented. You've mentioned that we should get rid of
PLPGSQL_DTYPE_ROW in favor of, uh, whatever's better than that, but
it's not clear to me what that really means, why one way is better
than the other way, or what is involved. I'm not really clear on what
a PLpgSQL_datum_type is in general, or what any of the specific types
actually mean. I'll guess that PLPGSQL_DTYPE_VAR is a variable, but
beyond that...

Feel free to improve the comments if this bugs you ...

I might try to do that at some point, but I don't think my
understanding of all this is clear enough to attempt it at this point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#3)
Re: what's up with IDENTIFIER_LOOKUP_EXPR?

Robert Haas <robertmhaas@gmail.com> writes:

The PLPGSQL_DTYPE_* constants are another thing that's not really
documented.

Yeah :-(. Complain to Jan sometime.

You've mentioned that we should get rid of
PLPGSQL_DTYPE_ROW in favor of, uh, whatever's better than that, but
it's not clear to me what that really means, why one way is better
than the other way, or what is involved. I'm not really clear on what
a PLpgSQL_datum_type is in general, or what any of the specific types
actually mean. I'll guess that PLPGSQL_DTYPE_VAR is a variable, but
beyond that...

Well, from memory ...

PLpgSQL_datum is really a symbol table entry. The conflict against what
we mean by "Datum" elsewhere is pretty unfortunate.

VAR - variable of scalar type (well, any non-composite type ---
arrays and ranges are this kind of PLpgSQL_datum too, for instance).

REC - variable of composite type, stored as a HeapTuple.

RECFIELD - reference to a field of a REC variable. The datum includes
the field name and a link to the datum for the parent REC variable.
Notice this implies a runtime lookup of the field name whenever we're
accessing the datum; which sucks for performance but it makes life a
lot easier when you think about the possibility of the composite type
changing underneath you.

ARRAYELEM - reference to an element of an array. The datum includes
a subscript expression and a link to the datum for the parent array
variable (which can be a VAR, and I think a RECFIELD too).

ROW - this is where it gets fun. A ROW is effectively a variable
of a possibly-anonymous composite type, and it is defined by a list
(in its own datum) of links to PLpgSQL_datums representing the
individual columns. Typically the member datums would be VARs
but that's not the only possibility.

As I mentioned earlier, the case that ROW is actually well adapted
for is multiple targets in INTO and similar constructs. For example,
if you have

SELECT ...blah blah... INTO a,b,c

then the target of the PLpgSQL_stmt_execsql is represented as a single
ROW datum whose members are the datums for a, b, and c. That's totally
determined by the text of the function and can't change under us.

However ... somebody thought it'd be cute to use the ROW infrastructure
for variables of named composite types, too. So if you have

DECLARE foo some_composite_type;

then the name "foo" refers to a ROW datum, and the plpgsql compiler
generates additional anonymous VAR datums, one for each declared column
in some_composite_type, which become the members of the ROW datum.
The runtime representation is actually that each field value is stored
separately in its datum, as though it were an independent VAR. Field
references "foo.col1" are not compiled into RECFIELD datums; we just look
up the appropriate member datum during compile and make the expression
tree point to that datum directly.

So, this representation is great for speed of access and modification
of individual fields of the composite variable. It sucks when you
want to assign to the composite as a whole or retrieve its value as
a whole, because you have to deconstruct or reconstruct a tuple to
do that. (The REC/RECFIELD approach has approximately the opposite
strengths and weaknesses.) Also, dealing with changes in the named
composite type is a complete fail, because we've built its structure
into the function's symbol table at parse time.

I forget why there's a dtype for EXPR.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#4)
Re: what's up with IDENTIFIER_LOOKUP_EXPR?

On Thu, May 4, 2017 at 6:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

PLpgSQL_datum is really a symbol table entry. The conflict against what
we mean by "Datum" elsewhere is pretty unfortunate.

Yeah. That's particularly bad because datum is a somewhat vague word
under the best of circumstances (like "thing"). Maybe I'm missing
something, but isn't it more like a parse tree than a symbol table
entry? The symbol table entries seem to be based on
PLpgSQL_nsitem_type, not PLpgSQL_datum, and they only come in the
flavors you'd expect to find in a symbol table: label, var, row, rec.
The datums on the other hand seem to consist of every kind of thing
that PLpgSQL might be expected to interpret itself, either as an
rvalue or as an lvalue.

ROW - this is where it gets fun. A ROW is effectively a variable
of a possibly-anonymous composite type, and it is defined by a list
(in its own datum) of links to PLpgSQL_datums representing the
individual columns. Typically the member datums would be VARs
but that's not the only possibility.

As I mentioned earlier, the case that ROW is actually well adapted
for is multiple targets in INTO and similar constructs. For example,
if you have

SELECT ...blah blah... INTO a,b,c

then the target of the PLpgSQL_stmt_execsql is represented as a single
ROW datum whose members are the datums for a, b, and c. That's totally
determined by the text of the function and can't change under us.

However ... somebody thought it'd be cute to use the ROW infrastructure
for variables of named composite types, too. So if you have

DECLARE foo some_composite_type;

then the name "foo" refers to a ROW datum, and the plpgsql compiler
generates additional anonymous VAR datums, one for each declared column
in some_composite_type, which become the members of the ROW datum.
The runtime representation is actually that each field value is stored
separately in its datum, as though it were an independent VAR. Field
references "foo.col1" are not compiled into RECFIELD datums; we just look
up the appropriate member datum during compile and make the expression
tree point to that datum directly.

Ugh.

So, this representation is great for speed of access and modification
of individual fields of the composite variable. It sucks when you
want to assign to the composite as a whole or retrieve its value as
a whole, because you have to deconstruct or reconstruct a tuple to
do that. (The REC/RECFIELD approach has approximately the opposite
strengths and weaknesses.) Also, dealing with changes in the named
composite type is a complete fail, because we've built its structure
into the function's symbol table at parse time.

It would probably be possible to come up with a representation that
allowed both things to be efficient, the way a slot can contain either
a Datum/isnull array or a heap tuple or both. Call the resulting data
structure a record slot. foo.col1 could retain the original column
name and also a cache containing a pointer to the record slot and a
column offset within the slot, so that it can say e.g.
assign_record_slot_column(slot, col, val). But it could also register
the reference with the slot, so that if the slot structure changes,
the cached slot and column offset (or at least the column offset) are
cleared and get recomputed on next access.

I'm not volunteering to do the work, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#5)
Re: what's up with IDENTIFIER_LOOKUP_EXPR?

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, May 4, 2017 at 6:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

PLpgSQL_datum is really a symbol table entry. The conflict against what
we mean by "Datum" elsewhere is pretty unfortunate.

Yeah. That's particularly bad because datum is a somewhat vague word
under the best of circumstances (like "thing"). Maybe I'm missing
something, but isn't it more like a parse tree than a symbol table
entry? The symbol table entries seem to be based on
PLpgSQL_nsitem_type, not PLpgSQL_datum, and they only come in the
flavors you'd expect to find in a symbol table: label, var, row, rec.

Right, there's this separate name lookup table that maps from names to
PLpgSQL_datums. I personally wouldn't call the PLpgSQL_datum items
parse trees, since they're mostly pretty flat. But if you want to
think of them that way I can't stop you.

One other important, and unfortunate, thing is that the PLpgSQL_datums
aren't (usually) read-only --- they not only hold static symbol-table-like
info but also the actual current values, for those datums that correspond
directly to a stored variable value (VAR and REC, but not ROW or RECFIELD
or ARRAYELEM). This seems messy to me, and it forces a rather expensive
copy step during plpgsql function startup. I'd like to see those duties
separated, so that the actual r/w state is just arrays of Datum/isnull
that we can trivially initialize during function start.

So, this representation is great for speed of access and modification
of individual fields of the composite variable. It sucks when you
want to assign to the composite as a whole or retrieve its value as
a whole, because you have to deconstruct or reconstruct a tuple to
do that. (The REC/RECFIELD approach has approximately the opposite
strengths and weaknesses.) Also, dealing with changes in the named
composite type is a complete fail, because we've built its structure
into the function's symbol table at parse time.

It would probably be possible to come up with a representation that
allowed both things to be efficient,

Yeah, perhaps. It would be good to take two steps back and reconsider
the whole data structure.

I'm not volunteering to do the work, though.

It's not at the top of my priority list either, but I'd like to see it
happen sometime.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers