Rethinking plpgsql's assignment implementation

Started by Tom Lane, over 5 years ago, 21 messages, pgsql-hackers
#1 Tom Lane
tgl@sss.pgh.pa.us

We've had complaints in the past about how plpgsql can't handle
assignments to fields in arrays of records [1], that is cases like

arrayvar[n].field := something;

and also complaints about how plpgsql can't handle assignments
to array slices [2], ie

arrayvar[m:n] := something;

As of commit c7aba7c14, we have another problem, namely that
plpgsql's subscripted assignment only works for regular arrays;
it won't work for other types that might define subscript
assignment handlers.
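For instance, jsonb subscripting, built later on that commit's infrastructure, is the kind of case that falls through the cracks; a minimal sketch:

```sql
DO $$
DECLARE j jsonb := '{"a": 1}';
BEGIN
  -- jsonb provides a subscript assignment handler via the generic
  -- subscripting infrastructure, but plpgsql's own array-assignment
  -- code only knows how to handle plain arrays
  j['b'] := '2';
  RAISE NOTICE '%', j;
END $$;
```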

So I started to think about how to fix that, and eventually
decided that what we ought to do is nuke plpgsql's array-assignment
code altogether. The core code already has support for everything
we want here in the context of field/element assignments in UPDATE
commands; if we could get plpgsql to make use of that infrastructure
instead of rolling its own, we'd be a lot better off.

The hard part of that is that the core parser will only generate
the structures we need (FieldStores and assignment SubscriptingRefs)
within UPDATE commands. We could export the relevant functions
(particularly transformAssignmentIndirection); but that won't help
plpgsql very much, because it really wants to be able to run all this
stuff through SPI. That means we have to have SQL syntax that can
generate an expression of that form.

That led me to think about introducing a new statement, say

SET variable_name opt_indirection := a_expr

where opt_indirection is gram.y's symbol for "field selections and/or
subscripts". The idea here is that a plpgsql statement like

x[2].fld := something;

would be parsed using this new statement, producing an expression
that uses an assignment SubscriptingRef and a FieldStore operating
on a Param that gives the initial value of the array-of-composite
variable "x". Then plpgsql would just evaluate this expression and
assign the result to x. Problem solved.
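For instance, assuming a composite type created as CREATE TYPE mytype AS (field int, other int), the rewritten statement handles:

```sql
DO $$
DECLARE
  arrayvar mytype[];
BEGIN
  arrayvar[1] := ROW(1, 2);
  -- parsed into a single expression: an assignment SubscriptingRef
  -- wrapping a FieldStore, applied to a Param carrying the current
  -- value of arrayvar; plpgsql just evaluates it and stores the result
  arrayvar[1].field := 42;
  RAISE NOTICE '%', arrayvar;
END $$;
```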

This almost works as-is, modulo annoying parse conflicts against the
existing variants of SET. However there's a nasty little detail
about what "variable_name" can be in plpgsql: it can be either one or
two identifiers, since there might be a block label involved, eg

<<mylabel>> declare x int; begin mylabel.x := ...

Between that and the parse-conflict problem, I ended up
with this syntax:

SET n: variable_name opt_indirection := a_expr

where "n" is an integer literal indicating how many dot-separated names
should be taken as the base variable name. Another annoying point is
that plpgsql historically has allowed fun stuff like

mycount := count(*) from my_table where ...;

that is, after the expression you can have all the rest of an ordinary
SELECT command. That's not terribly hard to deal with, but it means
that this new statement has to have all of SELECT's other options too.
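A self-contained sketch of that style of assignment:

```sql
DO $$
DECLARE mycount int;
BEGIN
  -- everything after the expression is an ordinary SELECT tail
  -- (FROM, WHERE, ORDER BY, LIMIT, ...), so the new statement must
  -- accept those clauses too
  mycount := count(*) FROM generate_series(1, 10) AS g WHERE g > 3;
  RAISE NOTICE 'mycount = %', mycount;
END $$;
```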

The other area that doesn't quite work without some kind of hack is
that plpgsql's casting rules for which types can be assigned to what
are far laxer than what the core parser thinks should be allowed in
UPDATE. The cast has to happen within the assignment expression
for this to work at all, so plpgsql can't fix it by itself. The
solution I adopted was just to invent a new CoercionContext value
COERCION_PLPGSQL, representing "use pl/pgsql's rules". (Basically
what that means nowadays is to apply CoerceViaIO if assignment cast
lookup doesn't find a cast pathway.)
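For example, plpgsql has long accepted assignments that have no cast pathway at all, falling back to I/O coercion; a sketch:

```sql
DO $$
DECLARE p point;
BEGIN
  -- no cast exists from text to point; plpgsql's rules fall back to
  -- CoerceViaIO (feed the text to point's input function), which
  -- UPDATE's assignment-cast rules would reject
  p := '(1,2)'::text;
  RAISE NOTICE '%', p;
END $$;
```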

A happy side-effect of this approach is that it actually makes
some cases faster. In particular I can measure speedups for
(a) assignments to subscripted variables and (b) cases where a
coercion must be performed to produce the result to be assigned.
I believe the reason for this is that the patch effectively
merges what had been separate expressions (subscripts or casts,
respectively) into the main result-producing expression. This
eliminates a nontrivial amount of overhead for plancache validity
checking, execution startup, etc.
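A crude way to observe case (a); timing numbers are machine-dependent, and this is only a measurement sketch:

```sql
DO $$
DECLARE
  a  int[];
  t0 timestamptz := clock_timestamp();
BEGIN
  FOR i IN 1..1000000 LOOP
    a[i] := i;   -- subscripted assignment, now one merged expression
  END LOOP;
  RAISE NOTICE 'elapsed: %', clock_timestamp() - t0;
END $$;
```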

Another side-effect is that the report of the statement in error
cases might look different. For example, in v13 a typo in a
subscript expression produces

regression=# do $$ declare x int[]; begin x[!2] = 43; end $$;
ERROR: operator does not exist: ! integer
LINE 1: SELECT !2
^
HINT: No operator matches the given name and argument type. You might need to add an explicit type cast.
QUERY: SELECT !2
CONTEXT: PL/pgSQL function inline_code_block line 1 at assignment

With this patch, you get

regression=# do $$ declare x int[]; begin x[!2] = 43; end $$;
ERROR: operator does not exist: ! integer
LINE 1: SET 1: x[!2] = 43
^
HINT: No operator matches the given name and argument type. You might need to add an explicit type cast.
QUERY: SET 1: x[!2] = 43
CONTEXT: PL/pgSQL function inline_code_block line 1 at assignment

It seems like a clear improvement to me that the whole plpgsql statement
is now quoted, but the "SET n:" bit in front of it might confuse people,
especially if we don't document this new syntax (which I'm inclined not
to, since it's useless in straight SQL). On the other hand, the
"SELECT" that you got with the old code was confusing to novices too.
Maybe something could be done to suppress those prefixes in error
reports? Seems like a matter for another patch. We could also use
some other prefix --- there's nothing particularly magic about the
word "SET" here, except that it already exists as a keyword --- but
I didn't think of anything I liked better.

This is still WIP: I've not added any new regression test cases
nor looked at the docs, and there's more cleanup needed in plpgsql.
But it passes check-world, so I thought I'd put it out for comments.

regards, tom lane

[1]: /messages/by-id/A3691E98-CCA5-4DEB-B43C-92AD0437E09E@mikatiming.de
[2]: /messages/by-id/1070.1451345954@sss.pgh.pa.us

Attachments:

0001-rewrite-plpgsql-assignment-1.patch (text/x-diff) +519 -29
#2 Chapman Flack
chap@anastigmatix.net
In reply to: Tom Lane (#1)
Re: Rethinking plpgsql's assignment implementation

On 12/11/20 12:21, Tom Lane wrote:

solution I adopted was just to invent a new CoercionContext value
COERCION_PLPGSQL, representing "use pl/pgsql's rules". (Basically
what that means nowadays is to apply CoerceViaIO if assignment cast
lookup doesn't find a cast pathway.)

That seems like a rule that might be of use in other PLs or extensions;
could it have a more generic name, COERCION_FALLBACK or something?

is now quoted, but the "SET n:" bit in front of it might confuse people,
especially if we don't document this new syntax (which I'm inclined not
to, since it's useless in straight SQL).

If it's true that the only choices for n: are 1: or 2:, maybe it would look
less confusing in an error message to, hmm, decree that this specialized SET
form /always/ takes a two-component name, but accept something special like
ROUTINE.x (or UNNAMED.x or NULL.x or something) for the case where there
isn't a qualifying label in the plpgsql source?

It's still a strange arbitrary creation, but might give more of a hint of
its meaning if it crops up in an error message somewhere.

Regards,
-Chap

#3 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tom Lane (#1)
Re: Rethinking plpgsql's assignment implementation

Hi

It is great. I expected much more work.

On Fri, Dec 11, 2020 at 18:21 Tom Lane <tgl@sss.pgh.pa.us> wrote:

We've had complaints in the past about how plpgsql can't handle
assignments to fields in arrays of records [1], that is cases like

arrayvar[n].field := something;

and also complaints about how plpgsql can't handle assignments
to array slices [2], ie

arrayvar[m:n] := something;

As of commit c7aba7c14, we have another problem, namely that
plpgsql's subscripted assignment only works for regular arrays;
it won't work for other types that might define subscript
assignment handlers.

So I started to think about how to fix that, and eventually
decided that what we ought to do is nuke plpgsql's array-assignment
code altogether. The core code already has support for everything
we want here in the context of field/element assignments in UPDATE
commands; if we could get plpgsql to make use of that infrastructure
instead of rolling its own, we'd be a lot better off.

The hard part of that is that the core parser will only generate
the structures we need (FieldStores and assignment SubscriptingRefs)
within UPDATE commands. We could export the relevant functions
(particularly transformAssignmentIndirection); but that won't help
plpgsql very much, because it really wants to be able to run all this
stuff through SPI. That means we have to have SQL syntax that can
generate an expression of that form.

That led me to think about introducing a new statement, say

SET variable_name opt_indirection := a_expr

SQL/PSM (ANSI SQL) defines SET var = expr

If you introduce a new statement, LET, it could be less confusing for
users, and that statement could be the foundation for schema variables.
With it, the implementation of schema variables would be significantly
simpler.

Regards

Pavel


#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Chapman Flack (#2)
Re: Rethinking plpgsql's assignment implementation

Chapman Flack <chap@anastigmatix.net> writes:

On 12/11/20 12:21, Tom Lane wrote:

solution I adopted was just to invent a new CoercionContext value
COERCION_PLPGSQL, representing "use pl/pgsql's rules". (Basically
what that means nowadays is to apply CoerceViaIO if assignment cast
lookup doesn't find a cast pathway.)

That seems like a rule that might be of use in other PLs or extensions;
could it have a more generic name, COERCION_FALLBACK or something?

I'm not wedded to that name, but I doubt that it's semantics that we
really want to encourage anyone else to use. In particular, the fact
that it's not a superset of COERCION_EXPLICIT is pretty darn weird,
with little except backwards compatibility to recommend it.

is now quoted, but the "SET n:" bit in front of it might confuse people,
especially if we don't document this new syntax (which I'm inclined not
to, since it's useless in straight SQL).

If it's true that the only choices for n: are 1: or 2:, maybe it would look
less confusing in an error message to, hmm, decree that this specialized SET
form /always/ takes a two-component name, but accept something special like
ROUTINE.x (or UNNAMED.x or NULL.x or something) for the case where there
isn't a qualifying label in the plpgsql source?

As the patch stands, it's still using the RECFIELD code paths, which
means that there could be three-component target variable names
(label.variable.field). If we were to get rid of that and expect
top-level field assignment to also be handled by this new mechanism,
then maybe your idea could be made to work. But I have not tried to
implement that here, as I don't see how to make it work for RECORD-type
variables (where the names and types of the fields aren't determinate).

In any case, that approach still involves inserting some query text
that the user didn't write, so I'm not sure how much confusion it'd
remove. The "SET n:" business at least looks like it's some weird
prefix comparable to "LINE n:", so while people wouldn't understand
it I think they'd easily see it as something the system prefixed
to their query.

Looking a bit ahead, it's not too hard to imagine plpgsql wishing
to pass other sorts of annotations through SPI and down to the core
parser. Maybe we should think about a more general way to do that
in an out-of-band, not-visible-in-the-query-text fashion.

regards, tom lane

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#4)
Re: Rethinking plpgsql's assignment implementation

I wrote:

In any case, that approach still involves inserting some query text
that the user didn't write, so I'm not sure how much confusion it'd
remove. The "SET n:" business at least looks like it's some weird
prefix comparable to "LINE n:", so while people wouldn't understand
it I think they'd easily see it as something the system prefixed
to their query.

Looking a bit ahead, it's not too hard to imagine plpgsql wishing
to pass other sorts of annotations through SPI and down to the core
parser. Maybe we should think about a more general way to do that
in an out-of-band, not-visible-in-the-query-text fashion.

I have an idea (no code written yet) about this.

After looking around, it seems like the ParserSetupHook mechanism
is plenty for anything we might want an extension to be able to
change in the behavior of parse analysis. The hooks that we
currently allow that to set affect only the interpretation of
variable names and $N parameter symbols, but we could surely
add much more in that line as needed.

What we lack is any good way for an extension to control the
behavior of raw_parser() (i.e., gram.y). Currently, plpgsql
prefixes "SELECT " to expressions it might want to parse, and
now my current patch proposes to prefix something else to get a
different grammar behavior. Another example of a very similar
problem is typeStringToTypeName(), which prefixes a string it
expects to be a type name with "SELECT NULL::", and then has
to do a bunch of kluges to deal with the underspecification
involved in that. Based on these examples, we need some sort
of "overall goal" option for the raw parser, but maybe not more
than that --- other things you might want tend to fall into the
parse analysis side of things.

So my idea here is to add a parsing-mode option to raw_parser(),
which would be an enum with values like "normal SQL statement",
"expression only", "type name", "plpgsql assignment statement".
The problem I had with not knowing how many dotted names to
absorb at the start of an assignment statement could be finessed
by inventing "assignment1", "assignment2", and "assignment3"
parsing modes; that's a little bit ugly but not enough to make
me think we need a wider API.

As to how it could actually work, I'm noticing that raw_parser
starts out by initializing yyextra's lookahead buffer to empty.
For the parsing modes other than "normal SQL statement", it
could instead inject a lookahead token that is a code that cannot
be generated by the regular lexer. Then gram.y could have
productions like

EXPRESSION_MODE a_expr { ... generate parse tree ... }

where EXPRESSION_MODE is one of these special tokens. And now
we have something that will parse an a_expr, and only an a_expr,
and we don't need any special "SELECT " or any other prefix in
the user-visible source string. Similarly for the other special
parsing modes.

Essentially, this is a way of having a few distinct parsers
that share a common base of productions, without the bloat and
code maintenance issues of building actually-distinct parsers.

A small problem with this is that the images of these special
productions in ECPG would be dead code so far as ECPG is concerned.
For the use-cases I can foresee, there wouldn't be enough special
productions for that to be a deal-breaker. But we could probably
teach the ECPG grammar-building scripts to filter out these
productions if it ever got to be annoying.

regards, tom lane

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#5)
Re: Rethinking plpgsql's assignment implementation

I wrote:

So my idea here is to add a parsing-mode option to raw_parser(),
which would be an enum with values like "normal SQL statement",
"expression only", "type name", "plpgsql assignment statement".

Here's a fleshed-out patch series that attacks things that way.
I'm a lot better pleased with this than with my original approach.

0001 creates the basic infrastructure for "raw parse modes", and as
proof of concept simplifies typeStringToTypeName(). There's a minor
functional improvement there, which is that we can now use the core
parser's error cursor position, so instead of

regression=# do $$ declare x int[23/] ; begin end $$;
ERROR: syntax error at or near "/"
LINE 1: do $$ declare x int[23/] ; begin end $$;
^
CONTEXT: invalid type name "int[23/] "

you get

regression=# do $$ declare x int[23/] ; begin end $$;
ERROR: syntax error at or near "/"
LINE 1: do $$ declare x int[23/] ; begin end $$;
^
CONTEXT: invalid type name "int[23/] "

It's possible we could dispense with the error context callback
in typeStringToTypeName altogether, but I've not experimented much.

0002 tackles the next problem, which is to make this feature accessible
through SPI. There are a couple of possibly-controversial choices here.

Following the principle that we should avoid changing documented SPI
interfaces, we need a new version of SPI_prepare to pass RawParseMode
through. This'll be the fourth one :-(, so I decided it was time to
try to make a definition that can stay API-compatible through future
changes. So it takes a struct of options, and I added a promise that
zeroing the struct is enough to guarantee forward compatibility
through future additions.

This leaves both of the previous iterations, SPI_prepare_cursor
and SPI_prepare_params, unused anywhere in the core code.
I suppose we can't kill them (codesearch.debian.net knows of some
external uses) but I propose to mark them deprecated, with an eye
to at least removing their documentation someday.

I did not want to add a RawParseMode parameter to pg_parse_query(),
because that would have affected a larger number of unrelated modules,
and it would not have been great from a header-inclusion footprint
standpoint either. So I chose to pass down the mode from SPI by
having it just call raw_parser() directly instead of going through
pg_parse_query(). Perhaps this is a modularity violation, or perhaps
there's somebody who really wants the extra tracing overhead in
pg_parse_query() to apply to SPI queries. I'm open to discussing
whether this should be done differently.

(However, having made these two patches, I'm now wondering whether
there is any rhyme or reason to the existing state of affairs
with some callers going through pg_parse_query() while others use
raw_parser() directly. It's hard to knock making a different
choice in spi.c unless we have a coherent policy about which to
use where.)

Next, 0003 invents a raw parse mode for plpgsql expressions (which,
in some contexts, can be pretty nearly whole SELECT statements),
and uses that to get plpgsql out of the business of prefixing
"SELECT " to user-written text. I would not have bothered with this
as a standalone fix, but I think it does make for less-confusing
error messages --- we've definitely had novices ask "where'd this
SELECT come from?" in the past. (I cheated a bit on PERFORM, though.
Unlike other places, it needs to allow UNION, so it can't use the
same restricted syntax.)

0004 then reimplements plpgsql assignment. This is essentially the same
patch I submitted before, but redesigned to work with the infrastructure
from 0001-0003.

0005 adds documentation and test cases. It also fixes a couple
of pre-existing problems that the plpgsql parser had with assigning
to sub-fields of record fields, which I discovered while making the
tests.
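For instance, assuming nested composite types created as CREATE TYPE inner_t AS (a int) and CREATE TYPE outer_t AS (f inner_t), assignments like the following exercise those paths:

```sql
DO $$
DECLARE r record;
BEGIN
  r := ROW(ROW(1))::outer_t;
  -- assignment to a sub-field of a record's composite field
  r.f.a := 2;
  RAISE NOTICE '%', r;
END $$;
```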

Finally, 0006 removes plpgsql's ARRAYELEM datum type, on the grounds
that we don't need it anymore. This might be a little controversial
too, because there was still one way to reach the code: GET DIAGNOSTICS
with an array element as target would do so. However, that seems like
a pretty weird corner case. Reviewing the git history, I find that
I added support for that in commit 55caaaeba; but a check of the
associated discussion shows that there was no actual user request for
that, I'd just done it because it was easy and seemed more symmetric.
The amount of code involved here seems way more than is justified by
that one case, so I think we should just take it out and lose the
"feature". (I did think about whether GET DIAGNOSTICS could be
reimplemented on top of the new infrastructure, but it wouldn't be
easy because we don't have a SQL-expression representation of the
GET DIAGNOSTICS values. Moreover, going in that direction would add
an expression evaluation, making GET DIAGNOSTICS slower. So I think
we should just drop it.)
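The corner case in question looks like this (a sketch):

```sql
DO $$
DECLARE counts int[];
BEGIN
  PERFORM 1 FROM generate_series(1, 5);
  -- GET DIAGNOSTICS with an array element as target was the one
  -- remaining way to reach the ARRAYELEM code; the patch drops
  -- support for this form
  GET DIAGNOSTICS counts[1] = ROW_COUNT;
  RAISE NOTICE '%', counts;
END $$;
```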

regards, tom lane

Attachments:

v1-0001-add-raw-parse-modes.patch (text/x-diff) +64 -69
v1-0002-add-spi-prepare-extended.patch (text/x-diff) +196 -9
v1-0003-fix-plpgsql-expressions.patch (text/x-diff) +214 -106
v1-0004-rewrite-plpgsql-assignment.patch (text/x-diff) +507 -35
v1-0005-add-docs-and-tests.patch (text/x-diff) +364 -22
v1-0006-remove-arrayelem-datums.patch (text/x-diff) +26 -286
#7 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tom Lane (#6)
Re: Rethinking plpgsql's assignment implementation

On Sun, Dec 13, 2020 at 22:41 Tom Lane <tgl@sss.pgh.pa.us> wrote:


It is a really great patch. I did a quick check and didn't find any
functionality issues:

--
-- Name: footype; Type: TYPE; Schema: public; Owner: pavel
--

CREATE TYPE public.footype AS (
a integer,
b integer
);

ALTER TYPE public.footype OWNER TO pavel;

--
-- Name: bootype; Type: TYPE; Schema: public; Owner: pavel
--

CREATE TYPE public.bootype AS (
a integer,
f public.footype
);

ALTER TYPE public.bootype OWNER TO pavel;

--
-- Name: cootype; Type: TYPE; Schema: public; Owner: pavel
--

CREATE TYPE public.cootype AS (
a integer,
b integer[]
);

ALTER TYPE public.cootype OWNER TO pavel;

--
-- Name: dootype; Type: TYPE; Schema: public; Owner: pavel
--

CREATE TYPE public.dootype AS (
a integer,
b public.footype,
c public.footype[]
);

ALTER TYPE public.dootype OWNER TO pavel;

--
-- PostgreSQL database dump complete
--

postgres=# do $$
<<lab>>
declare
a footype[];
b bootype;
ba bootype[];
c cootype[];
d dootype[];
x int default 1;
begin
a[10] := row(10,20);
a[11] := (30,40);
a[3] := (0,0);
a[3].a := 100;
raise notice '%', a;
b.a := 100;
b.f.a := 1000;
raise notice '%', b;
ba[0] := b;

ba[0].a = 33; ba[0].f := row(33,33);
lab.ba[0].f.a := 1000000;
raise notice '%', ba;
c[0].a := 10000;
c[0].b := ARRAY[1,2,4];
lab.c[0].b[1] := 10000;
raise notice '% %', c, c[0].b[x];

d[0].a := 100;
d[0].b.a := 101;
d[0].c[x+1].a := 102;
raise notice '%', d;
end;
$$;
NOTICE: [3:11]={"(100,0)",NULL,NULL,NULL,NULL,NULL,NULL,"(10,20)","(30,40)"}
NOTICE: (100,"(1000,)")
NOTICE: [0:0]={"(33,\"(1000000,33)\")"}
NOTICE: [0:0]={"(10000,\"{10000,2,4}\")"} 10000
NOTICE: [0:0]={"(100,\"(101,)\",\"[2:2]={\"\"(102,)\"\"}\")"}
DO

Regards

Pavel


#8Pavel Stehule
pavel.stehule@gmail.com
In reply to: Pavel Stehule (#7)
Re: Rethinking plpgsql's assignment implementation

Hi

I checked performance, and it looks like access to a record's fields is
faster, but access to array elements is significantly slower.

do $$
declare
a int[];
aux int;
rep boolean default true;
begin
for i in 1..5000
loop
a[i]:= 5000 - i;
end loop;

raise notice '%', a[1:10];

while rep
loop
rep := false;
for i in 1..5000
loop
if a[i] > a[i+1] then
aux := a[i];
a[i] := a[i+1]; a[i+1] := aux;
rep := true;
end if;
end loop;
end loop;

raise notice '%', a[1:10];

end;
$$;

This code is about 3x slower than master (40 sec vs 12 sec). I believe
this is a worst-case scenario.

I tested pi calculation

CREATE OR REPLACE FUNCTION pi_est_1(n int)
RETURNS numeric AS $$
DECLARE
accum double precision DEFAULT 1.0;
c1 double precision DEFAULT 2.0;
c2 double precision DEFAULT 1.0;
BEGIN
FOR i IN 1..n
LOOP
accum := accum * ((c1 * c1) / (c2 * (c2 + 2.0)));
c1 := c1 + 2.0;
c2 := c2 + 2.0;
END LOOP;
RETURN accum * 2.0;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION pi_est_2(n int)
RETURNS numeric AS $$
DECLARE
accum double precision DEFAULT 1.0;
c1 double precision DEFAULT 2.0;
c2 double precision DEFAULT 1.0;
BEGIN
FOR i IN 1..n
LOOP
accum := accum * ((c1 * c1) / (c2 * (c2 + double precision '2.0')));
c1 := c1 + double precision '2.0';
c2 := c2 + double precision '2.0';
END LOOP;
RETURN accum * double precision '2.0';
END;
$$ LANGUAGE plpgsql;

And the performance is 10% slower than on master

An interesting point: master is about 5% faster than PG 13.

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Pavel Stehule (#8)
Re: Rethinking plpgsql's assignment implementation

Pavel Stehule <pavel.stehule@gmail.com> writes:

I checked performance, and it looks like access to a record's fields is
faster, but access to array elements is significantly slower.

Hmm, I'd drawn the opposite conclusion in my own testing ...

for i in 1..5000
loop
if a[i] > a[i+1] then
aux := a[i];
a[i] := a[i+1]; a[i+1] := aux;
rep := true;
end if;
end loop;

... but I now see that I'd not checked cases like "a[i] := a[j]".
exec_check_rw_parameter() is being too conservative about whether
it can optimize a case like that. The attached incremental patch
fixes it.
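For reference, a minimal script exercising that pattern might look like
this (a sketch; the array contents and loop bounds are arbitrary):

```sql
do $$
declare
  a int[] := array[3, 1, 2];
  j int := 1;
begin
  -- Assigning one element of "a" from another element of "a": the source
  -- expression references the target variable, which the old
  -- exec_check_rw_parameter() logic was too conservative to optimize.
  for i in 2..3 loop
    a[i] := a[j];
  end loop;
  raise notice '%', a;  -- {3,3,3}
end
$$;
```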

I tested pi calculation
...
And the performance is 10% slower than on master

Can't reproduce that here. For the record, I get the following
timings (medians of three runs) for your test cases:

HEAD:

sort: Time: 13974.709 ms (00:13.975)
pi_est_1(10000000): Time: 3537.482 ms (00:03.537)
pi_est_2(10000000): Time: 3546.557 ms (00:03.547)

Patch v1:

sort: Time: 47053.892 ms (00:47.054)
pi_est_1(10000000): Time: 3456.078 ms (00:03.456)
pi_est_2(10000000): Time: 3451.347 ms (00:03.451)

+ exec_check_rw_parameter fix:

sort: Time: 12199.724 ms (00:12.200)
pi_est_1(10000000): Time: 3357.955 ms (00:03.358)
pi_est_2(10000000): Time: 3367.526 ms (00:03.368)

I'm inclined to think that the differences in the pi calculation
timings are mostly chance effects; there's certainly no reason
why exec_check_rw_parameter should affect that test case at all.

regards, tom lane

Attachments:

v1-0007-performance-fix.patch (text/x-diff; +4/-9)
#10Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tom Lane (#9)
Re: Rethinking plpgsql's assignment implementation

On Mon 14. 12. 2020 at 17:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:

... but I now see that I'd not checked cases like "a[i] := a[j]".
exec_check_rw_parameter() is being too conservative about whether
it can optimize a case like that. The attached incremental patch
fixes it.

The performance patch helps a lot for the sort test; with the patch it is
5-10% faster than master (10864 ms vs 12122 ms).

I probably found the reason why the patched version was slower.

I used

CFLAGS="-fno-omit-frame-pointer -Wall -Wmissing-prototypes
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard
-g -O2 -Werror=switch"

With these options the pi test was slower; with the default options, there
is no difference.

So this can be a very good feature: the new code has the same speed or is faster.

Regards

Pavel


#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Pavel Stehule (#10)
Re: Rethinking plpgsql's assignment implementation

I realized that the speedup patch I posted yesterday is flawed: it's
too aggressive about applying the R/W param mechanism, instead of
not aggressive enough.

To review, the point of that logic is that if we have an assignment
like
arrayvar := array_append(arrayvar, some-scalar-expression);
a naive implementation would have array_append construct an entire
new array, which we'd then have to copy into plpgsql's variable
storage. Instead, if the array variable is in expanded-array
format (which plpgsql encourages it to be) then we can pass the
array parameter as a "read/write expanded datum", which array_append
recognizes as license to scribble right on its input and return the
modified input; that takes only O(1) time not O(N). Then plpgsql's
assignment code notices that the expression result datum is the same
pointer already stored in the variable, so it does nothing.
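For illustration, the classic pattern that benefits from this optimization
(a sketch; the loop bound is arbitrary):

```sql
do $$
declare
  arrayvar int[] := '{}';
begin
  -- With the R/W expanded-datum optimization, each array_append call
  -- extends arrayvar in place, so the whole loop is O(N) overall rather
  -- than O(N^2) from copying the array on every iteration.
  for i in 1..100000 loop
    arrayvar := array_append(arrayvar, i);
  end loop;
  raise notice 'built array of % elements', array_length(arrayvar, 1);
end
$$;
```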

With the patch at hand, a subscripted assignment a[i] := x becomes,
essentially,
a := subscriptingref(a, i, x);
and we need to make the same sort of transformation to allow
array_set_element to scribble right on the original value of "a"
instead of making a copy.

However, we can't simply not consider the source expression "x",
as I proposed yesterday. For example, if we have
a := subscriptingref(a, i, f(array_append(a, x)));
it's not okay for array_append() to scribble on "a". The R/W
param mechanism normally disallows any additional references to
the target variable, which would prevent this error, but I broke
that safety check with the 0007 patch.

After thinking about this awhile, I decided that plpgsql's R/W param
mechanism is really misdesigned. Instead of requiring the assignment
source expression to be such that *all* its references to the target
variable could be passed as R/W, we really want to identify *one*
reference to the target variable to be passed as R/W, allowing any other
ones to be passed read/only as they would be by default. As long as the
R/W reference is a direct argument to the top-level (hence last to be
executed) function in the expression, there is no harm in R/O references
being passed to other lower parts of the expression. Nor is there any
use-case for more than one argument of the top-level function being R/W.

So the attached rewrite of the 0007 patch reimplements that logic to
identify one single Param that references the target variable, and
make only that Param pass a read/write reference, not any other
Params referencing the target variable. This is a good change even
without considering the assignment-reimplementation proposal, because
even before this patchset we could have cases like
arrayvar := array_append(arrayvar, arrayvar[i]);
The existing code would be afraid to optimize this, but it's in fact
safe.

I also re-attach the 0001-0006 patches, which have not changed, just
to keep the cfbot happy.

regards, tom lane

Attachments:

v1-0001-add-raw-parse-modes.patch (text/x-diff; +64/-69)
v1-0002-add-spi-prepare-extended.patch (text/x-diff; +196/-9)
v1-0003-fix-plpgsql-expressions.patch (text/x-diff; +214/-106)
v1-0004-rewrite-plpgsql-assignment.patch (text/x-diff; +507/-35)
v1-0005-add-docs-and-tests.patch (text/x-diff; +364/-22)
v1-0006-remove-arrayelem-datums.patch (text/x-diff; +26/-286)
v2-0007-rethink-rw-param-mechanism.patch (text/x-diff; +101/-94)
#12Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tom Lane (#11)
Re: Rethinking plpgsql's assignment implementation

On Tue 15. 12. 2020 at 21:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So the attached rewrite of the 0007 patch reimplements that logic to
identify one single Param that references the target variable, and
make only that Param pass a read/write reference, not any other
Params referencing the target variable.

I ran some performance tests and it looks very good.

#13Pavel Stehule
pavel.stehule@gmail.com
In reply to: Pavel Stehule (#12)
Re: Rethinking plpgsql's assignment implementation

Hi

I repeated the tests. I wrote a set of simple functions. It is a synthetic
test, but I think it can identify potential problems well.

I calculated the average of 3 cycles and checked the performance of each
function. I didn't find any problems. The total execution time is good too:
the patched code is about 11% faster than master (14 sec vs 15.8 sec). So there is
important new functionality with nice performance benefits.

make check-world passed

Regards

Pavel

Attachments:

plpgsql-perftest.sql (application/sql)
#14Pavel Stehule
pavel.stehule@gmail.com
In reply to: Pavel Stehule (#13)
Re: Rethinking plpgsql's assignment implementation

On Sat 26. 12. 2020 at 19:00, Pavel Stehule <pavel.stehule@gmail.com> wrote:

I played with the plpgsql_check tests, and again I didn't find any significant
issue with this patch. I am very satisfied with the implementation.

Now the behavior of SELECT INTO lags behind the assignment statement, and this
fact should be documented. Usually we don't need to use array fields as
targets there, but somebody may try it.

Regards

Pavel


#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Pavel Stehule (#14)
Re: Rethinking plpgsql's assignment implementation

Pavel Stehule <pavel.stehule@gmail.com> writes:

Now the behavior of SELECT INTO lags behind the assignment statement, and this
fact should be documented. Usually we don't need to use array fields as
targets there, but somebody may try it.

It's been behind all along --- this patch didn't really change that.
But I don't mind documenting it more clearly.

regards, tom lane

#16Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tom Lane (#15)
Re: Rethinking plpgsql's assignment implementation

On Mon 28. 12. 2020 at 0:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Pavel Stehule <pavel.stehule@gmail.com> writes:

Now the behavior of SELECT INTO lags behind the assignment statement, and this
fact should be documented. Usually we don't need to use array fields as
targets there, but somebody may try it.

It's been behind all along --- this patch didn't really change that.
But I don't mind documenting it more clearly.

ok

Pavel


#17Pavel Stehule
pavel.stehule@gmail.com
In reply to: Pavel Stehule (#16)
Re: Rethinking plpgsql's assignment implementation

Hi

I am continuing the review.

I found an inconsistency in how assignments to array slices are handled
(this is not directly related to this patch, but it may be interesting,
because with the new functionality array slices may be edited more often).

a = array[1,2,3,4,5];
a[1:5] = 10; -- correctly fails, although some people might find it more
natural for this to set a[1] through a[5] to the value 10

a[1:5] = NULL; -- does nothing: no failure, no value change. Is that correct?

a[1:5] = ARRAY[1]; -- correctly fails with ERROR: source array too small

but

a[1:5] = ARRAY[1,2,3,4,5,6]; -- this statement works, but the 6 is ignored.
Is that correct? I expected "source array too big"

Moreover, this behavior is not documented.
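The slice cases above can be reproduced with a short DO block (a sketch
against the patched plpgsql; the exact error wording may differ):

```sql
do $$
declare
  a int[] := array[1,2,3,4,5];
begin
  -- Slice assignment from a longer source array succeeds; only the first
  -- five elements are used and the extra element 6 is silently dropped.
  a[1:5] := array[1,2,3,4,5,6];
  raise notice '%', a;  -- {1,2,3,4,5}
end
$$;
```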

Everything else looks good: all tests passed, and in my benchmarks I don't
see any slowdowns, so I'll mark this patch as ready for committer.

Regards

Pavel

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Pavel Stehule (#17)
Re: Rethinking plpgsql's assignment implementation

Pavel Stehule <pavel.stehule@gmail.com> writes:

I found an inconsistency in how assignments to array slices are handled
(this is not directly related to this patch, but it may be interesting,
because with the new functionality array slices may be edited more often).

a = array[1,2,3,4,5];
a[1:5] = 10; -- correctly fails, although some people might find it more
natural for this to set a[1] through a[5] to the value 10
a[1:5] = NULL; -- does nothing: no failure, no value change. Is that correct?
a[1:5] = ARRAY[1]; -- correctly fails with ERROR: source array too small
but
a[1:5] = ARRAY[1,2,3,4,5,6]; -- this statement works, but the 6 is ignored.
Is that correct? I expected "source array too big"

Hm. All of these behaviors have existed for a long time in the context
of UPDATE statements:

regression=# create table t1 (a int[]);
CREATE TABLE
regression=# insert into t1 values(array[1,2,3,4,5]);
INSERT 0 1
regression=# table t1;
a
-------------
{1,2,3,4,5}
(1 row)

regression=# update t1 set a[1:5] = 10;
ERROR: subscripted assignment to "a" requires type integer[] but expression is of type integer
regression=# update t1 set a[1:5] = null;
UPDATE 1
regression=# table t1;
a
-------------
{1,2,3,4,5}
(1 row)

(Note that in this example, the null is implicitly typed as int[];
so it's not like the prior example.)

regression=# update t1 set a[1:5] = array[1];
ERROR: source array too small
regression=# update t1 set a[1:5] = array[1,2,3,4,6,5];
UPDATE 1
regression=# table t1;
a
-------------
{1,2,3,4,6}
(1 row)

I agree this is inconsistent, but given the way this patch works,
we'd have to change UPDATE's behavior if we want plpgsql to do
something different. Not sure if we can get away with that.

anything other looks well, all tests passed, and in my benchmarks I don't
see any slowdowns , so I'll mark this patch as ready for committer

Thanks!

regards, tom lane

#19Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tom Lane (#18)
Re: Rethinking plpgsql's assignment implementation

On Sun 3. 1. 2021 at 19:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hm. All of these behaviors have existed for a long time in the context
of UPDATE statements:
...
regression=# update t1 set a[1:5] = null;
UPDATE 1

(Note that in this example, the null is implicitly typed as int[];
so it's not like the prior example.)

I understand

...

I agree this is inconsistent, but given the way this patch works,
we'd have to change UPDATE's behavior if we want plpgsql to do
something different. Not sure if we can get away with that.

Yes, UPDATE should be changed. This is not a very important corner case,
but any inconsistency can be confusing for users.

I don't see any interesting use case for the current behavior, but it is a
corner case.

anything other looks well, all tests passed, and in my benchmarks I don't
see any slowdowns , so I'll mark this patch as ready for committer

Thanks!

with pleasure

Regards

Pavel


#20Pavel Stehule
pavel.stehule@gmail.com
In reply to: Pavel Stehule (#19)
Re: Rethinking plpgsql's assignment implementation

Hi

Now I am testing the jsonb subscripting feature, and I found one case that
is not supported by the parser.

When the target is a scalar, all is OK. But we can also have a plpgsql array
of jsonb values.

postgres=# do $$
declare j jsonb[];
begin
j[1] = '{"b":"Ahoj"}';
raise notice '%', j;
raise notice '%', (j[1])['b'];
end
$$;
NOTICE: {"{\"b\": \"Ahoj\"}"}
NOTICE: "Ahoj"
DO

Parentheses work well in expressions, but are not supported on the left
side of an assignment.

postgres=# do $$
declare j jsonb[];
begin
(j[1])['b'] = '"Ahoj"';
raise notice '%', j;
raise notice '%', j[1]['b'];
end
$$;
ERROR: syntax error at or near "("
LINE 4: (j[1])['b'] = '"Ahoj"';
^
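A possible workaround (a sketch, relying only on the scalar-target case
shown above working) is to go through a temporary scalar variable:

```sql
do $$
declare
  j jsonb[];
  tmp jsonb;
begin
  -- The parenthesized form (j[1])['b'] is rejected as an assignment
  -- target, so update a scalar copy and store it back into the array.
  tmp := coalesce(j[1], '{}'::jsonb);
  tmp['b'] := '"Ahoj"';
  j[1] := tmp;
  raise notice '%', j[1]['b'];  -- "Ahoj"
end
$$;
```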

Regards

Pavel

#21Pavel Stehule
pavel.stehule@gmail.com
In reply to: Pavel Stehule (#20)