making update/delete of inheritance trees scale better
Here is a sketch for implementing the design that Tom described here:
/messages/by-id/357.1550612935@sss.pgh.pa.us
In short, we would like to have only one plan from which ModifyTable
gets the tuples to update/delete, not N plans for N child result
relations as is done currently.
I suppose things are the way they are because creating a separate plan
for each result relation makes the job of the ModifyTable node very
simple, which is currently this:
1. Take the plan's output tuple and extract the tupleid of the tuple to
update/delete in the currently active result relation,
2. If delete, go to 3; else if update, filter out the junk columns
from the above tuple,
3. Call ExecUpdate()/ExecDelete() on the result relation with the new
tuple, if any.
If we make ModifyTable do a little more work for the inheritance case,
we can create only one plan but without "expanding" the targetlist.
That is, it will contain entries only for attributes that are assigned
values in the SET clause. This makes the plan reusable across result
relations, because all child relations must have those attributes,
even though the attribute numbers might be different. Anyway, the
work that ModifyTable will now have to do is this:
1. Take the plan's output tuple and extract the tupleid of the tuple to
update/delete and the "tableoid" junk column,
2. Select the result relation to operate on using the tableoid,
3. If delete, go to 4; else if update, fetch the tuple identified by
the tupleid from the result relation and fill in the unassigned columns
using that "old" tuple, also filtering out the junk columns,
4. Call ExecUpdate()/ExecDelete() on the result relation with the new
tuple, if any.
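To make the control flow concrete, here is a toy sketch of the per-row
steps above. This is illustrative Python, not PostgreSQL code; all names
(exec_modify_table, the dict-based "heap", etc.) are invented for the
example, and the real logic would live in nodeModifyTable.c:

```python
# Hedged sketch of the proposed per-row ModifyTable flow. Rows carry
# "ctid" and "tableoid" junk attributes; everything else is a SET column.

def exec_modify_table(plan_rows, result_rels, operation):
    # Map each result relation's OID to its index, mirroring the
    # tableoid -> result-relation hash table described later.
    oid_to_index = {rel["oid"]: i for i, rel in enumerate(result_rels)}
    for row in plan_rows:
        # 1. Extract the junk attributes: tupleid and tableoid.
        tupleid = row.pop("ctid")
        tableoid = row.pop("tableoid")
        # 2. Select the result relation using the tableoid.
        rel = result_rels[oid_to_index[tableoid]]
        if operation == "DELETE":
            # 4. (delete case) No new tuple needs to be formed.
            rel["deleted"].append(tupleid)
        else:
            # 3. Fetch the old tuple and fill in the unassigned columns;
            # the junk columns were already popped off above.
            old = rel["heap"][tupleid]
            new_tuple = {**old, **row}
            # 4. Store the new tuple back into the relation.
            rel["heap"][tupleid] = new_tuple
```

Note how the plan's output row only mentions the assigned columns plus the
junk attributes; the per-relation "old" tuple supplies the rest.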
I do think that doing this would be worthwhile even if it slightly
increases ModifyTable's per-row overhead, because the planning
overhead of the current approach is very significant, especially for
partition trees with more than a couple of thousand partitions. As an
illustration of how bad the problem is, trying to create a generic plan
for `update foo set ... where key = $1`, where foo has over 2000
partitions, causes OOM even on a machine with 6GB of memory.
The one plan shared by all result relations will be the same as the one
we would get if the query were a SELECT, except it will contain junk
attributes, such as the ctid needed to identify tuples, and a new
"tableoid" junk attribute if multiple result relations will be present
due to inheritance. One major way in which this targetlist differs from
those of the current per-result-relation plans is that it won't be
passed through expand_targetlist(), because the set of unassigned
attributes may not be the same among children. As mentioned above,
those missing attributes will be filled in by ModifyTable doing some
extra work, whereas previously they would have come with the plan's
output tuple.
For child result relations that are foreign tables, their FDW adds
junk attribute(s) to the query's targetlist by updating it in place
(AddForeignUpdateTargets). However, as the child tables will no
longer get their own parsetree, we must use some hack around this
interface to obtain the foreign-table-specific junk attributes and add
them to the original/parent query's targetlist. Assuming that all or
most of the children will belong to the same FDW, we will end up with
only a handful of such junk columns in the final targetlist. I am not
sure if it's worthwhile to change the API of AddForeignUpdateTargets
to require FDWs not to scribble on the passed-in parsetree as part of
this patch.
As for how ModifyTable will create the new tuple for updates, I have
decided to use a ProjectionInfo for each result relation, which
projects a full, *clean* tuple ready to be put back into the relation.
When projecting, the plan's output tuple serves as the OUTER tuple and
the old tuple fetched to fill the unassigned attributes serves as the
SCAN tuple. By having this ProjectionInfo also serve as the "junk
filter", we don't need JunkFilters. The targetlist that this projection
computes is the same as that of the result-relation-specific plan.
Initially, I thought to generate this "expanded" targetlist in
ExecInitModifyTable(), but as that can be somewhat expensive, doing it
only once in the planner seemed like a good idea. These
per-result-relation targetlists are carried in the ModifyTable node.
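The projection described above can be sketched as follows. Again this is
illustrative Python with invented names (build_projection, project), not
the actual ProjectionInfo machinery; it only shows how each output
attribute is sourced from either the OUTER (plan output) or SCAN (old
tuple) slot, with junk columns dropping out naturally:

```python
# Per-result-relation "expanded" targetlist: each attribute of the clean
# output tuple comes either from the plan's output tuple (OUTER, for
# columns assigned in SET) or from the old tuple fetched by tupleid
# (SCAN, for unassigned columns). Junk columns such as ctid simply never
# appear in the targetlist, so no separate JunkFilter is needed.

def build_projection(rel_attrs, assigned):
    """For each attribute name, record which input slot supplies it."""
    return [("OUTER", a) if a in assigned else ("SCAN", a)
            for a in rel_attrs]

def project(tlist, outer_tuple, scan_tuple):
    slots = {"OUTER": outer_tuple, "SCAN": scan_tuple}
    return {attr: slots[src][attr] for src, attr in tlist}
```

For a child with columns (a, b, c) where only b is assigned, the
targetlist maps b to OUTER and a, c to SCAN; the attribute positions can
differ per child even though the SET clause is shared.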
To identify the result relation from the tuple produced by the plan,
the "tableoid" junk column will be used. As the tuples for different
result relations won't necessarily come out in the order in which
result relations are laid out in the ModifyTable node, we need a way
to map the tableoid value to result relation indexes. I have decided
to use a hash table here.
A couple of things that I haven't thought very hard about yet, but may
revisit later:
* We will no longer be able to use the DirectModify APIs to push updates
to remote servers for foreign child result relations
* Over in [1], I have said that we get run-time pruning for free for
ModifyTable because the plan we are using is the same as that for
SELECT, although now I think that I hadn't thought that through. With
the PoC patch that I have:
prepare q as update foo set a = 250001 where a = $1;
set plan_cache_mode to 'force_generic_plan';
explain execute q(1);
QUERY PLAN
--------------------------------------------------------------------
Update on foo (cost=0.00..142.20 rows=40 width=14)
Update on foo_1
Update on foo_2 foo
Update on foo_3 foo
Update on foo_4 foo
-> Append (cost=0.00..142.20 rows=40 width=14)
Subplans Removed: 3
-> Seq Scan on foo_1 (cost=0.00..35.50 rows=10 width=14)
Filter: (a = $1)
(9 rows)
While it's true that we will never have to actually update foo_2,
foo_3, and foo_4, ModifyTable still sets up its ResultRelInfos, which
ideally it shouldn't. Maybe we'll need to do something about that
after all.
I will post the patch shortly.
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
[1]: /messages/by-id/CA+HiwqGXmP3-S9y=OQHyJyeWnZSOmcxBGdgAMWcLUOsnPTL88w@mail.gmail.com
On Fri, May 8, 2020 at 7:03 PM Amit Langote <amitlangote09@gmail.com> wrote:
> I do think that doing this would be worthwhile even if it slightly
> increases ModifyTable's per-row overhead, because the planning
> overhead of the current approach is very significant, especially for
> partition trees with more than a couple of thousand partitions. As an
> illustration of how bad the problem is, trying to create a generic plan
> for `update foo set ... where key = $1`, where foo has over 2000
> partitions, causes OOM even on a machine with 6GB of memory.
Per-row overhead would be incurred for every row, whereas the plan-time
overhead is one-time, or in the case of a prepared statement, almost
free. So we need to compare the two, especially when there are 2000
partitions and all of them are being updated. But generally I agree
that this would be a better approach. It might help use PWJ when the
result relation joins with another partitioned table. I am not sure
whether that effectively happens today with partition pruning. More on
this later.
> For child result relations that are foreign tables, their FDW adds
> junk attribute(s) to the query's targetlist by updating it in place
> (AddForeignUpdateTargets). However, as the child tables will no
> longer get their own parsetree, we must use some hack around this
> interface to obtain the foreign-table-specific junk attributes and add
> them to the original/parent query's targetlist. Assuming that all or
> most of the children will belong to the same FDW, we will end up with
> only a handful of such junk columns in the final targetlist.
What happens if there's a mixture of foreign and local partitions, or a
mixture of FDWs? Injecting junk columns from all FDWs into the top-level
target list will cause errors because those attributes won't be
available everywhere.
> To identify the result relation from the tuple produced by the plan,
> the "tableoid" junk column will be used. As the tuples for different
> result relations won't necessarily come out in the order in which
> result relations are laid out in the ModifyTable node, we need a way
> to map the tableoid value to result relation indexes. I have decided
> to use a hash table here.
Can we plan the scan query to add a sort node to order the rows by tableoid?
> * We will no longer be able to use the DirectModify APIs to push
> updates to remote servers for foreign child result relations
If we convert the whole DML into partitionwise DML (just as happens
today, unintentionally), we should be able to use DirectModify. PWJ
will help there. And even if we can only detect that the scan underlying
a particular partition can be evaluated completely on the same node
where the partition resides, we should still be able to use
DirectModify. But if we are not able to support this optimization, the
queries that benefit from it today won't perform well. I think we need
to think about this now instead of leaving it for later. Otherwise,
make it so that we use the old way when there are foreign partitions
and the new way otherwise.
> * Over in [1], I have said that we get run-time pruning for free for
> ModifyTable because the plan we are using is the same as that for
> SELECT, although now I think that I hadn't thought that through. [...]
> While it's true that we will never have to actually update foo_2,
> foo_3, and foo_4, ModifyTable still sets up its ResultRelInfos, which
> ideally it shouldn't. Maybe we'll need to do something about that
> after all.
* Tuple re-routing during UPDATE. For now it's disabled, so your design
should work. But we shouldn't design this feature in such a way that it
gets in the way of enabling tuple re-routing in the future :).
--
Best Wishes,
Ashutosh Bapat
Hi Ashutosh,
Thanks for chiming in.
On Mon, May 11, 2020 at 9:58 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
On Fri, May 8, 2020 at 7:03 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Per-row overhead would be incurred for every row, whereas the
> plan-time overhead is one-time, or in the case of a prepared
> statement, almost free. So we need to compare the two, especially
> when there are 2000 partitions and all of them are being updated.
I assume that such UPDATEs would be uncommon.
> But generally I agree that this would be a better approach. It might
> help use PWJ when the result relation joins with another partitioned
> table.
It does, because the plan below ModifyTable is the same as if the query
were SELECT instead of UPDATE; with my PoC:
explain (costs off) update foo set a = foo2.a + 1 from foo foo2 where
foo.a = foo2.a;
QUERY PLAN
--------------------------------------------------
Update on foo
Update on foo_1
Update on foo_2
-> Append
-> Merge Join
Merge Cond: (foo_1.a = foo2_1.a)
-> Sort
Sort Key: foo_1.a
-> Seq Scan on foo_1
-> Sort
Sort Key: foo2_1.a
-> Seq Scan on foo_1 foo2_1
-> Merge Join
Merge Cond: (foo_2.a = foo2_2.a)
-> Sort
Sort Key: foo_2.a
-> Seq Scan on foo_2
-> Sort
Sort Key: foo2_2.a
-> Seq Scan on foo_2 foo2_2
(20 rows)
as opposed to what you get today:
explain (costs off) update foo set a = foo2.a + 1 from foo foo2 where
foo.a = foo2.a;
QUERY PLAN
--------------------------------------------------
Update on foo
Update on foo_1
Update on foo_2
-> Merge Join
Merge Cond: (foo_1.a = foo2.a)
-> Sort
Sort Key: foo_1.a
-> Seq Scan on foo_1
-> Sort
Sort Key: foo2.a
-> Append
-> Seq Scan on foo_1 foo2
-> Seq Scan on foo_2 foo2_1
-> Merge Join
Merge Cond: (foo_2.a = foo2.a)
-> Sort
Sort Key: foo_2.a
-> Seq Scan on foo_2
-> Sort
Sort Key: foo2.a
-> Append
-> Seq Scan on foo_1 foo2
-> Seq Scan on foo_2 foo2_1
(23 rows)
> What happens if there's a mixture of foreign and local partitions, or
> a mixture of FDWs? Injecting junk columns from all FDWs into the
> top-level target list will cause errors because those attributes
> won't be available everywhere.
That is a good question and something I have struggled with ever since
I started thinking about implementing this.
For the problem that FDWs may inject junk columns that could neither
be present in local tables (root parent and other local children) nor
other FDWs, I couldn't think of any solution other than to restrict
what those junk columns can be -- to require them to be either "ctid",
"wholerow", or a set of only *inherited* user columns. I think that's
what Tom was getting at when he said the following in the email I
cited in my first email:
"...It gets a bit harder if the tree contains some foreign tables,
because they might have different concepts of row identity, but I'd
think in most cases you could still combine those into a small number
of output columns."
Maybe I misunderstood what Tom said, but I can't imagine how to let
these junk columns be anything that *all* tables contained in an
inheritance tree, especially the root parent, cannot emit, if they are
to be emitted out of a single plan.
> Can we plan the scan query to add a sort node to order the rows by
> tableoid?
Hmm, I am afraid that there may be some piece of partitioning code that
assumes a certain order of result relations, and that order is not
based on sorted tableoids.
> If we convert the whole DML into partitionwise DML (just as happens
> today, unintentionally), we should be able to use DirectModify. PWJ
> will help there. And even if we can only detect that the scan
> underlying a particular partition can be evaluated completely on the
> same node where the partition resides, we should still be able to use
> DirectModify.
I remember Fujita-san mentioned something like this, but I haven't
looked into how feasible it would be given the current DirectModify
interface.
> But if we are not able to support this optimization, the queries that
> benefit from it today won't perform well. I think we need to think
> about this now instead of leaving it for later. Otherwise, make it so
> that we use the old way when there are foreign partitions and the new
> way otherwise.
I would very much like to find a solution for this, which hopefully
isn't to fall back to using the old way.
> * Tuple re-routing during UPDATE. For now it's disabled, so your
> design should work. But we shouldn't design this feature in such a
> way that it gets in the way of enabling tuple re-routing in the
> future :).
Sorry, what is tuple re-routing and why does this new approach get in its way?
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
On Mon, May 11, 2020 at 8:58 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> What happens if there's a mixture of foreign and local partitions, or
> a mixture of FDWs? Injecting junk columns from all FDWs into the
> top-level target list will cause errors because those attributes
> won't be available everywhere.
I think that we're talking about a plan like this:
Update
-> Append
-> a bunch of children
I believe that what you'd want to have happen here is for each child to
emit the row identity columns that it knows about, and emit NULL for
the others. Then when you do the Append you end up with a row format
that includes all the individual identity columns, but for any
particular tuple, only one set of such columns is populated and the
others are all NULL. There doesn't seem to be any execution-time
problem with such a representation, but there might be a planning-time
problem with building it, because when you're writing a tlist for the
Append node, what varattno are you going to use for the columns that
exist only in one particular child and not the others? The fact that
setrefs processing happens so late seems like an annoyance in this
case.
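The NULL-padded union row format Robert describes can be shown with a
toy sketch. This is illustrative Python, not PostgreSQL code; the
function names and the "remote_rowid" identity column are invented for
the example:

```python
# Each child emits the row-identity columns it knows about; the Append's
# output row format is the union of all children's identity columns, and
# any column a given child lacks is padded with NULL (None here).

def append_row_format(children_identity_cols):
    # Union of identity columns across children, in first-seen order.
    union = []
    for cols in children_identity_cols:
        for c in cols:
            if c not in union:
                union.append(c)
    return union

def pad_child_row(row, union_cols):
    # Missing identity columns come out as None (SQL NULL).
    return {c: row.get(c) for c in union_cols}
```

For any particular tuple, only the identity columns of the child that
produced it are non-NULL, which is exactly the execution-time behavior
described above.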
Maybe it would be easier to have one Update node per kind of row
identity, i.e. if there's more than one such notion then...
Placeholder
-> Update
-> Append
-> all children with one notion of row identity
-> Update
-> Append
-> all children with another notion of row identity
...and so forth.
But I'm not sure.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I believe that what you'd want to have happen here is for each child
> to emit the row identity columns that it knows about, and emit NULL
> for the others. Then when you do the Append you end up with a row
> format that includes all the individual identity columns, but for any
> particular tuple, only one set of such columns is populated and the
> others are all NULL.
Yeah, that was what I'd imagined in my earlier thinking about this.
> There doesn't seem to be any execution-time problem with such a
> representation, but there might be a planning-time problem with
> building it,
Possibly. We manage to cope with not-all-alike children now, of course,
but I think it might be true that no one plan node has Vars from
dissimilar children. Even so, the Vars are self-identifying, so it
seems like this ought to be soluble.
regards, tom lane
On Mon, May 11, 2020 at 2:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > There doesn't seem to be any execution-time problem with such a
> > representation, but there might be a planning-time problem with
> > building it,
> Possibly. We manage to cope with not-all-alike children now, of
> course, but I think it might be true that no one plan node has Vars
> from dissimilar children. Even so, the Vars are self-identifying, so
> it seems like this ought to be soluble.
If the parent is RTI 1, and the children are RTIs 2..6, what
varno/varattno will we use in RTI 1's tlist to represent a column that
exists in both RTI 2 and RTI 3 but not in RTI 1, 4, 5, or 6?
I suppose the answer is 2 - or 3, but I guess we'd pick the first
child as the representative of the class. We surely can't use varno 1,
because then there's no varattno that makes any sense. But if we use
2, now we have the tlist for RTI 1 containing expressions with a
child's RTI as the varno. I could be wrong, but I think that's going
to make setrefs.c throw up and die, and I wouldn't be very surprised
if there were a bunch of other things that crashed and burned, too. I
think we have quite a bit of code that expects to be able to translate
between parent-rel expressions and child-rel expressions, and that's
going to be pretty problematic here.
Maybe your answer is - let's just fix all that stuff. That could well
be right, but my first reaction is to think that it sounds hard.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> If the parent is RTI 1, and the children are RTIs 2..6, what
> varno/varattno will we use in RTI 1's tlist to represent a column
> that exists in both RTI 2 and RTI 3 but not in RTI 1, 4, 5, or 6?
Fair question. We don't have any problem representing the column
as it exists in any one of those children, but we lack a notation
for the "union" or whatever you want to call it, except in the case
where the parent relation has a corresponding column. Still, this
doesn't seem that hard to fix. My inclination would be to invent
dummy parent-rel columns (possibly with negative attnums? not sure if
that'd be easier or harder than adding them in the positive direction)
to represent such "union" columns. This concept would only need to
exist within the planner I think, since after setrefs.c there'd be no
trace of those dummy columns.
> I think we have quite a bit of code that expects to be able to
> translate between parent-rel expressions and child-rel expressions,
> and that's going to be pretty problematic here.
... shrug. Sure, we'll need to be able to do that mapping. Why will
it be any harder than any other parent <-> child mapping? The planner
would know darn well what the mapping is while it's inventing the
dummy columns, so it just has to keep that info around for use later.
> Maybe your answer is - let's just fix all that stuff. That could well
> be right, but my first reaction is to think that it sounds hard.
I have to think that it'll net out as less code, and certainly less
complicated code, than trying to extend inheritance_planner in its
current form to do what we wish it'd do.
regards, tom lane
On Mon, May 11, 2020 at 4:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Fair question. We don't have any problem representing the column
> as it exists in any one of those children, but we lack a notation
> for the "union" or whatever you want to call it, except in the case
> where the parent relation has a corresponding column. Still, this
> doesn't seem that hard to fix. My inclination would be to invent
> dummy parent-rel columns (possibly with negative attnums? not sure if
> that'd be easier or harder than adding them in the positive direction)
> to represent such "union" columns.
Ah, that makes sense. If we can invent dummy columns on the parent
rel, then most of what I was worrying about no longer seems very
worrying.
I'm not sure what's involved in inventing such dummy columns, though.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 11, 2020 at 8:11 PM Amit Langote <amitlangote09@gmail.com> wrote:
> I assume that such UPDATEs would be uncommon.
Yes, 2000 partitions being updated would be rare. But many rows from
the same partition being updated may not be that uncommon. We have to
know how large that per-row overhead is and how many rows must be
updated before it outweighs the planning-time savings. If that number
of rows is very large, we are good.
> It does, because the plan below ModifyTable is the same as if the
> query were SELECT instead of UPDATE; with my PoC: [...]
Wonderful. That looks good.
> > Can we plan the scan query to add a sort node to order the rows by
> > tableoid?
> Hmm, I am afraid that there may be some piece of partitioning code
> that assumes a certain order of result relations, and that order is
> not based on sorted tableoids.
I am suggesting that we override that order (if any) in
create_modifytable_path() or create_modifytable_plan() by explicitly
ordering the incoming paths on tableoid, maybe using MergeAppend.
> > * Tuple re-routing during UPDATE. For now it's disabled, so your
> > design should work. [...]
> Sorry, what is tuple re-routing and why does this new approach get in
> its way?
An UPDATE causing a tuple to move to a different partition. It would
get in its way, since the tuple will be located based on tableoid,
which will be the oid of the old partition. But I think this approach
has a higher chance of eventually being able to solve that problem
than the current approach does.
--
Best Wishes,
Ashutosh Bapat
On Tue, May 12, 2020 at 5:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Fair question. [...] My inclination would be to invent dummy
> > parent-rel columns (possibly with negative attnums? not sure if
> > that'd be easier or harder than adding them in the positive
> > direction) to represent such "union" columns.
> Ah, that makes sense. If we can invent dummy columns on the parent
> rel, then most of what I was worrying about no longer seems very
> worrying.
IIUC, the idea is to have "dummy" columns in the top parent's
reltarget for every junk TLE added to the top-level targetlist by
child tables' FDWs that the top parent itself can't emit. But we allow
these FDW junk TLEs to contain any arbitrary expression, not just
plain Vars [1], so what node type are these dummy parent columns? I
can see from add_vars_to_targetlist() that we allow only Vars and
PlaceHolderVars to be present in a relation's reltarget->exprs, but
neither of those seems suitable for the task.
Once we get something in the parent's reltarget->exprs representing
these child expressions, from there they go back into child
reltargets, so it would appear that our appendrel transformation code
must somehow be taught to deal with these dummy columns.
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
[1]: https://www.postgresql.org/docs/current/fdw-callbacks.html#FDW-CALLBACKS-UPDATE
"...If the extra expressions are more complex than simple Vars, they
must be run through eval_const_expressions before adding them to the
targetlist."
Amit Langote <amitlangote09@gmail.com> writes:
> On Tue, May 12, 2020 at 5:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Ah, that makes sense. If we can invent dummy columns on the parent
> > rel, then most of what I was worrying about no longer seems very
> > worrying.
> IIUC, the idea is to have "dummy" columns in the top parent's
> reltarget for every junk TLE added to the top-level targetlist by
> child tables' FDWs that the top parent itself can't emit. But we allow
> these FDW junk TLEs to contain any arbitrary expression, not just
> plain Vars [1], so what node type are these dummy parent columns?
We'd have to group the children into groups that share the same
row-identity column type. This is why I noted way-back-when that
it'd be a good idea to discourage FDWs from being too wild about
what they use for row identity.
(Also, just to be totally clear: I am *not* envisioning this as a
mechanism for FDWs to inject whatever computations they darn please
into query trees. It's for the row identity needed by UPDATE/DELETE,
and nothing else. That being the case, it's hard to understand why
the bottom-level Vars wouldn't be just plain Vars --- maybe "system
column" Vars or something like that, but still just Vars, not
expressions.)
regards, tom lane
On Wed, 13 May 2020 at 00:54, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
On Mon, May 11, 2020 at 8:11 PM Amit Langote <amitlangote09@gmail.com> wrote:
Per row overhead would be incurred for every row whereas the plan time
overhead is one-time or in case of a prepared statement almost free.
So we need to compare it esp. when there are 2000 partitions and all
of them are being updated.
I assume that such UPDATEs would be uncommon.
Yes, 2000 partitions being updated would be rare. But many rows from
the same partition being updated may not be that common. We have to
know how much that per-row overhead is and how many updated rows it
takes to beat the planning time overhead. If the number of rows is
very large, we are good.
Rows from a non-parallel Append should arrive in order. If you were
worried about the performance of finding the correct ResultRelInfo for
the tuple that we just got, then we could just cache the tableOid and
ResultRelInfo for the last row, and if that tableoid matches on this
row, just use the same ResultRelInfo as last time. That'll save
doing the hash table lookup in all cases, apart from when the Append
changes to the next child subplan. Not sure exactly how that'll fit
in with the foreign table discussion that's going on here though.
Another option would be to not use tableoid and instead inject an INT4
Const (0 to nsubplans) into each subplan's targetlist that serves as
the index into an array of ResultRelInfos.
As for which ResultRelInfos to initialize, couldn't we just have the
planner generate an OidList of all the ones that we could need.
Basically, all the non-pruned partitions. Perhaps we could even be
pretty lazy about building those ResultRelInfos during execution too.
We'd need to grab the locks first, but, without staring at the code, I
doubt there's a reason we'd need to build them all upfront. That
would help in cases where pruning didn't prune much, but due to
something else in the WHERE clause, the results only come from some
small subset of partitions.
David
On Tue, May 12, 2020 at 9:54 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
On Mon, May 11, 2020 at 8:11 PM Amit Langote <amitlangote09@gmail.com> wrote:
Per row overhead would be incurred for every row whereas the plan time
overhead is one-time or in case of a prepared statement almost free.
So we need to compare it esp. when there are 2000 partitions and all
of them are being updated.
I assume that such UPDATEs would be uncommon.
Yes, 2000 partitions being updated would be rare. But many rows from
the same partition being updated may not be that common. We have to
know how much that per-row overhead is and how many updated rows it
takes to beat the planning time overhead. If the number of rows is
very large, we are good.
Maybe I am misunderstanding you, but the more the rows to update, the
more overhead we will be paying with the new approach.
Can we plan the scan query to add a sort node to order the rows by tableoid?
Hmm, I am afraid that some piece of partitioning code assumes a
certain order of result relations, and that order is not based on
sorting tableoids.
I am suggesting that we override that order (if any) in
create_modifytable_path() or create_modifytable_plan() by explicitly
ordering the incoming paths on tableoid. May be using MergeAppend.
So, we will need to do 2 things:
1. Implicitly apply an ORDER BY tableoid clause
2. Add result relation RTIs to ModifyTable.resultRelations in the
order of their RTE's relid.
Maybe we can do that as a separate patch. Also, I am not sure if it
will get in the way of someone wanting to have ORDER BY LIMIT for
updates.
* Tuple re-routing during UPDATE. For now it's disabled so your design
should work. But we shouldn't design this feature in such a way that
it comes in the way to enable tuple re-routing in future :).
Sorry, what is tuple re-routing and why does this new approach get in its way?
An UPDATE causing a tuple to move to a different partition. It would
get in its way since the tuple will be located based on tableoid,
which will be the oid of the old partition. But I think this approach
has higher chance of being able to solve that problem eventually
rather than the current approach.
Again, I don't think I understand. We do currently (as of v11)
re-route tuples when UPDATE causes them to move to a different
partition, which, gladly, continues to work with my patch.
So how it works is like this: for a given "new" tuple, ExecUpdate()
checks if the tuple would violate the partition constraint of the
result relation that was passed along with the tuple. If it does, the
new tuple will be moved, by calling ExecDelete() to delete it from the
current relation, followed by ExecInsert() to find the new home for
the tuple. The only thing that changes with the new approach is how
ExecModifyTable() chooses a result relation to pass to ExecUpdate()
for a given "new" tuple it has fetched from the plan, which is quite
independent from the tuple re-routing mechanism proper.
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
On Wed, May 13, 2020 at 8:52 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 13 May 2020 at 00:54, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
On Mon, May 11, 2020 at 8:11 PM Amit Langote <amitlangote09@gmail.com> wrote:
Per row overhead would be incurred for every row whereas the plan time
overhead is one-time or in case of a prepared statement almost free.
So we need to compare it esp. when there are 2000 partitions and all
of them are being updated.
I assume that such UPDATEs would be uncommon.
Yes, 2000 partitions being updated would be rare. But many rows from
the same partition being updated may not be that common. We have to
know how much that per-row overhead is and how many updated rows it
takes to beat the planning time overhead. If the number of rows is
very large, we are good.
Rows from a non-parallel Append should arrive in order. If you were
worried about the performance of finding the correct ResultRelInfo for
the tuple that we just got, then we could just cache the tableOid and
ResultRelInfo for the last row, and if that tableoid matches on this
row, just use the same ResultRelInfo as last time. That'll save
doing the hash table lookup in all cases, apart from when the Append
changes to the next child subplan.
That would be a more common case, yes. Not when a join is involved though.
Not sure exactly how that'll fit
in with the foreign table discussion that's going on here though.
The foreign table discussion is concerned with what the single
top-level targetlist should look like, given that different result
relations may require different row-identifying junk columns, due to
possibly belonging to different FDWs. Currently that's not a thing to
worry about, because each result relation has its own plan and hence
its own targetlist.
Another option would be to not use tableoid and instead inject an INT4
Const (0 to nsubplans) into each subplan's targetlist that serves as
the index into an array of ResultRelInfos.
That may be a bit fragile, considering how volatile that number
(result relation index) can be if you figure in run-time pruning, but
maybe worth considering.
As for which ResultRelInfos to initialize, couldn't we just have the
planner generate an OidList of all the ones that we could need.
Basically, all the non-pruned partitions.
Why would replacing list of RT indexes by OIDs be better?
Perhaps we could even be
pretty lazy about building those ResultRelInfos during execution too.
We'd need to grab the locks first, but, without staring at the code, I
doubt there's a reason we'd need to build them all upfront. That
would help in cases where pruning didn't prune much, but due to
something else in the WHERE clause, the results only come from some
small subset of partitions.
Late ResultRelInfo initialization is worth considering, given that
doing it for tuple-routing target relations works. I don't know why
we are still initializing them all in InitPlan(), because the only
justification given for doing so that I know of is that it prevents
lock-upgrade. I think we discussed somewhat recently that that is not
really a hazard.
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
On Tue, May 12, 2020 at 10:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Langote <amitlangote09@gmail.com> writes:
On Tue, May 12, 2020 at 5:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
Ah, that makes sense. If we can invent dummy columns on the parent
rel, then most of what I was worrying about no longer seems very
worrying.
IIUC, the idea is to have "dummy" columns in the top parent's
reltarget for every junk TLE added to the top-level targetlist by
child tables' FDWs that the top parent itself can't emit. But we allow
these FDW junk TLEs to contain any arbitrary expression, not just
plain Vars [1], so what node type are these dummy parent columns?
We'd have to group the children into groups that share the same
row-identity column type. This is why I noted way-back-when that
it'd be a good idea to discourage FDWs from being too wild about
what they use for row identity.
I understood the part about having a dummy parent column for each
group of children that use the same junk attribute. I think we must
group them using resname + row-identity Var type though, not just the
latter, because during execution, the FDWs look up the junk columns by
name. If two FDWs add junk Vars of the same type, say, 'tid', but use
different resname, say, "ctid" and "rowid", respectively, we must add
two dummy parent columns.
(Also, just to be totally clear: I am *not* envisioning this as a
mechanism for FDWs to inject whatever computations they darn please
into query trees. It's for the row identity needed by UPDATE/DELETE,
and nothing else. That being the case, it's hard to understand why
the bottom-level Vars wouldn't be just plain Vars --- maybe "system
column" Vars or something like that, but still just Vars, not
expressions.)
I suppose we would need to explicitly check that and cause an error if
the contained expression is not a plain Var. Neither the interface
we've got nor the documentation discourages them from putting just
about any expression into the junk TLE.
Based on an off-list chat with Robert, I started looking into whether
it would make sense to drop the middleman Append (or MergeAppend)
altogether, if only to avoid having to invent a representation for
parent targetlist that is never actually computed. However, it's not
hard to imagine that any new book-keeping code to manage child plans,
even though perhaps cheaper in terms of cycles spent than
inheritance_planner(), would add complexity to the main planner. It
would also be a shame to lose useful functionality that we get by
having an Append present, such as run-time pruning and partitionwise
joins.
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
On Wed, 13 May 2020 at 19:02, Amit Langote <amitlangote09@gmail.com> wrote:
As for which ResultRelInfos to initialize, couldn't we just have the
planner generate an OidList of all the ones that we could need.
Basically, all the non-pruned partitions.
Why would replacing list of RT indexes by OIDs be better?
TBH, I didn't refresh my memory of the code before saying that.
However, if we have a list of RT indexes whose range table entries we
must build ResultRelInfos for, then why is it a problem that plan-time
pruning is not allowing you to eliminate the excess ResultRelInfos,
like you mentioned in:
On Sat, 9 May 2020 at 01:33, Amit Langote <amitlangote09@gmail.com> wrote:
prepare q as update foo set a = 250001 where a = $1;
set plan_cache_mode to 'force_generic_plan';
explain execute q(1);
QUERY PLAN
--------------------------------------------------------------------
Update on foo (cost=0.00..142.20 rows=40 width=14)
Update on foo_1
Update on foo_2 foo
Update on foo_3 foo
Update on foo_4 foo
-> Append (cost=0.00..142.20 rows=40 width=14)
Subplans Removed: 3
-> Seq Scan on foo_1 (cost=0.00..35.50 rows=10 width=14)
Filter: (a = $1)
(9 rows)
Shouldn't you just be setting the ModifyTablePath.resultRelations to
the non-pruned RT indexes?
Perhaps we could even be
pretty lazy about building those ResultRelInfos during execution too.
We'd need to grab the locks first, but, without staring at the code, I
doubt there's a reason we'd need to build them all upfront. That
would help in cases where pruning didn't prune much, but due to
something else in the WHERE clause, the results only come from some
small subset of partitions.
Late ResultRelInfo initialization is worth considering, given that
doing it for tuple-routing target relations works. I don't know why
we are still initializing them all in InitPlan(), because the only
justification given for doing so that I know of is that it prevents
lock-upgrade. I think we discussed somewhat recently that that is not
really a hazard.
Looking more closely at ExecGetRangeTableRelation(), we'll already
have the lock by that time, there's an Assert to verify that too.
It'll have been acquired either during planning or during
AcquireExecutorLocks(). So it seems doing anything for delaying the
building of ResultRelInfos wouldn't need to account for taking the
lock at a different time.
David
On Thu, May 14, 2020 at 7:55 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 13 May 2020 at 19:02, Amit Langote <amitlangote09@gmail.com> wrote:
As for which ResultRelInfos to initialize, couldn't we just have the
planner generate an OidList of all the ones that we could need.
Basically, all the non-pruned partitions.
Why would replacing list of RT indexes by OIDs be better?
TBH, I didn't refresh my memory of the code before saying that.
However, if we have a list of RT indexes whose range table entries we
must build ResultRelInfos for, then why is it a problem that plan-time
pruning is not allowing you to eliminate the excess ResultRelInfos,
like you mentioned in:
On Sat, 9 May 2020 at 01:33, Amit Langote <amitlangote09@gmail.com> wrote:
prepare q as update foo set a = 250001 where a = $1;
set plan_cache_mode to 'force_generic_plan';
explain execute q(1);
QUERY PLAN
--------------------------------------------------------------------
Update on foo (cost=0.00..142.20 rows=40 width=14)
Update on foo_1
Update on foo_2 foo
Update on foo_3 foo
Update on foo_4 foo
-> Append (cost=0.00..142.20 rows=40 width=14)
Subplans Removed: 3
-> Seq Scan on foo_1 (cost=0.00..35.50 rows=10 width=14)
Filter: (a = $1)
(9 rows)
Shouldn't you just be setting the ModifyTablePath.resultRelations to
the non-pruned RT indexes?
Oh, that example is showing run-time pruning for a generic plan. If
the planner prunes partitions, of course, their result relation indexes
are not present in ModifyTablePath.resultRelations.
Perhaps we could even be
pretty lazy about building those ResultRelInfos during execution too.
We'd need to grab the locks first, but, without staring at the code, I
doubt there's a reason we'd need to build them all upfront. That
would help in cases where pruning didn't prune much, but due to
something else in the WHERE clause, the results only come from some
small subset of partitions.
Late ResultRelInfo initialization is worth considering, given that
doing it for tuple-routing target relations works. I don't know why
we are still initializing them all in InitPlan(), because the only
justification given for doing so that I know of is that it prevents
lock-upgrade. I think we discussed somewhat recently that that is not
really a hazard.
Looking more closely at ExecGetRangeTableRelation(), we'll already
have the lock by that time, there's an Assert to verify that too.
It'll have been acquired either during planning or during
AcquireExecutorLocks(). So it seems doing anything for delaying the
building of ResultRelInfos wouldn't need to account for taking the
lock at a different time.
Yep, I think it might be worthwhile to delay ResultRelInfo building
for UPDATE/DELETE too. I would like to leave that for another patch
though.
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
On Wed, May 13, 2020 at 9:21 AM Amit Langote <amitlangote09@gmail.com> wrote:
Maybe I am misunderstanding you, but the more the rows to update, the
more overhead we will be paying with the new approach.
Yes, that's right. How much is that compared to the current planning
overhead? How many rows does it take for that overhead to be
comparable to the current planning overhead?
But let's not sweat that point much right now.
So, we will need to do 2 things:
1. Implicitly apply an ORDER BY tableoid clause
2. Add result relation RTIs to ModifyTable.resultRelations in the
order of their RTE's relid.
Maybe we can do that as a separate patch. Also, I am not sure if it
will get in the way of someone wanting to have ORDER BY LIMIT for
updates.
It won't. But maybe David's idea is better.
* Tuple re-routing during UPDATE. For now it's disabled so your design
should work. But we shouldn't design this feature in such a way that
it comes in the way to enable tuple re-routing in future :).
Sorry, what is tuple re-routing and why does this new approach get in its way?
An UPDATE causing a tuple to move to a different partition. It would
get in its way since the tuple will be located based on tableoid,
which will be the oid of the old partition. But I think this approach
has higher chance of being able to solve that problem eventually
rather than the current approach.
Again, I don't think I understand. We do currently (as of v11)
re-route tuples when UPDATE causes them to move to a different
partition, which, gladly, continues to work with my patch.
Ah! Ok. I missed that part then.
So how it works is like this: for a given "new" tuple, ExecUpdate()
checks if the tuple would violate the partition constraint of the
result relation that was passed along with the tuple. If it does, the
new tuple will be moved, by calling ExecDelete() to delete it from the
current relation, followed by ExecInsert() to find the new home for
the tuple. The only thing that changes with the new approach is how
ExecModifyTable() chooses a result relation to pass to ExecUpdate()
for a given "new" tuple it has fetched from the plan, which is quite
independent from the tuple re-routing mechanism proper.
Thanks for the explanation.
--
Best Wishes,
Ashutosh Bapat
So, I think I have a patch that seems to work, but not all the way,
more on which below.
Here is the commit message in the attached patch.
===
Subject: [PATCH] Overhaul UPDATE's targetlist processing
Instead of emitting the full tuple matching the target table's tuple
descriptor, make the plan emit only the attributes that are assigned
values in the SET clause, plus row-identity junk attributes as before.
This allows us to avoid making a separate plan for each target
relation in the inheritance case, because the only reason it is so
currently is to account for the fact that each target relation may
have a set of attributes that is different from the others. Having only
one plan suffices, because the set of assigned attributes must be the
same in all the result relations.
While the plan will now produce only the assigned attributes and
row-identity junk attributes, other columns' values are filled by
refetching the old tuple. To that end, there will be a targetlist for
each target relation to compute the full tuple, that is, by combining
the values from the plan tuple and the old tuple, but they are passed
separately in the ModifyTable node.
Implementation notes:
* In the inheritance case, as the same plan produces tuples to be
updated from multiple result relations, the tuples now need to also
identify which table they come from, so an additional junk attribute
"tableoid" is present in that case.
* Considering that the inheritance set may contain foreign tables that
require a different (set of) row-identity junk attribute(s), the plan
needs to emit multiple distinct junk attributes. When transposed to a
child scan node, this targetlist emits a non-NULL value for the junk
attribute that's valid for the child relation and NULL for others.
* Executor and FDW execution APIs can no longer assume any specific
order in which the result relations will be processed. For each
tuple to be updated/deleted, result relation is selected by looking it
up in a hash table using the "tableoid" value as the key.
* Since the plan does not emit values for all the attributes, FDW APIs
may not assume that the individual column values in the TupleTableSlot
containing the plan tuple are accessible by their attribute numbers.
TODO:
* Reconsider having only one plan!
* Update FDW handler docs to reflect the API changes
===
Regarding the first TODO, it is to address the limitation that FDWs
will no longer be able to push the *whole* child UPDATE/DELETE query
down to the remote server, including any joins, which is allowed at
the moment via the PlanDirectModify API. The API seems to have been designed
with an assumption that the child scan/join node is the top-level
plan, but that's no longer the case. If we consider bypassing the
Append and allow ModifyTable to access the child scan/join nodes
directly, maybe we can allow that. I haven't updated the expected
output of postgres_fdw regression tests for now pending this.
A couple of things in the patch that I feel slightly uneasy about:
* Result relations are now appendrel children in the planner.
Normally, any wholerow Vars in the child relation's reltarget->exprs
get a ConvertRowtypeExpr added on top to convert it back to the parent's
reltype, because that's what the client expects in the SELECT case.
In the result relation case, the executor expects to see child
wholerow Vars themselves, not their parent versions.
* FDW's ExecForeignUpdate() API expects that the NEW tuple passed to it
match the target foreign table reltype, so that it can access the
target attributes in the tuple by attribute numbers. Considering that
the plan no longer builds the full tuple itself, I made the executor
satisfy that expectation by filling the missing attributes' values
using the target table's wholerow attribute. That is, we now *always*
fetch the wholerow attributes for UPDATE, not just when there are
row-level triggers that need it. I think that's unfortunate. Maybe,
the correct way is asking the FDWs to translate (setrefs.c style) the
target attribute numbers appropriately to access the plan's output
tuple.
I will add the patch to the next CF. I haven't yet fully checked the
performance considerations of the new approach, but will do so in the
coming days.
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v1-0001-Overhaul-UPDATE-s-targetlist-processing.patch
On Tue, Jun 2, 2020 at 1:15 PM Amit Langote <amitlangote09@gmail.com> wrote:
So, I think I have a patch that seems to work, but not all the way,
more on which below.
Here is the commit message in the attached patch.
===
Subject: [PATCH] Overhaul UPDATE's targetlist processing
Instead of emitting the full tuple matching the target table's tuple
descriptor, make the plan emit only the attributes that are assigned
values in the SET clause, plus row-identity junk attributes as before.
This allows us to avoid making a separate plan for each target
relation in the inheritance case, because the only reason it is so
currently is to account for the fact that each target relation may
have a set of attributes that is different from the others. Having only
one plan suffices, because the set of assigned attributes must be the
same in all the result relations.
While the plan will now produce only the assigned attributes and
row-identity junk attributes, other columns' values are filled by
refetching the old tuple. To that end, there will be a targetlist for
each target relation to compute the full tuple, that is, by combining
the values from the plan tuple and the old tuple, but they are passed
separately in the ModifyTable node.
Implementation notes:
* In the inheritance case, as the same plan produces tuples to be
updated from multiple result relations, the tuples now need to also
identify which table they come from, so an additional junk attribute
"tableoid" is present in that case.
* Considering that the inheritance set may contain foreign tables that
require a different (set of) row-identity junk attribute(s), the plan
needs to emit multiple distinct junk attributes. When transposed to a
child scan node, this targetlist emits a non-NULL value for the junk
attribute that's valid for the child relation and NULL for others.
* Executor and FDW execution APIs can no longer assume any specific
order in which the result relations will be processed. For each
tuple to be updated/deleted, result relation is selected by looking it
up in a hash table using the "tableoid" value as the key.
* Since the plan does not emit values for all the attributes, FDW APIs
may not assume that the individual column values in the TupleTableSlot
containing the plan tuple are accessible by their attribute numbers.
TODO:
* Reconsider having only one plan!
* Update FDW handler docs to reflect the API changes
===
I divided that into two patches:
1. Make the plan producing tuples to be updated emit only the columns
that are actually updated. postgres_fdw test fails unless you also
apply the patch I posted at [1], because there is an unrelated bug in
UPDATE tuple routing code that manifests due to some changes of this
patch.
2. Due to 1, inheritance_planner() is no longer needed, that is,
inherited update/delete can be handled by pulling the rows to
update/delete from only one plan, not one per child result relation.
This one makes that so.
There are some unsolved problems having to do with foreign tables in
both 1 and 2:
In 1, FDW update APIs still assume that the plan produces a "full" tuple
for update. That needs to be fixed so that FDWs deal with getting
only the updated columns in the plan's output targetlist.
In 2, still haven't figured out a way to call PlanDirectModify() on
child foreign tables. Lacking that, inherited updates on foreign
tables are now slower, because they are not pushed down. I'd like to
figure something out to fix that situation.
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
[1]: /messages/by-id/CA+HiwqE_UK1jTSNrjb8mpTdivzd3dum6mK--xqKq0Y9VmfwWQA@mail.gmail.com