ON CONFLICT DO UPDATE for partitioned tables

Started by Alvaro Herreraabout 8 years ago53 messageshackers
Jump to latest
#1Alvaro Herrera
alvherre@2ndquadrant.com

I updated Amit Langote's patch for INSERT ON CONFLICT DO UPDATE[1]/messages/by-id/c1651d5b-7bd6-b7e7-e1cc-16ecfe2c0da5@lab.ntt.co.jp.
Following the lead of edd44738bc88 ("Be lazier about partition tuple
routing.") this incarnation only does the necessary push-ups for the
specific partition that needs it, at execution time. As far as I can
tell, it works as intended.

I chose to refuse the case where the DO UPDATE clause causes the tuple
to move to another partition (i.e. you're updating the partition key of
the tuple). While it's probably possible to implement that, it doesn't
seem a very productive use of time.

However, there is a shortcoming in the design: it fails if there are
multiple levels of partitioning, because there is no easy (efficient)
way to map the index OID more than one level. I had already mentioned
this shortcoming to Amit's patch. So this case (which I added in the
regression tests) fails unexpectedly:

-- multiple-layered partitioning
create table parted_conflict_test (a int primary key, b text) partition by range (a);
create table parted_conflict_test_1 partition of parted_conflict_test
for values from (0) to (10000) partition by range (a);
create table parted_conflict_test_1_1 partition of parted_conflict_test_1
for values from (0) to (100);
insert into parted_conflict_test values ('10', 'ten');
insert into parted_conflict_test values ('10', 'ten two')
on conflict (a) do update set b = excluded.b;
ERROR: invalid ON CONFLICT DO UPDATE specification
DETAIL: An inferred index was not found in partition "parted_conflict_test_1_1".

So the problem here is that my MapPartitionIndexList() implementation is
too stupid. I think it would be smarter to use the ResultRelInfo
instead of bare Relation, for one. But that still doesn't answer how to
find a "path" from root to leaf partition, which is what I'd need to
verify that there are valid pg_inherits relationships for the partition
indexes. I'm probably missing something about the partitionDesc or
maybe the partitioned_rels lists that helps me do it efficiently, but I
hope figure out soon.

One idea I was toying with is to add RelationData->rd_children as a list
of OIDs of children relations. So it should be possible to walk the
list from the root to the desired descendant, without having to scan
pg_inherits repeatedly ... although that would probably require doing
relation_open() for every index, which sounds undesirable.

(ISTM that having RelationData->rd_children might be a good idea in
general anyway -- I mean to speed up some code that currently scans
pg_inherits via find_inheritance_children. However, since the partition
descriptor itself is already in relcache, maybe this doesn't matter too
much.)

Another idea is to abandon the notion that we need to find a path from
parent index to descendant index, and just run the inference algorithm
again on the partition. I'm not sure how much I like this idea, yet.

Anyway, while this is still WIP, I think it works correctly for the case
where there is a single partition level.

[1]: /messages/by-id/c1651d5b-7bd6-b7e7-e1cc-16ecfe2c0da5@lab.ntt.co.jp

--
�lvaro Herrer

Attachments:

v1-0001-fix-ON-CONFLICT-DO-UPDATE-for-partitioned-tables.patchtext/plain; charset=us-asciiDownload+311-19
#2Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Alvaro Herrera (#1)
Re: ON CONFLICT DO UPDATE for partitioned tables

On 2018/02/28 9:46, Alvaro Herrera wrote:

I updated Amit Langote's patch for INSERT ON CONFLICT DO UPDATE[1].
Following the lead of edd44738bc88 ("Be lazier about partition tuple
routing.") this incarnation only does the necessary push-ups for the
specific partition that needs it, at execution time. As far as I can
tell, it works as intended.

Thanks. I too have been meaning to send an updated (nay, significantly
rewritten) version of the patch, but you beat me to it.

I chose to refuse the case where the DO UPDATE clause causes the tuple
to move to another partition (i.e. you're updating the partition key of
the tuple). While it's probably possible to implement that, it doesn't
seem a very productive use of time.

Probably a good idea to save it for later.

However, there is a shortcoming in the design: it fails if there are
multiple levels of partitioning, because there is no easy (efficient)
way to map the index OID more than one level. I had already mentioned
this shortcoming to Amit's patch. So this case (which I added in the
regression tests) fails unexpectedly:

-- multiple-layered partitioning
create table parted_conflict_test (a int primary key, b text) partition by range (a);
create table parted_conflict_test_1 partition of parted_conflict_test
for values from (0) to (10000) partition by range (a);
create table parted_conflict_test_1_1 partition of parted_conflict_test_1
for values from (0) to (100);
insert into parted_conflict_test values ('10', 'ten');
insert into parted_conflict_test values ('10', 'ten two')
on conflict (a) do update set b = excluded.b;
ERROR: invalid ON CONFLICT DO UPDATE specification
DETAIL: An inferred index was not found in partition "parted_conflict_test_1_1".

So the problem here is that my MapPartitionIndexList() implementation is
too stupid. I think it would be smarter to use the ResultRelInfo
instead of bare Relation, for one. But that still doesn't answer how to
find a "path" from root to leaf partition, which is what I'd need to
verify that there are valid pg_inherits relationships for the partition
indexes. I'm probably missing something about the partitionDesc or
maybe the partitioned_rels lists that helps me do it efficiently, but I
hope figure out soon.

One idea I was toying with is to add RelationData->rd_children as a list
of OIDs of children relations. So it should be possible to walk the
list from the root to the desired descendant, without having to scan
pg_inherits repeatedly ... although that would probably require doing
relation_open() for every index, which sounds undesirable.

(ISTM that having RelationData->rd_children might be a good idea in
general anyway -- I mean to speed up some code that currently scans
pg_inherits via find_inheritance_children. However, since the partition
descriptor itself is already in relcache, maybe this doesn't matter too
much.)

Another idea is to abandon the notion that we need to find a path from
parent index to descendant index, and just run the inference algorithm
again on the partition. I'm not sure how much I like this idea, yet.

Anyway, while this is still WIP, I think it works correctly for the case
where there is a single partition level.

Actually, after your comment on my original patch [1]/messages/by-id/20171227225031.osh6vunpuhsath25@alvherre.pgsql, I did make it work
for multiple levels by teaching the partition initialization code to find
a given partition's indexes that are inherited from the root table (that
is the table mentioned in command). So, after a tuple is routed to a
partition, we switch from the original arbiterIndexes list to the one we
created for the partition, which must contain OIDs corresponding to those
in the original list. After all, for each of the parent's indexes that
the planner put into the original arbiterIndexes list, there must exist an
index in each of the leaf partitions.

I had also observed when working on the patch that various TupleTableSlots
used by the ON CONFLICT DO UPDATE code must be based on TupleDesc of the
inheritance-translated target list (DO UPDATE SET target list). In fact,
that has to take into account also the dropped columns; we may have
dropped columns either in parent or in a partition or in both at same or
different attnum positions. That means, simple map_partition_varattnos()
translation doesn't help in this case.

For example, with your patch (sorry, I know you said it's a WIP patch), I
see either a crash or errors when dealing with such differing attribute
numbers:

drop table p;
create table p (a int, b text) partition by list (a);
create table p12 (b text, a int);
alter table p attach partition p12 for values in (1, 2);
alter table p drop b, add b text;
create table p4 partition of p for values in (4);
create unique index on p (a);

insert into p values (1, 'a') on conflict (a) do update set b = excluded.b;

insert into p values (1, 'b') on conflict (a) do update set b = excluded.b;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

insert into p values (4, 'a') on conflict (a) do update set b = excluded.b;

postgres=# insert into p values (4, 'b') on conflict (a) do update set b =
excluded.b;
ERROR: attribute number 3 exceeds number of columns 2

I attach my patch here for your reference, which I polished this morning
after seeing your email and the patch. It works for most of the cases, as
you can see in the updated tests in insert_conflict.sql. Since I agree
with you that we should, for now, error out if DO UPDATE causes row
movement, I adopted the code from your patch for that.

Thanks,
Amit

[1]: /messages/by-id/20171227225031.osh6vunpuhsath25@alvherre.pgsql
/messages/by-id/20171227225031.osh6vunpuhsath25@alvherre.pgsql

Attachments:

v1-0001-Fix-ON-CONFLICT-to-work-with-partitioned-tables.patchtext/plain; charset=UTF-8; name=v1-0001-Fix-ON-CONFLICT-to-work-with-partitioned-tables.patchDownload+405-65
#3Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#1)
Re: ON CONFLICT DO UPDATE for partitioned tables

On Tue, Feb 27, 2018 at 7:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

I updated Amit Langote's patch for INSERT ON CONFLICT DO UPDATE[1].
Following the lead of edd44738bc88 ("Be lazier about partition tuple
routing.") this incarnation only does the necessary push-ups for the
specific partition that needs it, at execution time. As far as I can
tell, it works as intended.

I chose to refuse the case where the DO UPDATE clause causes the tuple
to move to another partition (i.e. you're updating the partition key of
the tuple). While it's probably possible to implement that, it doesn't
seem a very productive use of time.

I would have thought that to be the only case we could support with
the current infrastructure. Doesn't a correct implementation for any
other case require a global index?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Robert Haas (#3)
Re: ON CONFLICT DO UPDATE for partitioned tables

On 2018/03/01 1:03, Robert Haas wrote:

On Tue, Feb 27, 2018 at 7:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

I updated Amit Langote's patch for INSERT ON CONFLICT DO UPDATE[1].
Following the lead of edd44738bc88 ("Be lazier about partition tuple
routing.") this incarnation only does the necessary push-ups for the
specific partition that needs it, at execution time. As far as I can
tell, it works as intended.

I chose to refuse the case where the DO UPDATE clause causes the tuple
to move to another partition (i.e. you're updating the partition key of
the tuple). While it's probably possible to implement that, it doesn't
seem a very productive use of time.

I would have thought that to be the only case we could support with
the current infrastructure. Doesn't a correct implementation for any
other case require a global index?

I'm thinking that Alvaro is talking here about the DO UPDATE action part,
not the conflict checking part. The latter will definitely require global
indexes if conflict were to be checked on columns not containing the
partition key.

The case Alvaro mentions arises after checking the conflict, presumably
using an inherited unique index whose keys must include the partition
keys. If the conflict action is DO UPDATE and its SET clause changes
partition key columns such that the row will have to change the partition,
then the current patch will result in an error. I think that's because
making update row movement work in this case will require some adjustments
to what 2f178441044 (Allow UPDATE to move rows between partitions)
implemented. We wouldn't have things set up in the ModifyTableState that
the current row-movement code depends on being set up; for example, there
wouldn't be per-subplan ResultRelInfo's in the ON CONFLICT DO UPDATE case.
The following Assert in ExecUpdate() will fail for instance:

map_index = resultRelInfo - mtstate->resultRelInfo;
Assert(map_index >= 0 && map_index < mtstate->mt_nplans);

Thanks,
Amit

#5Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#2)
Re: ON CONFLICT DO UPDATE for partitioned tables

Amit Langote wrote:

Actually, after your comment on my original patch [1], I did make it work
for multiple levels by teaching the partition initialization code to find
a given partition's indexes that are inherited from the root table (that
is the table mentioned in command). So, after a tuple is routed to a
partition, we switch from the original arbiterIndexes list to the one we
created for the partition, which must contain OIDs corresponding to those
in the original list. After all, for each of the parent's indexes that
the planner put into the original arbiterIndexes list, there must exist an
index in each of the leaf partitions.

Oh, your solution for this seems simple enough. Silly me, I was trying
to implement it in a quite roundabout way. Thanks. (I do wonder if we
should save the "root" reloid in the relcache).

I had also observed when working on the patch that various TupleTableSlots
used by the ON CONFLICT DO UPDATE code must be based on TupleDesc of the
inheritance-translated target list (DO UPDATE SET target list). In fact,
that has to take into account also the dropped columns; we may have
dropped columns either in parent or in a partition or in both at same or
different attnum positions. That means, simple map_partition_varattnos()
translation doesn't help in this case.

Yeah, I was aware these corner cases could become a problem though I
hadn't gotten around to testing them yet. Thanks for all your work on
this.

The usage of the few optimizer/prep/ functions that are currently static
doesn't fill me with joy. These functions have weird APIs because
they're static so we don't rally care, but once we export them we should
strive to be more careful. I'd rather stay away from just exporting
them all, so I chose to encapsulate those calls in a single function and
export only expand_targetlist from preptlist.c, keeping the others
static in prepunion.c. In the attached patch set, I put an API change
(work on tupdescs rather than full-blown relations) for a couple of
those functions as 0001, then your patch as 0002, then a few fixups of
my own. (0002 is not bit-by-bit identical to yours; I think I had to
fix some merge conflict with 0001, but should be pretty much the same).

But looking further, I think there is much cruft that has accumulated in
those functions (because they've gotten simplified over time), and we
could do some additional cleanup surgery. For example, there is no
reason to pass a list pointer to make_inh_translation_list(); we could
just return it. And then we don't have to cons up a fake AppendRelInfo
with all dummy values that adjust_inherited_tlist doesn't even care
about. I think there was a point for all these contortions back at some
point (visible by checking git history of this code), but it all seems
useless now.

Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused:
expand_targetlist() runs *after*, not before, so how could it have
affected the result? I'm obviously confused about what
expand_targetlist call this comment is talking about. Anyway I wanted
to make it use resjunk entries instead, but that broke some other case
that I didn't have time to research yesterday. I'll get back to this
soon, but in the meantime, here's what I have.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#2)
Re: ON CONFLICT DO UPDATE for partitioned tables

Amit Langote wrote:

Actually, after your comment on my original patch [1], I did make it work
for multiple levels by teaching the partition initialization code to find
a given partition's indexes that are inherited from the root table (that
is the table mentioned in command). So, after a tuple is routed to a
partition, we switch from the original arbiterIndexes list to the one we
created for the partition, which must contain OIDs corresponding to those
in the original list. After all, for each of the parent's indexes that
the planner put into the original arbiterIndexes list, there must exist an
index in each of the leaf partitions.

Oh, your solution for this seems simple enough. Silly me, I was trying
to implement it in a quite roundabout way. Thanks. (I do wonder if we
should save the "root" reloid in the relcache).

I had also observed when working on the patch that various TupleTableSlots
used by the ON CONFLICT DO UPDATE code must be based on TupleDesc of the
inheritance-translated target list (DO UPDATE SET target list). In fact,
that has to take into account also the dropped columns; we may have
dropped columns either in parent or in a partition or in both at same or
different attnum positions. That means, simple map_partition_varattnos()
translation doesn't help in this case.

Yeah, I was aware these corner cases could become a problem though I
hadn't gotten around to testing them yet. Thanks for all your work on
this.

The usage of the few optimizer/prep/ functions that are currently static
doesn't fill me with joy. These functions have weird APIs because
they're static so we don't rally care, but once we export them we should
strive to be more careful. I'd rather stay away from just exporting
them all, so I chose to encapsulate those calls in a single function and
export only expand_targetlist from preptlist.c, keeping the others
static in prepunion.c. In the attached patch set, I put an API change
(work on tupdescs rather than full-blown relations) for a couple of
those functions as 0001, then your patch as 0002, then a few fixups of
my own. (0002 is not bit-by-bit identical to yours; I think I had to
fix some merge conflict with 0001, but should be pretty much the same).

But looking further, I think there is much cruft that has accumulated in
those functions (because they've gotten simplified over time), and we
could do some additional cleanup surgery. For example, there is no
reason to pass a list pointer to make_inh_translation_list(); we could
just return it. And then we don't have to cons up a fake AppendRelInfo
with all dummy values that adjust_inherited_tlist doesn't even care
about. I think there was a point for all these contortions back at some
point (visible by checking git history of this code), but it all seems
useless now.

Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused:
expand_targetlist() runs *after*, not before, so how could it have
affected the result? I'm obviously confused about what
expand_targetlist call this comment is talking about. Anyway I wanted
to make it use resjunk entries instead, but that broke some other case
that I didn't have time to research yesterday. I'll get back to this
soon, but in the meantime, here's what I have.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v3-0001-Make-some-static-functions-work-on-TupleDesc-rath.patchtext/plain; charset=us-asciiDownload+27-21
v3-0002-Fix-ON-CONFLICT-to-work-with-partitioned-tables.patchtext/plain; charset=us-asciiDownload+404-60
v3-0003-fixups.patchtext/plain; charset=us-asciiDownload+115-93
#7Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Alvaro Herrera (#6)
Re: ON CONFLICT DO UPDATE for partitioned tables

On 2018/03/03 0:36, Alvaro Herrera wrote:

Amit Langote wrote:

Actually, after your comment on my original patch [1], I did make it work
for multiple levels by teaching the partition initialization code to find
a given partition's indexes that are inherited from the root table (that
is the table mentioned in command). So, after a tuple is routed to a
partition, we switch from the original arbiterIndexes list to the one we
created for the partition, which must contain OIDs corresponding to those
in the original list. After all, for each of the parent's indexes that
the planner put into the original arbiterIndexes list, there must exist an
index in each of the leaf partitions.

Oh, your solution for this seems simple enough. Silly me, I was trying
to implement it in a quite roundabout way. Thanks. (I do wonder if we
should save the "root" reloid in the relcache).

Do you mean store the root reloid for "any" partition (table or index)?

I had also observed when working on the patch that various TupleTableSlots
used by the ON CONFLICT DO UPDATE code must be based on TupleDesc of the
inheritance-translated target list (DO UPDATE SET target list). In fact,
that has to take into account also the dropped columns; we may have
dropped columns either in parent or in a partition or in both at same or
different attnum positions. That means, simple map_partition_varattnos()
translation doesn't help in this case.

Yeah, I was aware these corner cases could become a problem though I
hadn't gotten around to testing them yet. Thanks for all your work on
this.

The usage of the few optimizer/prep/ functions that are currently static
doesn't fill me with joy. These functions have weird APIs because
they're static so we don't rally care, but once we export them we should
strive to be more careful. I'd rather stay away from just exporting
them all, so I chose to encapsulate those calls in a single function and
export only expand_targetlist from preptlist.c, keeping the others
static in prepunion.c. In the attached patch set, I put an API change
(work on tupdescs rather than full-blown relations) for a couple of
those functions as 0001, then your patch as 0002, then a few fixups of
my own. (0002 is not bit-by-bit identical to yours; I think I had to
fix some merge conflict with 0001, but should be pretty much the same).

But looking further, I think there is much cruft that has accumulated in
those functions (because they've gotten simplified over time), and we
could do some additional cleanup surgery. For example, there is no
reason to pass a list pointer to make_inh_translation_list(); we could
just return it. And then we don't have to cons up a fake AppendRelInfo
with all dummy values that adjust_inherited_tlist doesn't even care
about. I think there was a point for all these contortions back at some
point (visible by checking git history of this code), but it all seems
useless now.

Yeah, I think it might as well work to fix up these interfaces the way
you've done.

Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused:
expand_targetlist() runs *after*, not before, so how could it have
affected the result? I'm obviously confused about what
expand_targetlist call this comment is talking about. Anyway I wanted
to make it use resjunk entries instead, but that broke some other case
that I didn't have time to research yesterday. I'll get back to this
soon, but in the meantime, here's what I have.

Hmm. I can imagine how the newly added comments in
adjust_inherited_tlist() may be confusing. With this patch, we're now
calling adjust_inherited_tlist() from the executor, whereas it was
designed only to be run in the planner prep phase. What we're passing to
it from the executor is the DO UPDATE SET's targetlist that has undergone
the expand_targetlist() treatment by the planner. Maybe, we need to
update the adjust_inherited_tlist() comments to reflect the expansion of
its scope due to this patch.

Some comments on 0003:

+ * Fist, fix the target entries' resnos, by using inheritance
translation.

First

+ appinfo.parent_reltype = InvalidOid; // parentRel->rd_rel->reltype;

I guess you won't retain that comment. :)

+ result_tl = expand_targetlist(result_tl, CMD_UPDATE,

Should we add a CmdType argument to adjust_and_expand_partition_tlist()
and use it instead of hard-coding CMD_UPDATE here?

Thanks,
Amit

#8Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Alvaro Herrera (#6)
Re: ON CONFLICT DO UPDATE for partitioned tables

On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused:
expand_targetlist() runs *after*, not before, so how could it have
affected the result?

If I understand correctly, planner must have called expand_targetlist()
once for the parent relation's descriptor and added any dropped columns
from the parent relation. So we may not find mapped attributes for those
dropped columns in the parent. I haven't actually tested this case though.

I wonder if it makes sense to actually avoid expanding on-conflict-set
targetlist in case the target is a partition table and deal with it during
execution, like we are doing now.

I'm obviously confused about what
expand_targetlist call this comment is talking about. Anyway I wanted
to make it use resjunk entries instead, but that broke some other case
that I didn't have time to research yesterday. I'll get back to this
soon, but in the meantime, here's what I have.

+           conv_setproj =
+
 adjust_and_expand_partition_tlist(RelationGetDescr(firstResultRel),
+                                                 RelationGetDescr(partrel),
+
 RelationGetRelationName(partrel),
+                                                 firstVarno,
+                                                 conv_setproj);

Aren't we are adding back Vars referring to the parent relation, after
having converted the existing ones to use partition relation's varno? May
be that works because missing attributes are already added during planning
and expand_targetlist() here only adds dropped columns, which are just NULL
constants.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#9Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Alvaro Herrera (#6)
Re: ON CONFLICT DO UPDATE for partitioned tables

On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

@@ -106,6 +120,9 @@ typedef struct PartitionTupleRouting
    int         num_subplan_partition_offsets;
    TupleTableSlot *partition_tuple_slot;
    TupleTableSlot *root_tuple_slot;
+   List      **partition_arbiter_indexes;
+   TupleTableSlot **partition_conflproj_slots;
+   TupleTableSlot **partition_existing_slots;
 } PartitionTupleRouting;

I am curious why you decided to add these members to PartitionTupleRouting
structure. Wouldn't ResultRelationInfo be a better place to track these or
is there some rule that we follow?

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Pavan Deolasee (#9)
Re: ON CONFLICT DO UPDATE for partitioned tables

(2018/03/16 19:43), Pavan Deolasee wrote:

On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com
<mailto:alvherre@2ndquadrant.com>> wrote:

@@ -106,6 +120,9 @@ typedef struct PartitionTupleRouting
int         num_subplan_partition_offsets;
TupleTableSlot *partition_tuple_slot;
TupleTableSlot *root_tuple_slot;
+   List      **partition_arbiter_indexes;
+   TupleTableSlot **partition_conflproj_slots;
+   TupleTableSlot **partition_existing_slots;
} PartitionTupleRouting;

I am curious why you decided to add these members to
PartitionTupleRouting structure. Wouldn't ResultRelationInfo be a better
place to track these or is there some rule that we follow?

I just started reviewing the patch, so maybe I'm missing something, but
I think it would be a good idea to have these in that structure, not in
ResultRelInfo, because these would be required only for partitions
chosen via tuple routing.

Best regards,
Etsuro Fujita

#11Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Etsuro Fujita (#10)
Re: ON CONFLICT DO UPDATE for partitioned tables

On Fri, Mar 16, 2018 at 5:13 PM, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>
wrote:

(2018/03/16 19:43), Pavan Deolasee wrote:

On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com
<mailto:alvherre@2ndquadrant.com>> wrote:

@@ -106,6 +120,9 @@ typedef struct PartitionTupleRouting

int         num_subplan_partition_offsets;
TupleTableSlot *partition_tuple_slot;
TupleTableSlot *root_tuple_slot;
+   List      **partition_arbiter_indexes;
+   TupleTableSlot **partition_conflproj_slots;
+   TupleTableSlot **partition_existing_slots;
} PartitionTupleRouting;

I am curious why you decided to add these members to

PartitionTupleRouting structure. Wouldn't ResultRelationInfo be a better
place to track these or is there some rule that we follow?

I just started reviewing the patch, so maybe I'm missing something, but I
think it would be a good idea to have these in that structure, not in
ResultRelInfo, because these would be required only for partitions chosen
via tuple routing.

Hmm, yeah, probably you're right. The reason I got confused is because the
patch uses ri_onConflictSetProj/ri_onConflictSetWhere to store the
converted projection info/where clause for a given partition in its
ResultRelationInfo. So I was wondering if we can
move mt_arbiterindexes, mt_existing and mt_conflproj to ResultRelInfo and
then just use that per-partition structures too.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#12Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Pavan Deolasee (#11)
Re: ON CONFLICT DO UPDATE for partitioned tables

Pavan Deolasee wrote:

Hmm, yeah, probably you're right. The reason I got confused is because the
patch uses ri_onConflictSetProj/ri_onConflictSetWhere to store the
converted projection info/where clause for a given partition in its
ResultRelationInfo. So I was wondering if we can
move mt_arbiterindexes, mt_existing and mt_conflproj to ResultRelInfo and
then just use that per-partition structures too.

I wonder if the proposed structure is good actually.

Some notes as I go along.

1. ModifyTableState->mt_arbiterindexes is just a copy of
ModifyTable->arbiterIndexes. So why do we need it? For an
unpartitioned relation we can just use
ModifyTableState.ps->arbiterIndexes. Now, for each partition we need to
map these indexes onto the partition indexes. Not sure where to put
these; I'm tempted to say ResultRelInfo is the place. Maybe the list
should always be in ResultRelInfo instead of the state/plan node? Not
sure.

2. We don't need mt_existing to be per-relation; a single tuple slot can
do for all partitions; we just need to ExecSlotSetDescriptor to the
partition's descriptor whenever the slot is going to be used. (This
means the slot no longer has a fixed tupdesc. That seems OK).

3. ModifyTableState->mt_conflproj is more or less the same thing; the
same TTS can be reused by all the different projections, as long as we
set the descriptor before projecting. So we don't
need PartitionTupleRouting->partition_conflproj_slots, but we do need a
descriptor to be used when needed. Let's say
PartitionTupleRouting->partition_confl_tupdesc

I'll experiment with these ideas and see where that leads me.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#12)
Re: ON CONFLICT DO UPDATE for partitioned tables

So ExecInsert receives the ModifyTableState, and separately it receives
arbiterIndexes and the OnConflictAction, both of which are members of
the passed ModifyTableState. I wonder why does it do that; wouldn't it
be simpler to extract those members from the node?

With the patch proposed upthread, we receive arbiterIndexes as a
parameter and if the table is a partition we summarily ignore those and
use the list as extracted from the PartitionRoutingInfo. This is
confusing and pointless. It seems to me that the logic ought to be "if
partition then use the list in PartitionRoutingInfo; if not partition
use it from ModifyTableState". This requires changing as per above,
i.e. make the arbiter index list not part of the ExecInsert's API.

The OnConflictAction doesn't matter much; not passing it is merely a
matter of cleanliness.

Or is there another reason to pass the index list?

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In reply to: Alvaro Herrera (#13)
Re: ON CONFLICT DO UPDATE for partitioned tables

On Fri, Mar 16, 2018 at 11:21 AM, Alvaro Herrera
<alvherre@alvh.no-ip.org> wrote:

So ExecInsert receives the ModifyTableState, and separately it receives
arbiterIndexes and the OnConflictAction, both of which are members of
the passed ModifyTableState. I wonder why does it do that; wouldn't it
be simpler to extract those members from the node?

Or is there another reason to pass the index list?

It works that way pretty much by accident, as far as I can tell.
Removing the two extra arguments sounds like a good idea.

--
Peter Geoghegan

#15Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#14)
Re: ON CONFLICT DO UPDATE for partitioned tables

Peter Geoghegan wrote:

On Fri, Mar 16, 2018 at 11:21 AM, Alvaro Herrera
<alvherre@alvh.no-ip.org> wrote:

So ExecInsert receives the ModifyTableState, and separately it receives
arbiterIndexes and the OnConflictAction, both of which are members of
the passed ModifyTableState. I wonder why does it do that; wouldn't it
be simpler to extract those members from the node?

Or is there another reason to pass the index list?

It works that way pretty much by accident, as far as I can tell.
Removing the two extra arguments sounds like a good idea.

Great, thanks.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#12)
Re: ON CONFLICT DO UPDATE for partitioned tables

Another thing I noticed is that the split of the ON CONFLICT slot and
its corresponding projection is pretty weird. The projection is in
ResultRelInfo, but the slot itself is in ModifyTableState. You can't
make the projection work without a corresponding slot initialized with
the correct descriptor, so splitting it this way doesn't make a lot of
sense to me.

(Now, TBH the split between resultRelInfo->ri_projectReturning and
ModifyTableState->ps.ps_ResultTupleSlot, which is the slot that the
returning project uses, doesn't make a lot of sense to me either; so
maybe there some reason that I'm just not seeing. But I digress.)

So I want to propose that we move the slot to be together with the
projection node that it serves, ie. we put the slot in ResultRelInfo:

typedef struct ResultRelInfo
{
...
/* for computing ON CONFLICT DO UPDATE SET */
TupleTableSlot *ri_onConflictProjSlot;
ProjectionInfo *ri_onConflictSetProj;

and with this the structure makes more sense. So ExecInitModifyTable
does this

/* create target slot for UPDATE SET projection */
tupDesc = ExecTypeFromTL((List *) node->onConflictSet,
relationDesc->tdhasoid);
resultRelInfo->ri_onConflictProjSlot =
ExecInitExtraTupleSlot(mtstate->ps.state, tupDesc);

/* build UPDATE SET projection state */
resultRelInfo->ri_onConflictSetProj =
ExecBuildProjectionInfo(node->onConflictSet, econtext,
resultRelInfo->ri_onConflictProjSlot,
&mtstate->ps, relationDesc);

and then ExecOnConflictUpdate can simply do this:

/* Project the new tuple version */
ExecProject(resultRelInfo->ri_onConflictSetProj);

/* Execute UPDATE with projection */
*returning = ExecUpdate(mtstate, &tuple.t_self, NULL,
resultRelInfo->ri_onConflictProjSlot, planSlot,
&mtstate->mt_epqstate, mtstate->ps.state,
canSetTag);

Now, maybe there is some reason I'm missing for the on conflict slot for
the projection to be in ModifyTableState rather than resultRelInfo. But
this code passes all current tests, so I don't know what that reason
would be.

Overall, the resulting code looks simpler to me than the previous
arrangements.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#17Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#16)
Re: ON CONFLICT DO UPDATE for partitioned tables

Hi,

On 2018-03-16 18:23:44 -0300, Alvaro Herrera wrote:

Another thing I noticed is that the split of the ON CONFLICT slot and
its corresponding projection is pretty weird. The projection is in
ResultRelInfo, but the slot itself is in ModifyTableState. You can't
make the projection work without a corresponding slot initialized with
the correct descriptor, so splitting it this way doesn't make a lot of
sense to me.

(Now, TBH the split between resultRelInfo->ri_projectReturning and
ModifyTableState->ps.ps_ResultTupleSlot, which is the slot that the
returning project uses, doesn't make a lot of sense to me either; so
maybe there some reason that I'm just not seeing. But I digress.)

The projections for different child tables / child plans can look
different, therefore it can't be stored in ModifyTableState itself. No?

The slot's descriptor is changed to be "appropriate" when necessary:
if (slot->tts_tupleDescriptor != RelationGetDescr(resultRelationDesc))
ExecSetSlotDescriptor(slot, RelationGetDescr(resultRelationDesc));

So I want to propose that we move the slot to be together with the
projection node that it serves, ie. we put the slot in ResultRelInfo:

That'll mean we have a good number of additional slots in some cases. I
don't think the overhead of that is going to break the bank, but it's
worth considering.

Greetings,

Andres Freund

#18Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#17)
Re: ON CONFLICT DO UPDATE for partitioned tables

Andres Freund wrote:

Hi,

On 2018-03-16 18:23:44 -0300, Alvaro Herrera wrote:

Another thing I noticed is that the split of the ON CONFLICT slot and
its corresponding projection is pretty weird. The projection is in
ResultRelInfo, but the slot itself is in ModifyTableState. You can't
make the projection work without a corresponding slot initialized with
the correct descriptor, so splitting it this way doesn't make a lot of
sense to me.

(Now, TBH the split between resultRelInfo->ri_projectReturning and
ModifyTableState->ps.ps_ResultTupleSlot, which is the slot that the
returning project uses, doesn't make a lot of sense to me either; so
maybe there some reason that I'm just not seeing. But I digress.)

The projections for different child tables / child plans can look
different, therefore it can't be stored in ModifyTableState itself. No?

The slot's descriptor is changed to be "appropriate" when necessary:
if (slot->tts_tupleDescriptor != RelationGetDescr(resultRelationDesc))
ExecSetSlotDescriptor(slot, RelationGetDescr(resultRelationDesc));

Grumble. This stuff looks like full of cheats to me, but I won't touch
it anyway.

So I want to propose that we move the slot to be together with the
projection node that it serves, ie. we put the slot in ResultRelInfo:

That'll mean we have a good number of additional slots in some cases. I
don't think the overhead of that is going to break the bank, but it's
worth considering.

Good point.

I think what I should be doing is the same as the returning stuff: keep
a tupdesc around, and use a single slot, whose descriptor is changed
just before the projection.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#19Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#18)
Re: ON CONFLICT DO UPDATE for partitioned tables

Alvaro Herrera wrote:

I think what I should be doing is the same as the returning stuff: keep
a tupdesc around, and use a single slot, whose descriptor is changed
just before the projection.

Yes, this works, though it's ugly. Not any uglier than what's already
there, though, so I think it's okay.

The only thing that I remain unhappy about this patch is the whole
adjust_and_expand_partition_tlist() thing. I fear we may be doing
redundant and/or misplaced work. I'll look into it next week.

0001 should be pretty much ready to push -- adjustments to ExecInsert
and ModifyTableState I already mentioned.

0002 is stuff I would like to get rid of completely -- changes to
planner code so that it better supports functionality we need for
0003.

0003 is the main patch. Compared to the previous version, this one
reuses slots by switching them to different tupdescs as needed.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v4-0001-Simplify-ExecInsert-API-re.-ON-CONFLICT-data.patchtext/plain; charset=us-asciiDownload+9-12
v4-0002-Make-some-static-functions-work-on-TupleDesc-rath.patchtext/plain; charset=us-asciiDownload+36-33
v4-0003-Fix-ON-CONFLICT-to-work-with-partitioned-tables.patchtext/plain; charset=us-asciiDownload+443-86
#20Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Pavan Deolasee (#8)
Re: ON CONFLICT DO UPDATE for partitioned tables

On 2018/03/05 18:04, Pavan Deolasee wrote:

On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused:
expand_targetlist() runs *after*, not before, so how could it have
affected the result?

If I understand correctly, planner must have called expand_targetlist()
once for the parent relation's descriptor and added any dropped columns
from the parent relation. So we may not find mapped attributes for those
dropped columns in the parent. I haven't actually tested this case though.

I wonder if it makes sense to actually avoid expanding on-conflict-set
targetlist in case the target is a partition table and deal with it during
execution, like we are doing now.

I'm obviously confused about what
expand_targetlist call this comment is talking about. Anyway I wanted
to make it use resjunk entries instead, but that broke some other case
that I didn't have time to research yesterday. I'll get back to this
soon, but in the meantime, here's what I have.

+           conv_setproj =
+
adjust_and_expand_partition_tlist(RelationGetDescr(firstResultRel),
+                                                 RelationGetDescr(partrel),
+
RelationGetRelationName(partrel),
+                                                 firstVarno,
+                                                 conv_setproj);

Aren't we are adding back Vars referring to the parent relation, after
having converted the existing ones to use partition relation's varno? May
be that works because missing attributes are already added during planning
and expand_targetlist() here only adds dropped columns, which are just NULL
constants.

I think this suggestion to defer onConflictSet target list expansion to
execution time is a good one. So, in preprocess_targetlist(), we'd only
perform exapand_targetlist on the onConflictSet list if the table is not a
partitioned table. For partitioned tables, we don't know which
partition(s) will be affected, so it's useless to do the expansion there.
Instead it's better to expand in ExecInitPartitionInfo(), possibly after
changing original TargetEntry nodes to have correct resnos due to
attribute number differences (calling adjust_inherited_tlist(), etc.).

And then since we're not expanding the target list in the planner anymore,
we don't need to install those hacks in adjust_inherited_tlist() like the
patch currently does to ignore entries for dropped columns in the parent
the plan-time expansion adds.

Thanks,
Amit

#21Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Alvaro Herrera (#19)
#22Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Langote (#21)
#23Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Alvaro Herrera (#19)
#24Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#23)
#25Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Langote (#24)
#26Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#24)
#27Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Amit Langote (#25)
#28Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#26)
#29Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Amit Langote (#28)
#30Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Pavan Deolasee (#29)
#31Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#28)
#32Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#30)
#33Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#30)
#34Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Alvaro Herrera (#33)
#35Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#34)
#36Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Alvaro Herrera (#35)
#37Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Langote (#36)
#38Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Langote (#37)
#39Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#38)
#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#39)
#41Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Alvaro Herrera (#39)
#42Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Langote (#41)
#43Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#41)
#44Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#42)
#45Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Alvaro Herrera (#43)
#46Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Alvaro Herrera (#44)
#47Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#46)
#48Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Alvaro Herrera (#47)
#49Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#48)
#50Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#49)
In reply to: Robert Haas (#50)
#52Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#49)
#53Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Alvaro Herrera (#52)